
Hardware Reference

Optimal GPUs on which to run the HEAVY.AI platform include:

  • NVIDIA Tesla A100

  • NVIDIA Tesla V100 v2

  • NVIDIA Tesla V100 v1

  • NVIDIA Tesla P100

  • NVIDIA Tesla P40

  • NVIDIA Tesla T4

The following configurations are valid for systems built on any of these GPUs. For production systems, use Tesla enterprise-grade cards. Avoid mixing card types in the same system; use a consistent card model across your environment.

Primary factors to consider when choosing GPU cards are:

  • The amount of GPU RAM available on each card

  • The number of GPU cores

  • Memory bandwidth

Newer cards like the Tesla V100 have higher double-precision compute performance, which is important in geospatial analytics. The Tesla V100 models support the NVLink interconnect, which can provide a significant speed increase for some query workloads.

| GPU | Memory/GPU | Cores | Memory Bandwidth | NVLink |
| --- | --- | --- | --- | --- |
| A100 | 40 to 80 GB | 6912 | 1134 GB/sec | Yes |
| V100 v2 | 32 GB | 5120 | 900 GB/sec | Yes |
| V100 | 16 GB | 5120 | 900 GB/sec | Yes |
| P100 | 16 GB | 3584 | 732 GB/sec | Yes |
| P40 | 24 GB | 3840 | 346 GB/sec | No |
| T4 | 16 GB | 2560 | 320 GB/sec | No |

For advice on optimal GPU hardware for your particular use case, ask your HEAVY.AI sales representative.

HeavyDB Architecture

Before considering hardware details, this topic describes the HeavyDB architecture.

HeavyDB uses a hybrid compute architecture that combines GPU, CPU, and storage. The GPU and CPU form the Compute Layer, and SSD storage forms the Storage Layer.

When determining the optimal hardware, make sure to consider the storage and compute layers separately.

Loading raw data into HeavyDB ingests data onto disk, so you can load as much data as you have disk space available, allowing some overhead.

When queries are executed, HeavyDB optimizer utilizes GPU RAM first if it is available. You can view GPU RAM as an L1 cache conceptually similar to modern CPU architectures. HeavyDB attempts to cache the hot data. If GPU RAM is unavailable or filled, HeavyDB optimizer utilizes CPU RAM (L2). If both L1 and L2 are filled, query records overflow to disk (L3). To minimize latency, use SSDs for the Storage Layer.

You can run a query on a record set that spans both GPU RAM and CPU RAM. The relative performance improvement you can expect depends on whether the records all fit into L1, a mix of L1 and L2, only L2, or some combination of L1, L2, and L3.

Hot Records and Columns

The server is not limited to a particular number of hot records; you can store as much data on disk as you want. The system can also store and query records in CPU RAM, but with higher latency. The hot records are those on which you can perform zero-latency queries.

Projection-only Columns

CPU RAM

The amount of CPU RAM should equal four to eight times the amount of total available GPU memory. Each NVIDIA Tesla P40 has 24 GB of onboard RAM available, so if you determine that your application requires four NVIDIA P40 cards, you need between 4 x 24 GB x 4 (384 GB) and 4 x 24 GB x 8 (768 GB) of CPU RAM. This correlation between GPU RAM and CPU RAM exists because HeavyDB uses CPU RAM in certain operations for columns that are not filtered or aggregated.

SSD Storage

A HEAVY.AI deployment should be provisioned with enough SSD storage to reliably store the required data on disk, both in compressed format and in HEAVY.AI itself. HEAVY.AI requires 30% overhead beyond compressed data volumes. HEAVY.AI recommends drives such as the Intel® SSD DC S3610 Series, or similar, in any size that meets your requirements.

  • For maximum ingestion speed, HEAVY.AI recommends ingesting data from files stored on the HEAVY.AI instance.

  • Most public cloud environments’ default storage is too small for the data volume HEAVY.AI ingests. Estimate your storage requirements and provision accordingly.

Hardware Sizing Schedule

| GPU Count | GPU RAM (GB) (NVIDIA P40) | CPU RAM (GB) (8x GPU RAM) | "Hot" Records (L1) |
| --- | --- | --- | --- |
| 1 | 24 | 192 | 417M |
| 2 | 48 | 384 | 834M |
| 3 | 72 | 576 | 1.25B |
| 4 | 96 | 768 | 1.67B |
| 5 | 120 | 960 | 2.09B |
| 6 | 144 | 1,152 | 2.50B |
| 7 | 168 | 1,344 | 2.92B |
| 8 | 192 | 1,536 | 3.33B |
| 12 | 288 | 2,304 | 5.00B |
| 16 | 384 | 3,072 | 6.67B |
| 20 | 480 | 3,840 | 8.34B |
| 24 | 576 | 4,608 | 10.01B |
| 28 | 672 | 5,376 | 11.68B |
| 32 | 768 | 6,144 | 13.34B |
| 40 | 960 | 7,680 | 16.68B |
| 48 | 1,152 | 9,216 | 20.02B |
| 56 | 1,344 | 10,752 | 23.35B |
| 64 | 1,536 | 12,288 | 26.69B |
| 128 | 3,072 | 24,576 | 53.38B |
| 256 | 6,144 | 49,152 | 106.68B |

If you already have your data in a database, you can look at the largest fact table, get a count of those records, and compare that with this schedule.

If you have a .csv file, count the number of lines and compare that with this schedule.
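
For example, a minimal sketch of the database case: count the rows in your largest fact table and compare the result against the schedule (the table name is hypothetical; for a .csv file, count lines with any standard tool instead).

    -- Row count of the largest fact table (table name is hypothetical)
    SELECT COUNT(*) FROM flights;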

CPU Cores

HEAVY.AI uses the CPU in addition to the GPU for some database operations. GPUs are the primary performance driver; CPUs are utilized secondarily. More cores provide better performance but increase the cost. Intel CPUs with 10 cores offer good performance for the price. For example, you could configure your system with a single NVIDIA P40 GPU and two 10-core CPUs. Similarly, you can configure a server with eight P40s and two 10-core CPUs.

Suggested CPUs:

  • Intel® Xeon® E5-2650 v3 2.3GHz, 10 cores

  • Intel® Xeon® E5-2660 v3 2.6GHz, 10 cores

  • Intel® Xeon® E5-2687 v3 3.1GHz, 10 cores

  • Intel® Xeon® E5-2667 v3 3.2GHz, 8 cores

PCI Express (PCIe)

GPUs are typically connected to the motherboard using PCIe slots. The PCIe connection is based on the concept of a lane, which is a single-bit, full-duplex, high-speed serial communication channel. The most common numbers of lanes are x4, x8, and x16. The current PCIe 3.0 version with an x16 connection has a bandwidth of 16 GB/s. PCIe 2.0 bandwidth is half the PCIe 3.0 bandwidth, and PCIe 1.0 is half the PCIe 2.0 bandwidth. Use a motherboard that supports the highest bandwidth, preferably PCIe 3.0. To achieve maximum performance, the GPU and the PCIe controller should have the same version number.

The PCIe specification permits slots with different physical sizes, depending on the number of lanes connected to the slot. For example, a slot with an x1 connection uses a smaller slot, saving space on the motherboard. However, bigger slots can actually have fewer lanes than their physical designation. For example, motherboards can have x16 slots connected to x8, x4, or even x1 lanes. With bigger slots, check to see if their physical sizes correspond to the number of lanes. Additionally, some slots downgrade speeds when lanes are shared. This occurs most commonly on motherboards with two or more x16 slots. Some motherboards have only 16 lanes connecting the first two x16 slots to the PCIe controller. This means that when you install a single GPU, it has the full x16 bandwidth available, but two installed GPUs each have x8 bandwidth.

HEAVY.AI does not recommend adding GPUs to a system that is not certified to support the cards. For example, to run eight GPU cards in a machine, the BIOS must register the additional address space required for the number of cards. Other considerations include power routing, power supply rating, and air movement through the chassis and cards for temperature control.

NVLink

NVLink is a bus technology developed by NVIDIA. Compared to PCIe, NVLink offers higher bandwidth between host CPU and GPU and between the GPU processors. NVLink-enabled servers, such as the IBM S822LC Minsky server, can provide up to 160 GB/sec bidirectional bandwidth to the GPUs, a significant increase over PCIe. Because Intel does not currently support NVLink, the technology is available only on IBM Power servers. Servers like the NVIDIA-manufactured DGX-1 offer NVLink between the GPUs but not between the host and the GPUs.

System Examples

A variety of hardware manufacturers make suitable GPU systems. For more information, see the product specifications for systems such as:

  • Dell 2 GPU 2U Server

  • NVIDIA DGX Workstation

  • System 76 Ibex Pro GPU Workstation

  • HPE ProLiant DL580 Gen10 Server

  • Penguin Computers NVIDIA DGX Workstations

  • Thinkmate NVIDIA Tesla GPU Servers

  • Colfax NVIDIA DGX Workstations

The amount of data you can process with the HEAVY.AI database depends primarily on the amount of GPU RAM and CPU RAM available across the servers in your HEAVY.AI cluster. For zero-latency queries, the system caches compressed versions of the queried rows and columns into GPU RAM. This is called hot data (see Hot Records and Columns). Semi-hot data utilizes CPU RAM for certain parts of the data.

The example systems listed above show configurations that can help you configure your own system.

The Hardware Sizing Schedule refers to hot records, which are the records that you want to put into GPU RAM to get zero-lag performance when querying and interacting with the data. The schedule assumes 16 hot columns, which is the number of columns involved in the predicates or computed projections (such as column1 / column2) of any one of your queries. A 15 percent GPU RAM overhead is reserved for rendering buffering and intermediate results. If your queries involve more columns, the number of records you can put in GPU RAM decreases accordingly.

HeavyDB does not require all queried columns to be processed on the GPU. Non-aggregate projection columns, such as SELECT x, y FROM table, do not need to be processed on the GPU, so they can be stored in CPU RAM. The CPU RAM sizing assumes that up to 24 columns are used only in non-computed projections, in addition to the 16 hot columns.

The Hardware Sizing Schedule estimates the number of records you can process based on GPU RAM and CPU RAM sizes, assuming up to 16 hot columns (see Hot Records and Columns). This applies to the compute layer. For the storage layer, provision your application according to the SSD Storage guidelines.

HEAVY.AI recommends installing GPUs in motherboards that support as much PCIe bandwidth as possible. On modern Intel chipsets, each socket (CPU) offers 40 lanes, so with the correct motherboards, each GPU can receive x8 of bandwidth. All recommended example systems have motherboards designed to maximize PCIe bandwidth to the GPUs.

For an emerging alternative to PCIe, see NVLink.


Software Requirements

  • Operating Systems

    • CentOS/RHEL 7.0 or later

    • Ubuntu 20.04 or later

Ubuntu 22.04 is not currently supported.

  • Additional Components

    • OpenJDK version 8 or higher

    • EPEL

    • wget or curl

    • Kernel headers

    • Kernel development packages

    • log4j 2.15.0 or higher

  • NVIDIA hardware and software (for GPU installs only)

    • Hardware: Ampere, Turing, Volta, or Pascal series GPU cards. HEAVY.AI recommends that each GPU card in a server or distributed environment be of the same series.

    • Software:

      • NVIDIA CUDA driver version 520 and CUDA 11.8 or higher. Run nvidia-smi to determine the currently running driver version.

      • Up-to-date Vulkan drivers.

  • Supported web browsers (Enterprise Edition, Immerse). Latest stable release of:

    • Chrome

    • Firefox

    • Safari version 15.x or higher

Some features in Heavy Immerse are not supported in the Internet Explorer browser due to performance issues in IE. HEAVY.AI recommends that you use a different browser to experience the latest Immerse features.

Installing on CentOS

In this section, you will find recipes to install the HEAVY.AI platform and NVIDIA drivers using a package manager such as yum, or from a tarball.

Installing on Ubuntu

In this section, you will find recipes to install the HEAVY.AI platform and NVIDIA drivers using a package manager such as apt, or from a tarball.

Installing on Docker

In this section, you will find recipes to install the HEAVY.AI platform using Docker.

Overview

HeavyDB

The foundation of the platform is HeavyDB, an open-source, GPU-accelerated database. HeavyDB harnesses GPU processing power and returns SQL query results in milliseconds, even on tables with billions of rows. HeavyDB delivers high performance with rapid query compilation, query vectorization, and advanced memory management.

Native SQL

Geospatial Data

HeavyDB can store and query data using native Open Geospatial Consortium (OGC) types, including POINT, LINESTRING, POLYGON, and MULTIPOLYGON. With geo type support, you can query geo data at scale using special geospatial functions. Using the power of GPU processing, you can quickly and interactively calculate distances between two points and intersections between objects.
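
For example, a minimal sketch of the kind of geospatial SQL this enables; the table and column names are hypothetical, and the full list of supported functions is in the geospatial SQL reference.

    -- Distance between two point columns (in the units of the geometries' SRID)
    SELECT ST_Distance(pickup_point, dropoff_point) AS trip_distance
    FROM taxi_trips;

    -- Count points that fall inside each neighborhood polygon
    SELECT n.name, COUNT(*) AS pickups
    FROM taxi_trips t, neighborhoods n
    WHERE ST_Contains(n.boundary, t.pickup_point)
    GROUP BY n.name;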

Open Source

HeavyRender

HeavyRender works on the server side, using GPU buffer caching, graphics APIs, and a Vega-based interface to generate custom pointmaps, heatmaps, choropleths, scatterplots, and other visualizations. HEAVY.AI enables data exploration by creating and sending lightweight PNG images to the web browser, avoiding high-volume data transfers. Fast SQL queries make metadata in the visualizations appear as if the data exists on the browser side.

Network bandwidth is a bottleneck for complex chart data, so HEAVY.AI uses in-situ rendering of on-GPU query results to accelerate visual rendering. This differentiates HEAVY.AI from systems that execute queries quickly but then transfer the results to the client for rendering, which slows performance.

Geospatial Analysis

Efficient geospatial analysis requires fast data-rendering of complex shapes on a map. HEAVY.AI can import and display millions of lines or polygons on a geo chart with minimal lag time. Server-side rendering technology prevents slowdowns associated with transferring data over the network to the client. You can select location shapes down to a local level, like census tracts or building footprints, and cross-filter interactively.

Visualize with Vega

Heavy Immerse

Dashboards

Charts

Create geo charts with multiple layers of data to visualize the relationship between factors within a geographic area. Each layer represents a distinct metric overlaid on the same map. Those metrics can come from the same underlying dataset or from different ones. You can manipulate the layers in various ways, including reordering them, showing or hiding them, adjusting opacity, and adding or removing legends.

Use Multiple Sources

Heavy Immerse can visually display dozens of datasets in the same dashboard, allowing you to find multi-factor relationships that you might not otherwise consider. Each chart (or groups of charts) in a dashboard can point to a different table, and filters are applied at the dataset level. Multisource dashboards make it easier to quickly compare across datasets, without merging the underlying tables.

Streaming Data

Heavy Immerse is ideal for high-velocity data that is constantly streaming; for example, sensor, clickstream, telematics, or network data. You can see the latest data to spot anomalies and trend variances rapidly. Immerse auto-refresh automatically updates dashboards at flexible intervals that you can tailor to your use case.

Ready to Get Started?

I want to...

  • Install HEAVY.AI

  • Upgrade to the latest version

  • Configure HEAVY.AI

  • See some tutorials and demos to help get up and running

  • Learn more about charts in Heavy Immerse

  • Use HEAVY.AI in the cloud

  • See what APIs work with HEAVY.AI

  • Learn about features and resolved issues for each release

  • Know what issues and limitations to look out for

  • See answers to frequently asked questions

Installation

The CPU (no GPUs) install does not support backend rendering. For example, Pointmap and Scatterplot charts are not available. The GPU install supports all chart types.

The Open Source options do not require a license, and do not include Heavy Immerse.

Welcome to HEAVY.AI Documentation

What Will I Learn?

For Analysts

For Administrators

For Developers and Data Scientists

Release Highlights

Release 7.0

Overview

We are pleased to announce the general availability of our new backend Executor Resource Manager, with CPU/GPU parallelism and query policy controls such as executor type, memory, and time limits. We also now support CPU queries larger than available CPU memory.

This release also features the debut of a user interface for joins in Immerse (beta), supporting inner and left joins which are named and persisted in dashboards. This provides analytic and visualization access to joined columns, complementing the prior table linking function supporting cross-filtering.

Powerful machine learning (beta) and statistical methods (beta) are now available in the database, supporting high performance predictive analytics workflows. For example you can now perform clustering or run linear regression or random forest models on large datasets with interactive inferencing.

Immerse also gains a large set of dashboard refinements, including an optional ‘minimalist’ style with hidden chart titles, and an optional new text chart with full HTML and font controls.

There are several major external dependency updates in this release. With Ubuntu 18 reaching its end of life we now require Ubuntu 20.04. For similar reasons, we now support NVIDIA CUDA version 11.8, which deprecates support for Kepler GPUs. Last but not least, we are formally retiring polygon ‘render groups’ within the database, a change which is not backwards compatible. So full database backups are required as part of this upgrade.

Heavy Immerse

New Features and Improvements

  • BETA: Joins in Immerse

  • BETA: Enhanced text chart. The flag `ui/enable_new_text_chart` adds a “text2” chart type, with additional features:

    • font family (e.g. arial)

    • font sizes, line height

    • colors populated from dashboard palette

    • html table

    • undo/redo

    • separator line with styles

    • full html support

  • Added a new “minimal” style mode in which chart titles are hidden by default but appear on rollover. Controlled by feature flag `ui/minimize_chart_size` which defaults to off

  • Within the map chart editor, geo layers are now renamable.

  • Role-based access to control panel UX that previously required admin access.

HeavyML (BETA)

7.0 marks the beta release of HeavyML, a new set of capabilities to execute accelerated machine learning workflows directly from SQL.

General Capabilities and Methods

  • Named model creation is supported via a new CREATE MODEL statement (see the release notes and documentation for more details)

  • Row-wise inference (GPU-accelerated for GPU queries) can be performed via a new ML_PREDICT row-wise operator. This can be used as an Immerse custom measure and persisted into dashboards, allowing end-users to consume models without needing to know how to create or administer them.

  • An EVALUATE model function is provided to test models against metrics (such as r2).

  • Table functions are provided to access linear regression coefficients for linear regression models and variable importance scores for random forest models.

  • A new “SHOW MODELS” SQL command allows end users to determine which models are available.

  • More-detailed model metadata can be accessed by admins with SHOW MODEL DETAILS and in a new ml_models system table in the information_schema database.
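
A hedged sketch of how the statements above fit together; the table, column, and model names are hypothetical, and the exact CREATE MODEL and EVALUATE syntax should be confirmed against the HeavyML documentation.

    -- Train a named linear regression model (the first projected column is the target)
    CREATE MODEL fare_model OF TYPE LINEAR_REG AS
    SELECT fare_amount, trip_distance, passenger_count FROM taxi_trips;

    -- Row-wise inference; usable as an Immerse custom measure
    SELECT ML_PREDICT('fare_model', trip_distance, passenger_count) AS predicted_fare
    FROM taxi_trips;

    -- Test the model against a metric such as r2
    EVALUATE MODEL fare_model ON SELECT fare_amount, trip_distance, passenger_count FROM taxi_trips;

    -- List the models available to you
    SHOW MODELS;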

Regression Algorithms

  • Four regression algorithms are supported initially: linear regression, random forest regression, decision trees, and Gradient Boosted Trees (GBT).

  • Both categorical text and continuous numeric regressors/predictors are supported. Categorical inputs are automatically one-hot-encoded.

  • Continuous variable prediction is supported initially; categorical classification is planned for a later release.

Clustering Algorithms

  • Two clustering algorithms are supported in this initial release: KMeans and DBScan.

  • Clustering algorithms can be called via associated table functions (more detail can be found in the relevant documentation), and currently support continuous numeric inputs only.

Performance and Administration

  • A new Executor Resource Manager (ERM) framework is provided

  • The ERM allows for CPU queries to run fully in parallel, and one or more CPU queries to run in parallel while a GPU query is executing (parallel GPU query kernel execution is not supported yet).

  • It also allows execution of CPU queries where the input datasets do not fit into the CPU buffer pool by executing on a fragment-by-fragment basis, paging from storage.

  • The Executor Resource Manager takes into account the resources needed for each query to schedule them in the most efficient manner.

  • It is on by default, but it can be turned off using the --enable-executor-resource-mgr=0 flag, which causes query kernel execution to follow the same serial, pre-7.0 path.

HeavyRF

New Features and Improvements

A new “cell editor” is provided. This supports multi-band antennas mounted within various sites within a cell. Various antenna attributes such as horizontal and vertical falloff can be easily applied based on an extensible library of antenna types.

Vegetation and building envelope attenuation can now be directly or indirectly specified. For example, typical values can be provided as scalar constants, or clutter object-specific attributes can be derived from normal SQL cursor queries. Vegetation attenuation can be tied to measurements of canopy moisture content from remote sensing based on seasonal statistics, or for individual dates to match drive test data. Building attenuation can be driven by various known or inferred characteristics, such as from parcels databases.

The right-hand information panel has been extended to better support targeting of large numbers of buildings. This can be done directly by searching and filtering on building attributes in the HeavyRF application, such as building type or size. But it can also be combined with analyses in Immerse extending to multiple arbitrary tags. For example, a set of locations with high customer value and high potential for churn can be identified in Immerse and tagged with attributes searchable in HeavyRF.

Last but not least, the HeavyRF platform will soon be available on NVIDIA’s LaunchPad. This facilitates initial evaluation of the software by making it immediately available together with appropriate supporting GPU hardware.

Release 6.4

HEAVY.AI continues to refine and extend the data connectors ecosystem. This release features general availability of data connectors for PostgreSQL, beta Immerse connectors for Snowflake and Redshift, and SQL support for Google BigQuery and Hive (beta). These managed data connections let you use HEAVY.AI as an acceleration platform, wherever your source data lives. Scheduling and automated caching ensure that, from an end-user perspective, fast analytics are always running on the latest available data.

Immerse features four new chart types: Contour, Cross-section, Wind barb and Skew-t. While these are especially useful for atmospheric and geotechnical data visualization, Contour and Cross-section also have more general application.

Major improvements for time series analysis have been added. This includes time series comparison via window functions, and a large number of SQL window function additions and performance enhancements.

This release also includes two major architectural improvements:

  • The ability to perform cross-database queries in SQL, increasing flexibility across the board.

  • Render queries no longer block other GPU queries. In many use cases, renders can be significantly slower than other common queries. This should result in significant performance gains, particularly in map-heavy dashboards.
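
A minimal sketch of a cross-database query, assuming the convention of qualifying tables in another database with the database name; the database and table names are hypothetical.

    -- Join a table in the current database with a table in another database
    SELECT o.order_id, r.region_name
    FROM orders o
    JOIN reference_db.regions r ON o.region_id = r.region_id;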

Release 6.2

Heavy Immerse

  • Chart animation through cross filter replay, allowing controlled playback of time-based data such as weather maps or GPS tracks.

  • You can now directly export your charts and dashboards as image files.

  • New control panel enables administrators to view the configuration of the system and easily access logs and system tables.

  • HeavyConnect now provides graphical Heavy Immerse support for Redshift, Snowflake, and PostGIS connections.

  • For CPU-only systems, mapping capabilities are improved with the introduction of multilayer CPU-rendered geo.

General Analytics

  • Numerous improvements to core SQL and geoSQL capabilities.

  • Support for string to numeric, timestamp, date, and time types with the new TRY_CAST operator.

  • Explicit and implicit cast support for numeric, timestamp, date, and time types.

  • Advanced string functions facilitate extraction of data from JSON and externally encoded string formats.

  • Improvements to COUNT DISTINCT reduce memory requirements considerably in cases with very large cardinalities or highly skewed data distributions.

  • Added MULTIPOINT and MULTILINESTRING geo types.

  • Convex and concave hull operators, allowing generation of polygons from points and multipoints. For example, you could generate polygons from clusters of GPS points.

  • Syntax and performance optimizations across all geometry types, table orderings, and commonly nested functions.

  • Significant functionality extension of window functions; define windows directly in temporal terms, which is particularly important in time series with missing observations. Window frame support allows improved control at the edges of windows.
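
For example, a minimal sketch of TRY_CAST, which returns NULL instead of raising an error when a string cannot be converted; the table and column names are hypothetical.

    -- Convert a text column to an integer, yielding NULL for non-numeric values
    SELECT TRY_CAST(quantity_str AS INTEGER) AS quantity
    FROM raw_orders;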

Advanced Analytics

  • Two new functions now support direct loading of LiDAR data: tf_point_cloud_metadata quickly searches tile metadata and helps you find data to import, and tf_load_point_cloud does the actual import.

  • Network graph analytics functions have been added. These can work on networks alone, including non-geographic networks, or can find the least-cost path along a geographic network.

  • New spatial aggregation and smoothing functions. Aggregations work particularly well with LiDAR data; for example, passing through only the highest point within an area to create building or canopy height maps. Smoothing helps with noisy datasets and can reveal larger-scale patterns while minimizing visual distractions.

Release 6.1

Release 6.1.0 features more granular administrative monitoring dashboards based on logs. These have been accessible in an open format on the server side, and now they are available in Immerse, by specific dashboards, users, or queries. Intermediate and advanced SQL support continues to mature, with INSERT, window functions, and UNION ALL.

This release contains a number of user interface polish items requested by customers. Cartography now supports polygons with colorful borders and transparent fills. Table presentation has been enhanced in various ways, from alignment to zebra striping. And dashboard saving reminders have been scaled back, based on customer feedback.

The extension framework now features an enhanced “custom source” dialog, as well as new SQL commands to see installed extensions and their parameters. We introduce three new extensions. The first, tf_compute_dwell_times, reduces GPS event stream data volumes considerably while keeping relevant information. The others compute feature similarity scores and are very general.

This release also includes initial public betas of our PostgreSQL Immerse connector, and SQL support for COPY FROM ODBC database connections, making it easier to connect to your enterprise data.

Release 6.0

This release features large advances in data access, including intelligent linking to enterprise data (HeavyConnect) and support for raster geodata. SQL support includes high-performance string functions, as well as enhancements to window functions and table unions. Performance improvements are noticeable across the product, including fundamental advances in rendering, query compilation, and data transport. Our system administration tools have been expanded with a new Admin Portal, as well as additional system tables supporting detailed diagnostics. Major strides in extensibility include new charting options and a new extensions framework (beta).

Name Changes

  • Rebranded platform from OmniSci to HEAVY.AI, with OmniSciDB now HeavyDB, OmniSci Render now HeavyRender, and OmniSci Immerse now Heavy Immerse.

HeavyConnect and Data Import

  • HeavyConnect allows the HEAVY.AI platform to work seamlessly as an accelerator for data in other data lakes and data warehouses. For Release 6.0, CSV and Parquet files on local file systems and in S3 buckets can be linked or imported. Other SQL databases are also supported via ODBC (beta).

  • HeavyConnect enables users to specify a data refresh schedule, which ensures access to up-to-date data.

  • Heavy Immerse now supports import of dozens of raster data formats, including geoTIFF, geoJPEG, and PNG. HeavySQL now supports nearly any vector GIS file format.

  • Support is included for multidimensional arrays common in the sciences, including GRIB2, NetCDF, and HDF5.

  • Immerse now supports linking or import of files on the server filesystem (local or mounted). This helps prevent slow data transfers when client bandwidth is limited.

  • File globbing and filtering allow import of thousands of files at once.

Other Immerse Enhancements

  • New Gauge chart for easy visualization of key metrics relative to target thresholds.

  • New landing page and Help Center.

  • Enhanced mapping workflows with automated column picking.

SQL Enhancements

  • Support for a wide range of performant string operations using a new string dictionary translation framework, as well as the ability to on-the-fly dictionary encode none-encoded strings with a new ENCODE_TEXT operator.

  • Support for UNION ALL is now enabled by default, with significant performance improvements from the previous release (where it was beta flagged).

  • Significant functionality and performance improvements for window functions, including the ability to support expressions in PARTITION and ORDER clauses.
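
A hedged sketch of two of the SQL enhancements above; the table and column names are hypothetical.

    -- Dictionary-encode a none-encoded text column on the fly so it can be grouped
    SELECT ENCODE_TEXT(raw_comment) AS comment, COUNT(*) AS n
    FROM feedback
    GROUP BY ENCODE_TEXT(raw_comment);

    -- Window function with an expression in the PARTITION clause
    SELECT device_id,
           AVG(reading) OVER (PARTITION BY EXTRACT(YEAR FROM recorded_at)) AS yearly_avg
    FROM sensor_readings;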

Performance

  • Parallel compilation of queries and a new multi-executor shared code cache provide up to 20% throughput/concurrency gains for interactive usage scenarios.

  • 10X+ performance improvements in many cases for initial join queries via optimized Join Hash Table framework.

  • New result set recycler allows for expensive query sub-steps to be cached via the SQL hint /*+ keep_result */, which can significantly increase performance when a subquery is used across multiple queries.

  • Arrow execution endpoints now leverage the parallel execution framework, and Arrow performance has been significantly improved when high-cardinality dictionary-encoded text columns are returned

  • Introduces a novel polygon rendering algorithm that does not require pre-triangulated or pre-grouped polygons and can render dynamically generated geometry on the fly (via ST_Buffer). The new algorithm is comparable to its predecessor in terms of both performance and memory and enables optimizations and enhancements in future releases.

  • New binary transport protocol to Heavy Immerse that significantly increases performance and interactivity for large result sets
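
For example, a minimal sketch of the /*+ keep_result */ hint mentioned above, caching an expensive aggregation so that later queries built on it can reuse the result; names are hypothetical.

    -- Cache this sub-step's result set for reuse across queries
    SELECT /*+ keep_result */ region, COUNT(*) AS event_count
    FROM events
    GROUP BY region;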

System Administration

  • A new Admin Portal provides information on system resources usage and users.

  • System table support under a new information_schema database, containing 10 new system tables providing system statistics and memory and storage utilization.

Extensibility

  • New system and user-defined UDF framework (beta), comprising both row (scalar) and table (UDTF) functions, including the ability to define fast UDFs via Numba Python using the RBC framework, which are then inlined into the HeavyDB compiled query code for performant CPU and GPU execution.

  • System-provided table functions include generate_series for easy numeric series generation, tf_geo_rasterize_slope for fast geospatial binning and slope/aspect computation over elevation data, and others, with more capabilities planned for future releases.

  • Leveraging the new table function framework, a new HeavyRF module (licensed separately) includes tf_rf_prop and tf_rf_prop_max_signal table functions for fast radio frequency signal propagation analysis and visualization.

  • New Iframe chart type in Heavy Immerse to allow easier addition of custom chart types. (BETA)

Release 5.10

  • Row-level security (RLS) can be used by an administrator to apply security filtering to queries run as a user or with a role.

  • Support for import from dozens of image and raster file types, such as jpeg, png, geotiff, and ESRI grid, including remote files.

  • Significantly more performant, parallelized window functions, executing up to 10X faster than in Release 5.9.

  • Automatic use of columnar output (instead of the default row-wise output) for large projections, reducing query times by 5-10X in some cases.

  • Support for the full set of ST_Transform SRIDs supported by the GEOS/PROJ4 library.

  • Support for numerous vector GIS files (100+ formats supported by current GDAL release).

  • Support for multidimensional array import from formats common in science and meteorology.

  • Improved Table chart export to access all data represented by a Table chart.

  • Introduced dashboard-level named custom SQL.

Release 5.9

  • Significant speedup for POINT and fixed-length array imports and CTAS/ITAS, generally 5-20X faster.

  • The PNG encoding step of a render request is no longer a blocking step, providing improvement to render concurrency.

  • Adds support to hide legacy chart types from add/edit chart menu in preparation for future deprecation (defaults to off).

  • BETA - Adds custom expressions to table columns, allowing for reusable custom dimensions and measures within a single dashboard (defaults to off).

  • BETA - Adds Crosslink feature with Crosslink Panel UI, allowing crossfilters to fire across different data sources within the same dashboard (defaults to off).

  • BETA - Adds Custom SQL Source support and Custom SQL Source Manager, allowing the creation of a data source as a SQL statement (defaults to off)

Release 5.8

  • Parallel execution framework is on by default. Running with multiple executors allows parts of query evaluation, such as code generation and intermediate reductions, to be executed concurrently. Currently available for single-node deployments.

  • Spatial joins between geospatial point types using the ST_Distance operator are accelerated using the overlaps hash join framework, with speedups up to 100x compared to Release 5.7.1.

  • Significant performance gains for many query patterns through optimization of query code generation, particularly benefitting CPU queries.

  • Window functions can now be executed without a partition clause being specified (to signify a partition encompassing all rows in the table).

  • Window functions can now execute over tables with multiple fragments and/or shards.

  • Native support for ST_Transform between all UTM Zones and EPSG:4326 (Lon/Lat) and EPSG:900913 (Web Mercator).

  • ST_Equals support for geospatial columns.

  • Support for the ANSI SQL WIDTH_BUCKET operator for easier and more performant numeric binning, now also used in Immerse for all numeric histogram visualizations

  • The Vulkan backend renderer is now enabled by default. The legacy OpenGL renderer is still available as a fallback if there are blocking issues with Vulkan. You can disable the Vulkan renderer using the renderer-use-vulkan-driver=false configuration flag.

    • Vulkan provides improved performance, memory efficiency, and concurrency.

    • You are likely to see some performance and memory footprint improvements with Vulkan in Release 5.8, most significantly in multi-GPU systems.

  • Support for file path regex filter and sort order when executing the COPY FROM command.

  • New ALTER SYSTEM CLEAR commands that enable clearing CPU or GPU memory from Immerse SQL Editor or any other SQL client.
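
A hedged sketch of a few of the SQL features above; table and column names are hypothetical.

    -- Point-to-point spatial join accelerated by the overlaps hash join framework
    SELECT COUNT(*)
    FROM stores s, customers c
    WHERE ST_Distance(s.location, c.location) < 0.01;

    -- Numeric binning with WIDTH_BUCKET: 10 equal-width buckets between 0 and 500
    SELECT WIDTH_BUCKET(fare_amount, 0, 500, 10) AS bucket, COUNT(*) AS n
    FROM taxi_trips
    GROUP BY WIDTH_BUCKET(fare_amount, 0, 500, 10);

    -- Clear CPU or GPU memory from the SQL Editor or any SQL client
    ALTER SYSTEM CLEAR GPU MEMORY;
    ALTER SYSTEM CLEAR CPU MEMORY;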

Release 5.7

  • Extensive enhancements to Immerse support for parameters. Parameters can now be used in chart column selectors, chart filters, chart titles, global filters, and dashboard titles. Dashboards can have parameter widgets embedded on them, side-by-side with charts. Parameter values are visible in chart axes/labels, legends, and tooltips, and you can toggle parameter visibility.

  • In Immerse Pointmap charts, you can specify which color-by attributes always render on top, which is useful for highlighting anomalies in data.

  • Significantly faster and more accurate "lasso" tool filters geospatial data on Immerse Pointmap charts, leveraging native geospatial intersection operations.

  • Immerse 3D Pointmap chart and HTML support in text charts are available as a beta feature.

  • Airplane symbol shape has been added as a built-in mark type for the Vega rendering API.

  • Vega symbol and multi-GPU polygon renders have been made significantly faster.

  • User-interrupt of query kernels is now on by default. Queries can be interrupted using Ctrl + C in omnisql, or by calling the interrupt API.

  • Parallel executors is in public beta (set with --num-executors flag).

  • Support for APPROX_QUANTILE aggregate.

  • Support for default column values when creating a table and across all append endpoints, including COPY TO, INSERT INTO TABLE SELECT, INSERT, and binary load APIs.

  • Faster and more robust ability to return result sets in Apache Arrow format when queried from a remote client (i.e. non-IPC).

  • More performant and robust high-cardinality group-by queries.

  • ODBC driver now supports Geospatial data types.
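
A hedged sketch of two of the additions above; names are hypothetical, and the exact APPROX_QUANTILE argument order should be checked against the SQL reference.

    -- Approximate 90th percentile of a numeric column
    SELECT APPROX_QUANTILE(trip_seconds, 0.9) AS p90
    FROM rides;

    -- Default column values specified at table creation
    CREATE TABLE rides (
      ride_id BIGINT,
      city TEXT DEFAULT 'unknown',
      trip_seconds INTEGER
    );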

Release 5.6

  • Custom SQL dimensions, measures, and filters can now be parameterized in Immerse, enabling more flexible and powerful scenario analysis, projections, and comparison use cases.

  • New angle measure added to Pointmap and Scatter charts, allowing orientation data to be visualized with wedge and arrow icons.

  • Custom SQL modal with validation and column name display now enabled across all charts in Immerse.

  • Significantly faster point-in-polygon joins through a new range join hash framework.

  • Approximate Median function support.

  • INSERT and INSERT FROM SELECT now support specification of a subset of columns.

  • Automatic metadata updates and vacuuming for optimizing space usage.

  • Significantly improved OmniSciDB startup time, as well as a number of significant load and performance improvements.

  • Improvements to line and polygon stroke rendering and point/symbol rendering.

Release 5.5

  • Ability to set annotations on New Combo charts for different dimension/measure combinations.

  • New ‘Arrow-over-the-wire’ capability to deliver result sets in Apache Arrow format, with ~3x performance improvement over Thrift-based result set serialization.

  • Support for concurrent SELECT and UPDATE/DELETE queries for single-node installations

  • Initial OmniSci Render support for CPU-only query execution ("Query on CPU, render on GPU"), allowing for a wider set of deployment infrastructure choices.

  • Cap metadata stored on previous states of a table by using MAX_ROLLBACK_EPOCHS, improving performance for streaming and small batch load use cases and modulating table size on disk

Release 5.4

  • Added initial compilation support for NVIDIA Ampere GPUs.

  • Improved performance for UPDATE and DELETE queries.

  • Improved the performance of filtered group-by queries on large-cardinality string columns.

  • Added SQL function SAMPLE_RATIO, which takes a proportion between 0 and 1 as an input argument and filters rows to obtain a sampling of a dataset.

  • Added support for exporting geo data in GeoJSON format.

  • Dashboard filter functionality is expanded, and filters can be saved as views.

  • You can perform bulk actions on the dashboard list.

  • New UI Setting panel in Immerse for customizing charts.

  • Tabbed dashboards.

  • SQL Editor now handles Vega JSON requests.
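
For example, a minimal sketch of the SAMPLE_RATIO function mentioned above, used to work with roughly 10 percent of a large table; the table name is hypothetical.

    -- Filter to an approximate 10% sample of the rows
    SELECT COUNT(*)
    FROM web_events
    WHERE SAMPLE_RATIO(0.1);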

Release 5.3

  • New Combo chart type in Immerse provides increased configurability and flexibility.

  • Immerse chart-specific filters and quick filters add increased flexibility and speed.

  • Updated Immerse Filter panel provides a Simple mode and Advanced mode for viewing and creating filters.

  • On multilayer charts, layer visibility can be set by zoom level.

  • Different map charts can be synced together for pan and zoom actions, regardless of data source.

  • Array support for the Array type over JDBC.

  • SELECT DISTINCT in UNION ALL is supported. (UNION ALL is prerelease and must be explicitly enabled.)

  • Support for joins on DECIMAL types.

  • Performance improvements on CUDA GPUs, particularly Volta and Turing.

Release 5.2

  • NULL support for geospatial types, including in ALTER TABLE ADD COLUMN.

  • Ability to perform updates and deletes on temporary tables.

  • Updates to JDBC driver, including escape syntax handling for the fn keyword and added support to get table metadata.

  • Notable performance improvements, particularly for join queries, projection queries with order by and/or limit, queries with scalar subqueries, and multicolumn group-by queries.

  • Query interrupt capability improved to allow canceling long-running queries; JDBC is now also supported.

  • Database switching from within Immerse, as well as dashboard URLs that contain the database name.

  • Over 50% reduction in load times for the dashboards list initial load and search.

  • Cohort builder now supports count (# records) in aggregate filter.

  • Improved error handling and more meaningful error messages.

  • Custom logos can now be configured separately for light and dark themes.

  • Logos can be configured to deep-link to a specific URL.

Release 5.1

  • Added support for UPDATE via JOIN with a subquery in the WHERE clause.

  • Improved performance for multi-column GROUP BY queries, as well as single column GROUP BY queries with high cardinality. Performance improvement varies depending on data volume and available hardware, but most use cases can expect a 1.5 to 2x performance increase over OmniSciDB 5.0.

  • Improved support for EXISTS and NOT EXISTS subqueries.

  • Added support for LINESTRING, POLYGON, and MULTIPOLYGON in user defined functions.

  • Immerse log-ins are fully sessionized and persist across page refreshes.

  • New filter sets can be created through duplicating existing filter sets.

Release 5.0

  • The new filter panel in Immerse enables the ability to toggle filters on and off, and introduces Filter Sets to provide quick access to different sets of filters in one dashboard.

  • Immerse now supports using global and cross-filters to interactively build cohorts of interest, and the ability to apply a cohort as a dashboard filter, either within the existing filter set or in a new filter set.

  • Data Catalog, located within Data Import, is a repository of datasets that users can use to enhance existing analyses.

  • Added support for binary dump and restore of database tables.

  • Added support for compile-time registered user-defined functions in C++, and experimental support for runtime user-defined SQL functions and table functions in Python via the Remote Backend Compiler.

  • Support for some forms of correlated subqueries.

  • Support for update via subquery, to allow for updating a table based on calculations performed on another table.

  • Multistep queries that generate large, intermediate result sets now execute up to 2.5x faster by leveraging new JIT code generator for reductions and optimized columnarization of intermediate query results.

  • Frontend-rendered choropleths now support the selection of base map layers.
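
A hedged sketch of binary dump and restore, assuming the DUMP TABLE and RESTORE TABLE commands described in the SQL reference; the table name and archive path are hypothetical.

    -- Archive a table's schema and data to a file on the server
    DUMP TABLE flights TO '/backups/flights.gz';

    -- Recreate the table from the archive
    RESTORE TABLE flights FROM '/backups/flights.gz';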

Release Notes

Release notes for currently supported releases

Currently Supported Releases

As with any software upgrade, it is important to back up your data before you upgrade HEAVY.AI. In addition, we recommend testing new releases before deploying in a production environment.

Release 7.x.x - Important Information

IMPORTANT - In HeavyDB Release 7.x.x, the “render groups” mechanism, part of the previous implementation of polygon rendering, has been removed. When you upgrade to HeavyDB Release 7.x.x, all existing tables that have a POLYGON or MULTIPOLYGON geo column are automatically migrated to remove a hidden column containing "render groups" metadata.

This operation is performed on all tables in all catalogs at first startup, and the results are recorded in the INFO log.

Once a table has been migrated in this manner, it is not backwards-compatible with earlier versions of HeavyDB. If you revert to an earlier version, the table may appear to have missing columns and behavior will be undefined. Attempting to query or render the POLYGON or MULTIPOLYGON data with the earlier version may fail or cause a server crash.

As always, HEAVY.AI strongly recommends that all databases be backed up, or at the very least, that dumps be made of tables with POLYGON or MULTIPOLYGON columns using the existing HeavyDB version, before upgrading to HeavyDB Release 7.x.x.

Dumps of POLYGON and MULTIPOLYGON tables made with earlier versions can still be restored into HeavyDB Release 7.x.x. The superfluous metadata is automatically discarded. However, dumps of POLYGON and MULTIPOLYGON tables made with HeavyDB Release 7.x.x are not backwards-compatible with earlier versions.

This applies only to tables with POLYGON or MULTIPOLYGON columns. Tables that contain other geo column types (POINT, LINESTRING, etc.), or only non-geo column types, do not require migration and remain backwards-compatible with earlier releases.

For Ubuntu installations, install libncurses5 with the following command:

sudo apt install libncurses5

Release 7.2.4 - March 20, 2024

HeavyDB - Fixed Issues

  • Adds a new option for enabling or disabling the use of virtual addressing when accessing an S3 compatible endpoint for import or HeavyConnect.

  • Improves logging related to system locks.

Heavy Immerse - Fixed Issues

  • Fixes issue with SAML authentication.

Release 7.2.3 - February 5, 2024

HeavyDB - New Features and Improvements

  • Improves performance of foreign tables that are backed by Parquet files in AWS S3.

  • Improves logging related to GPU memory allocations and data transfers.

HeavyDB - Fixed Issues

  • Fixes a crash that could occur for certain query patterns with intermediate geometry projections.

  • Fixes a crash that could occur for certain query patterns containing IN operators with string function operands.

  • Fixes a crash that could occur for equi join queries that use functions as operands.

  • Fixes an intermittent error that could occur in distributed configurations when executing count distinct queries.

  • Fixes an issue where certain query patterns with LIMIT and OFFSET clauses could return wrong results.

  • Fixes a crash that could occur for certain query patterns with left joins on Common Table Expressions.

  • Fixes a crash that could occur for certain queries with window functions containing repeated window frames.

Heavy Render - Fixed Issues

  • Fixes several crashes that could occur during out-of-GPU-memory error recovery.

Heavy Immerse - Fixed Issues

  • Fixed dashboard load error when switching tabs.

  • Fixed table reference in size measure of a client-side join data source for linemap chart.

  • Fixed client-side join name reference.

Release 7.2.2 - December 15, 2023

HeavyDB - New Features and Improvements

  • Adds support for output/result set buffer allocations via the "cpu-buffer-mem-bytes" configured CPU memory buffer pool. This feature can be enabled using the "use-cpu-mem-pool-for-output-buffers" server configuration parameter.

  • Adds a "ndv-group-estimator-multiplier" server configuration parameter that determines how the number of unique groups are estimated for specific query patterns.

  • Adds "default-cpu-slab-size" and "default-gpu-slab-size" server configuration parameters that are used to determine the default slab allocation size. The default size was previously based on the "max-cpu-slab-size" and "max-gpu-slab-size" configuration parameters.

  • Improves memory utilization when querying the "dashboards" system table.

  • Improves memory utilization in certain cases where queries are retried on CPU.

  • Improves error messages that are returned for some unsupported correlated subquery use cases.

HeavyDB - Fixed Issues

  • Fixes an issue where allocations could go beyond the configured "cpu-buffer-mem-bytes" value when fetching table chunks.

  • Fixes a crash that could occur when executing concurrent sort queries.

  • Fixes a crash that could occur when invalid geometry literals are passed to ST functions.

Heavy Immerse - Fixed Issues

  • Fix for rendering a gauge chart using a parameterized source (join sources, custom sources).

Release 7.2.1 - December 4, 2023

HeavyDB - New Features and Improvements

  • Improves instrumentation around Parquet import and HeavyConnect.

HeavyDB - Fixed Issues

  • Fixes a crash that could occur for join queries that result in many bounding box overlaps.

  • Fixes a crash that could occur in certain cases for queries containing an IN operator with a subquery parameter.

  • Fixes an issue where the ST_POINTN function could return wrong results when called with negative indexes.

  • Fixes an issue where a hang could occur while parsing a complex query.

Heavy Render - Fixed Issues

  • Fixed error when setting render-mem-bytes greater than 4gb.

Heavy Immerse - Fixed Issues

  • Clamp contour interval size on the Contour Chart to prevent a modulo operation error.

  • Filter outlier values in the Contour Chart that skew color range.

  • Fixed sample ratio query ordering to address a pointmap rendering issue.

  • Fixed layer naming in the Hide Layer menu.

Release 7.2.0 - November 16, 2023

HeavyDB - New Features and Improvements

  • Adds support for URL_ENCODE, URL_DECODE, REGEXP_COUNT, and HASH string functions.

  • Enables log based system tables by default.

  • Adds support for log based system tables auto refresh behind a flag (Beta).

  • Improves the pre-flight query row count estimation process for projection queries without filters.

  • Improves the performance of the LIKE operator.
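
A hedged sketch of the new string functions listed above; column and table names are hypothetical, and exact signatures should be confirmed in the string function reference.

    -- URL-encode and decode text values
    SELECT URL_ENCODE(search_terms) AS encoded,
           URL_DECODE(encoded_query) AS decoded
    FROM web_requests;

    -- Count regular-expression matches within a string column
    SELECT REGEXP_COUNT(user_agent, 'Mobile') AS mobile_hits
    FROM web_requests;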

HeavyDB - Fixed Issues

General

  • Fixes errors that could occur when the REPLACE clause is applied to SQL DDL commands that do not support it.

  • Fixes an issue where the HeavyDB startup script could ignore command line arguments in certain cases.

  • Fixes a crash that could occur when requests were made to the detect_column_types API for Parquet files containing list columns.

  • Fixes a crash that could occur in heavysql when the \detect command is executed for Parquet files containing string list columns.

  • Fixes a crash that could occur when attempting to cast to text column types in SELECT queries.

  • Fixes a crash that could occur in certain cases where window functions were called with literal arguments.

  • Fixes a crash that could occur when executing the ENCODE_TEXT function on NULL values.

  • Fixes an issue where queries involving temporary tables could return wrong results due to incorrect cache invalidation.

Geo

  • Fixes an issue where the ST_Distance function could return wrong results when at least one of its arguments is NULL.

  • Fixes an issue where the ST_Point function could return wrong results when the "y" argument is NULL.

  • Fixes an issue where the ST_NPoints function could return wrong results for NULL geometries.

  • Fixes a crash that could occur when the ST_PointN function is called with out-of-bounds index values.

  • Fixes an issue where the ST_Intersects and ST_Contains functions could incorrectly result in loop joins based on table order.

  • Fixes an issue where the ST_Transform function could return wrong results for NULL geometries.

  • Fixes an error that could occur for tables with polygon columns created from the output of user-defined table functions.

Heavy Immerse - New Features and Improvements

  • [Beta] Geo Joins - Immerse now supports “contains” and “intersects” conditions for common geometry combinations when creating a join datasource in the no code join editor.

  • Join datasource crossfilter support: Charts that use single table data sources will now crossfilter and be crossfiltered by charts that use join data sources.

  • Layer Drawer - In layered map charts, Immerse now has a quick-to-access Layer Drawer, which provides layer toggling, reordering, renaming, opacity, and zoom visibility controls.

  • Zoom to filters - Map charts in Immerse now support “zoom to filters” functionality, either on an individual chart layer (via the Layer Drawer) or on the whole chart.

  • Image support in map rollovers - URLs pointing to images will automatically be rendered as a scaled image, with clickthrough support to the full size image.

Heavy Immerse - Fixed Issues

  • Choropleth/Line Map join datasource support - Significantly improves performance in Choropleth and Line Map charts when using join data sources. Auto aggregates measures on geometry.

  • Fixes an issue where the SQL Editor scrolls horizontally with long query strings.

Release 7.1.2 - October 4, 2023

HeavyDB - New Features and Improvements

  • Improves how memory is allocated for the APPROX_MEDIAN aggregate function.

HeavyDB - Fixed Issues

  • Fixes a crash that could occur when the DISTINCT qualifier is specified for aggregate functions that do not support the distinct operation.

  • Fixes an issue where wrong results could be returned for queries with window functions that return null values.

  • Fixes a crash that could occur in certain cases where queries have multiple aggregate functions.

  • Fixes a crash that could occur when tables are created with invalid options.

  • Fixes a potential data race that could occur when logging cache sizes.

Release 7.1.1 - September 15, 2023

HeavyDB - New Features and Improvements

  • Adds an EXPLAIN CALCITE DETAILED command that displays more details about referenced columns in the query plan.

  • Improved logging around system memory utilization for each query.

  • Adds an option to SQLImporter for disabling logging of connection strings.

  • Adds a "gpu-code-cache-max-size-in-bytes" server configuration parameter for limiting the amount of memory that can be used by the GPU code cache.

  • Improves column name representation in Parquet validation error messages.

HeavyDB - Fixed Issues

  • Fixes a parser error that could occur for queries containing a NOT ILIKE clause.

  • Fixes a multiplication overflow error that could occur when retrying queries on CPU.

  • Fixes an issue where table dumps do not preserve quoted column names.

  • Fixes a "cannot start a transaction within a transaction" error that could occur in certain cases.

  • Fixes a crash that could occur for certain query patterns involving division by COUNT aggregation functions.

  • Removes a warning that is displayed on server startup when HeavyIQ is not configured.

  • Removes spurious warnings for CURSOR type checks when there are both cursor and scalar overloads for a user-defined table function.

Heavy Render - New Features and Improvements

  • Adds hit testing support for custom measures that reference multiple tables.

Heavy Immerse - Fixed Issues

  • Fixes SAML authentication regression in 7.1.0

  • Fixes chart export regression in 7.1.0

Release 7.1.0 - August 22, 2023

HeavyDB - New Features and Improvements

Geospatial

  • Exposes new geo overlaps function ST_INTERSECTSBOX for very fast bounding box intersection detections.

  • Adds support for the max_reject COPY FROM option when importing raster files. This ensures that imports from large multi-file raster datasets continue after minor errors, but provides adjustable notification upon major ones.

  • Adds a new ST_AsBinary (also aliased as ST_AsWKB) function that returns the Well-Known Binary (WKB) representation of geometry values. This highly efficient format is used by PostGIS and newer versions of GeoPandas.

  • Adds a new ST_AsText (also aliased as ST_AsWKT) function that returns the Well-Known Text (WKT) representation of geometry values. This is less efficient than WKB but compatible even with nonspatial databases. (See the example after this list.)

  • Adds support for loading geometry values using the load_table_binary_arrow Thrift API.

  • New version of the HeavyAI Python library with direct GeoPandas support.

  • New version of rbc-project with geo column support allowing extensions which input or output any geometric type.
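
As a minimal sketch of the new output functions (the table and column names here are hypothetical):

SELECT ST_AsText(pt_geom) AS wkt,
       ST_AsBinary(pt_geom) AS wkb
FROM my_geo_table
LIMIT 5;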

Core SQL

  • New JAROWINKLER_SIMILARITY string operator for fuzzy matching between string columns and values. This is a case-insensitive measure that accounts for character transpositions and is (slightly) sensitive to white space. (See the example after this list.)

  • New LEVENSHTEIN_DISTANCE string operator for fuzzy matching between string columns and values. This is case-insensitive and represents the number of edits needed to make two strings identical. An “edit” is an insertion, deletion, or replacement of a single character.

  • Extends the ALTER COLUMN TYPE command to support string dictionary encoding size reduction.

  • Improves the error message returned when out of bound values are inserted into FLOAT and DOUBLE columns.

  • Adds a "watchdog-max-projected-rows-per-device" server configuration parameter and query hint that determines the maximum number of rows that can be projected by each GPU and CPU device.

  • Adds a "preflight-count-query-threshold" server configuration parameter and query hint that determines the threshold at which the preflight count query optimization should be executed.

  • Optimizes memory utilization for projection queries on instances with multiple GPUs.
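
As a minimal sketch of the new fuzzy-matching operators (the customers table and the comparison literal are hypothetical):

SELECT name,
       JAROWINKLER_SIMILARITY(name, 'Jon Smith') AS similarity,
       LEVENSHTEIN_DISTANCE(name, 'Jon Smith') AS edit_distance
FROM customers
ORDER BY edit_distance
LIMIT 10;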

Predictive Modeling with HeavyML

  • Support for PCA models and PCA_PROJECT operator.

  • Support SHOW MODEL FEATURE DETAILS to show per-feature info for models, including regression coefficients and variable importance scores, if applicable.

  • Support for TRAIN_FRACTION option to specify proportion of the input data to a CREATE MODEL statement that should be trained on.

  • Support creation of models with only categorical predictors.

  • Enable categorical and numeric predictors to be specified in any order for CREATE MODEL statements and subsequent inference operations.

  • Enable Torch table functions (requires client to specify libtorch.so).

  • Add tf_torch_raster_object_detect for raster object detections (requires client to specify libtorch.so and provide trained model in torchscript format).

Extensions Framework

  • Allow Array literals as arguments to scalar UDFs

  • Support table function (UDTF) output row sizes up to 16 trillion rows

  • Adds support for Column<TextEncodingNone> and ColumnList<TextEncodingNone> table function inputs and outputs.

Performance Optimizations

  • SQL projections are now sized per GPU/CPU core instead of globally, meaning that projections become more memory efficient as the number of GPUs/CPU threads used for a query grows. In particular, various forms of in-situ rendering (for example, non-grouped pointmap renders) can scale to N times more points, or use N times less memory, when run across N GPUs, depending on the configuration.

  • Better parallelizes construction of metadata for subquery results, improving performance.

  • Enables result set caching for queries with LIMIT clauses.

  • Enables the bounding box intersection optimization for certain spatial join operators and geometry types by default.

HeavyDB - Fixed Issues

  • Fix potential crash when concatenating strings with the output of a UDF.

  • Fixes an issue where deleted rows with malformed data can prevent ALTER COLUMN TYPE command execution.

  • Fixes an error that could occur when parsing odbcinst.ini configuration files containing only one installed driver entry.

  • Fixes a table data corruption issue that could occur when the server crashes multiple times while executing write queries.

  • Fixes a crash that could occur when attempting to do a union of a string dictionary encoded text column and a none encoded text column.

  • Fixes a crash that could occur when the output of a table function is used as an argument to the strtok_to_array function.

  • Fixes a crash that could occur for queries involving projections of both geometry columns and geometry function expressions.

  • Fixes an issue where wrong results could be returned when the output of the DATE_TRUNC function is used as an argument to the count distinct function.

  • Fixes an issue where an error occurs if the COUNT_IF function is used in an arithmetic expression.

  • Fixes a crash that could occur when the WIDTH_BUCKET function is called with decimal columns.

  • Fixes an issue where the WIDTH_BUCKET function could return wrong results when called with decimal values close to the upper and lower boundary values.

  • Fixes a crash that could occur for queries with redundant projection steps in the query plan.

Heavy Render - Fixed Issues

  • Fixes a crash that could occur on multi-gpu systems while handling an out of GPU memory error.

Heavy Immerse - New Features and Improvements

  • Zoom to filters, setting map bounding box to extent of current filter set.

  • Image preview in map chart popups where image URLs are present.

Heavy Immerse - Fixed Issues

  • Fixed error thrown by choropleth chart on polygon hover.

Release 7.0.2 - June 28, 2023

HeavyDB - New Features and Improvements

  • Adds support for nested window function expressions.

  • Adds support for exception propagation from table functions.

HeavyDB - Fixed Issues

  • Fixes a crash that could occur when accessing 8-bit or 16-bit string dictionary encoded text columns on ODBC backed foreign tables.

  • Fixes unexpected GPU execution and memory allocations that could occur when executing sort queries with the CPU mode query hint.

  • Fixes an issue that could occur when inserting empty strings for geometry columns.

  • Fixes an issue that could occur when out of bounds fragment sizes are specified on table creation.

  • Fixes an issue where system dashboards could contain unexpected cached data.

  • Fixes a crash that could occur when executing aggregate functions over the result of join operations on scalar subqueries.

  • Fixes a server hang that could occur when GPU code compilation errors occur for user-defined table functions.

  • Fixes a data race that could occur when logging query plan cache size.

Heavy Render - New Features and Improvements

  • Add support for rendering 1D “terrain” cross-section overlays.

  • Rewrite 2D cross-section mesh generation as a table function.

  • Further improvements to system state logging when a render out of memory error occurs, and move it to the ERROR log for guaranteed visibility.

  • Enable auto-clear-render-mem by default for any render-vega call taking < 10 seconds.

Heavy Render - Fixed Issues

  • Render requests with 0 width or height could lead to a CHECK failure in encodePNG. Invalid image sizes now throw a non-fatal error during vega parsing.

Heavy Immerse - New Features and Improvements

  • Visualize terrain at the base of atmospheric cross sections in the Cross Section chart with the new Base Terrain chart layer type.

Heavy Immerse - Fixed Issues

  • Fixed local timezone issue with Chart Animation using cross filter replay.

Release 7.0.1 - June 8, 2023

HeavyDB - New Features and Improvements

  • Improves instrumentation around CPU and GPU memory utilization and certain crash scenarios.

HeavyDB - Fixed Issues

  • Fixes a crash that could occur for GPU executed join queries on dictionary encoded text columns with NULL values.

Heavy Render - New Features and Improvements

  • Improves instrumentation and logging related to GPU memory utilization, particularly with polygon rendering, as well as command timeout issues.

Heavy Render - Fixed Issues

  • Fix a potential segfault when a Vulkan device lost error occurs

Release 7.0.0 - May 1, 2023

HeavyDB - New Features and Improvements

IMPORTANT - In HeavyDB Release 7.0, the “render groups” mechanism, part of the previous implementation of polygon rendering, has been removed. When you upgrade to HeavyDB Release 7.0, all existing tables that have a POLYGON or MULTIPOLYGON geo column are automatically migrated to remove a hidden column containing "render groups" metadata.

This operation is performed on all tables in all catalogs at first startup, and the results are recorded in the INFO log.

Once a table has been migrated in this manner, it is not backwards-compatible with earlier versions of HeavyDB. If you revert to an earlier version, the table may appear to have missing columns and behavior will be undefined. Attempting to query or render the POLYGON or MULTIPOLYGON data with the earlier version may fail or cause a server crash.

As always, HEAVY.AI strongly recommends that all databases be backed up or, at the very least, that dumps be made of tables with POLYGON or MULTIPOLYGON columns using the existing HeavyDB version before upgrading to HeavyDB Release 7.0.

Dumps of POLYGON and MULTIPOLYGON tables made with earlier versions can still be restored into HeavyDB Release 7.0. The superfluous metadata is automatically discarded. However, dumps of POLYGON and MULTIPOLYGON tables made with HeavyDB Release 7.0 are not backwards-compatible with earlier versions.

This applies only to tables with POLYGON or MULTIPOLYGON columns. Tables that contain other geo column types (POINT, LINESTRING, etc.), or only non-geo column types, do not require migration and remain backwards-compatible with earlier releases.

For Ubuntu installations, install libncurses5 with the following command:

sudo apt install libncurses5

  • Adds a new Executor Resource Manager, enabling parallel CPU and CPU-GPU query execution and supporting CPU execution on data inputs larger than what fits in memory.

  • Adds HeavyML, a suite of machine learning capabilities accessible directly in SQL, including support for linear regression, random forest, gradient boosted trees, and decision tree regression models, and KMeans and DBScan clustering methods. (BETA)

  • Adds HeavyConnect support for MULTIPOINT and MULTILINESTRING columns.

  • Adds ALTER COLUMN TYPE support for text columns.

  • Adds a REASSIGN ALL OWNED command that allows for object ownership change across all databases.

  • Adds an option for validating POLYGON and MULTIPOLYGON columns when importing using the COPY FROM command or when using HeavyConnect.

  • Adds support for CONDITIONAL_CHANGE_EVENT window function.

  • Adds support for automatic casting of table function CURSOR arguments.

  • Adds support for Column<GeoMultiPolygon>, Column<GeoMultiLineString>, and Column<GeoMultiPoint> table function inputs and outputs.

  • Adds support for none encoded text column, geometry column, and array column projections from the right table in left join queries.

  • Adds support for literal text scalar subqueries.

  • Adds support for ST_X and ST_Y function output cast to text.

  • Improves concurrent execution of DDL and SHOW commands.

  • Improves error messaging for when the storage directory is missing.

  • Optimizes memory utilization for auto-vacuuming after delete queries.

HeavyDB - Fixed Issues

  • Fixes an issue where the root user could be deleted in certain cases.

  • Fixes an issue where staging directories for S3 import could remain when imports failed.

  • Fixes a crash that could occur when accessing the "tables" system table on instances containing tables with many columns.

  • Fixes a crash that could occur when accessing CSV and regex parsed file foreign tables that previously errored out during cache recovery.

  • Fixes an issue where dumping foreign tables would produce an empty table.

  • Fixes an intermittent crash that could occur when accessing CSV and regex parsed file foreign tables that are backed by large files.

  • Fixes a "Ran out of slots in the query output buffer" exception that could occur when using stale cached cardinality values.

  • Fixes an issue where user defined table functions are erroneously categorized as ambiguous.

  • Fixes an error that could occur when a group by clause includes an alias that matches a column name.

  • Fixes a crash that could occur on GPUs with the Pascal architecture when executing join queries with case expression projections.

  • Fixes a crash that could occur when using the LAG_IN_FRAME window function.

  • Fixes a crash that could occur when projecting geospatial columns from the tf_raster_contour_polygons table function.

  • Fixes an issue that could occur when calling window functions on encoded date columns.

  • Fixes a crash that could occur when the coalesce function is called with geospatial or array columns.

  • Fixes a crash that could occur when projecting case expressions with geospatial or array columns.

  • Fixes a crash that could occur due to rounding error when using the WIDTH_BUCKET function.

  • Fixes a crash that could occur in certain cases where left join queries are executed on GPU.

  • Fixes a crash that could occur for queries with joins on encoded date columns.

  • Fixes a crash that could occur when using the SAMPLE function on a geospatial column.

  • Fixes a crash that could occur for table functions with cursor arguments that specify no field type.

  • Fixes an issue where automatic casting does not work correctly for table function calls with ColumnList input arguments.

  • Fixes an issue where table function argument types are not correctly inferred when arithmetic operations are applied.

  • Fixes an intermittent crash that could occur for join queries due to a race condition when changing hash table layouts.

  • Fixes an out of CPU memory error that could occur when executing a query with a count distinct function call on a high cardinality column.

  • Fixes a crash that could occur when running a HeavyDB instance in read-only mode after previously executing write queries on tables.

  • Fixes an issue where the auto-vacuuming process does not immediately evict chunks that were pulled in for vacuuming.

  • Fixes a crash that could occur in certain cases when HeavyConnect is used with Parquet files containing null string values.

  • Fixes potentially inaccurate calculation of vertical attenuation from antenna patterns in HeavyRF.

Heavy Render - New Features and Improvements

  • Add support for rendering a 1d cross-section as a line

  • Package the Vulkan loader libVulkan1 alongside heavydb

Heavy Render - Fixed Issues

  • Fix a device lost error that could occur with complex polygon renders

Heavy Immerse - New Features and Improvements

  • Data source Joins as a new custom data source type. (BETA)

  • Adds improved query performance defaults for the Contour Chart.

  • Adds access to new control panel to users with role "immerse_control_panel", even if the user is not a superuser.

  • Adds custom naming of map layers.

  • Adds custom map layer limit option using flag “ui/max_map_layers” which can be set explicitly (defaults to 8) or to -1 to remove the limit.

Heavy Immerse - Fixed Issues

  • Renames role from “immerse_trial_mode” to “immerse_export_disabled” and renames corresponding flag from “ui/enable_trial_mode” to “ui/user_export_disabled”.

  • Various minor UI fixes and polishing.

  • Fixes an issue where changing parameter value causes Choropleth popup to lose selected popup columns.

  • Fixes an issue where changing parameter value causes Pointmap to lose selected popup columns.

  • Fixes an issue where building a Skew-T chart results in a blank browser page.

  • Fixes an issue where Skew-T chart did not display wind barbs.

  • Fixes an issue with default date and time formatting.

  • Fixes an issue where setting flag "ui/enable_map_exports" to false unexpectedly disabled table chart export.

  • Fixes an issue with date filter presets.

  • Fixes an issue where filters "Does Not Contain" or "Does not equal" did not work on Crosslinked Columns.

  • Fixes an issue where charts were not redrawing to show the current bounding box filter set by the Linemap chart.

Release 6.4.4 - May 2, 2023

HeavyDB - New Features and Improvements

  • Adds support for literal text scalar subqueries.

HeavyDB - Fixed Issues

  • Fixes a crash that could occur due to rounding error when using the WIDTH_BUCKET function.

  • Fixes a crash that could occur for queries with joins on encoded date columns.

  • Fixes a crash that could occur when running a HeavyDB instance in read-only mode after previously executing write queries on tables.

  • Fixes an issue where the auto-vacuuming process does not immediately evict chunks that were pulled in for vacuuming.

Heavy Immerse - Fixed Issues

  • Fixed issue where Skew-T chart would not render when nulls were used in selected data.

  • Fixed issue where wind barbs were not visible on Skew-T chart.

Release 6.4.3 - February 27, 2023

Heavy Immerse - New Features and Improvements

  • Added feature flag ui/session_create_timeout with a default value of 10000 (10 seconds) for modifying login request timeout.

Release 6.4.2 - February 15, 2023

HeavyDB - Fixed Issues

  • Fixes a crash that could occur when S3 CSV-backed foreign tables with append refreshes are refreshed multiple times.

  • Fixes a crash that could occur when foreign tables with geospatial columns are refreshed after cache evictions.

  • Fixes a crash that could occur when querying foreign tables backed by Parquet files with empty row groups.

  • Fixes an error that could occur when select queries used in ODBC foreign tables reference case sensitive column names.

  • Fixes a crash that could occur when CSV backed foreign tables with geospatial columns are refreshed without updates to the underlying CSV files.

  • Fixes a crash that could occur in heavysql when executing the \detect command with geospatial files.

  • Fixes a casting error that could occur when executing left join queries.

  • Fixes a crash that could occur when accessing the disk cache on HeavyDB servers with the read-only configuration parameter enabled.

  • Fixes an error that could occur when executing queries that project geospatial columns.

  • Fixes a crash that could occur when executing the EXTRACT function with the ISODOW date_part parameter on GPUs.

  • Fixes an error that could occur when importing CSV or Parquet files with text columns containing more than 32,767 characters into HeavyDB NONE ENCODED text columns.

Heavy Render - Fixed Issues

  • Fixes a Vulkan Device Lost error that could occur when rendering complex polygon data with thousands of polygons in a single pixel.

Release 6.4.1 - January 30, 2023

HeavyDB - New Features and Improvements

  • Optimizes result set buffer allocations for CPU group by queries.

  • Enables trimming of white spaces in quoted fields during CSV file imports, when both the trim_spaces and quoted options are set.

HeavyDB - Fixed Issues

  • Fixes an error that could occur when importing CSV files with quoted fields that are surrounded by white spaces.

  • Fixes a crash that could occur when tables are reordered for range join queries.

  • Fixes a crash that could occur for join queries with intermediate projections.

  • Fixes a crash that could occur for queries with geospatial join predicate functions that use literal parameters.

  • Fixes an issue where queries could intermittently and incorrectly return error responses.

  • Fixes an issue where queries could return incorrect results when filter push-down through joins is enabled.

  • Fixes a crash that could occur for queries with join predicates that compare string dictionary encoded and nonencoded text columns.

  • Fixes an issue where hash table optimizations could ignore the max-cacheable-hashtable-size-bytes and hashtable-cache-total-bytes server configuration parameters.

  • Fixes an issue where sharded table join queries that are executed on multiple GPUs could return incorrect results.

  • Fixes a crash that could occur when sharded table join queries are executed on multiple GPUs with the from-table-reordering server configuration parameter enabled.

Heavy Immerse - New Features and Improvements

  • Multilayer support for Contour and Windbarb charts.

  • Support custom SQL measures in Contour charts.

Heavy Immerse - Fixed Issues

  • Allow MULTILINESTRING to be used in selectors for Linemap charts.

  • Allow MULTILINESTRING to be used in Immerse SQL Editor.

Release 6.4.0 - December 16, 2022

This release features general availability of data connectors for PostgreSQL, beta Immerse connectors for Snowflake and Redshift, and SQL support for Google BigQuery and Hive (beta). These managed data connections let you use HEAVY.AI as an acceleration platform wherever your source data may live. Scheduling and automated caching ensure that fast analytics are always running on the latest available data.

Immerse features four new chart types: Contour, Cross-section, Wind barb, and Skew-t. While especially useful for atmospheric and geotechnical data visualization, Contour and Cross-section also have more general application.

Major improvements for time series analysis have been added. This includes an Immerse user interface for time series, and a large number of SQL window function additions and performance enhancements.

The release also includes two major architectural improvements:

  • The ability to perform cross-database queries, both in SQL and in Immerse, increasing flexibility across the board. For example, you can now easily build an Immerse dashboard showing system usage combined with business data. You might also make a read-only database of data shared across a set of users.

  • Render queries no longer block other GPU queries. In many use cases, renders can be significantly slower than other common queries. This should result in significant performance gains, particularly in map-heavy dashboards.

HeavyDB - New Features and Improvements

Core SQL

  • Adds support for cross database SELECT, UPDATE, and DELETE queries.

  • Support for MODE SQL aggregate.

  • Add support for strtok_to_array.

  • Support for ST_NumGeometries().

  • Support ST_TRANSFORM applied to literal geo types.

  • Enhanced query tracing ensures all child operations for a query_id are properly logged with that ID.

Data Linking and Import

  • Adds support for BigQuery and Hive HeavyConnect and import.

  • Adds support for table restore from S3 archive files.

  • Improves integer column type detection in Snowflake import/HeavyConnect data preview.

  • Adds HeavyConnect and import support for Parquet required scalar fields.

  • Improves import status error message when an invalid request is made.

Table Function Enhancements

  • Support POINT, LINESTRING, and POLYGON input and output types in table functions.

  • Support default values for scalar table function arguments.

  • Add tf_raster_contour table function to generate contours given x, y, and z arguments. This function is exposed in Immerse, but has additional capabilities available in SQL, such as supporting floating point contour intervals.

  • Return file path and file name from tf_point_cloud_metadata table function.

  • The previous length limit of 32K characters per value for none-encoded text columns has been lifted; none-encoded text values can now be up to 2^31 - 1 characters (approximately 2.1 billion characters).

  • Support array column outputs from table functions.

  • Add TEXT ENCODING DICT and Array<TEXT ENCODING DICT> type support for runtime functions/UDFs.

  • Allow transient TEXT ENCODING DICT column inputs into table functions.

Window Function Enhancements

  • Support COUNT_IF function.

  • Support SUM_IF function.

  • Support NTH_VALUE window function.

  • Support NTH_VALUE_IN_FRAME window function.

  • Support FIRST_VALUE_IN_FRAME and LAST_VALUE_IN_FRAME window functions.

  • Support CONDITIONAL_TRUE_EVENT.

  • Support ForwardFill and BackwardFill window functions to fill in missing (null) values based on previous non-null values in window.

HeavyDB - Fixed Issues

  • Fixes an issue where databases with duplicate names but different capitalization could be created.

  • Fixes an issue where raster imports could fail due to inconsistent band names.

  • Fixes an issue that could occur when DUMP/RESTORE commands were executed concurrently.

  • Fixes an issue where certain session updates do not occur when licenses are updated.

  • Fixes an issue where import/HeavyConnect data preview could return unsupported decimal types.

  • Fixes an issue where import/HeavyConnect data preview for PostgreSQL queries involving variable length columns could result in an error.

  • Fixes an issue where NULL elements in array columns with the NOT NULL constraint were not projected correctly.

  • Fixes a crash that could occur in certain scenarios where UPDATE and DELETE queries contain subqueries.

  • Fixes an issue where ingesting ODBC unsigned SQL_BIGINT into HeavyDB BIGINT columns using HeavyConnect or import could result in storage of incorrect data.

  • Fixes a crash that could occur in distributed configurations, when switching databases and accessing log based system tables with rolled off logs.

  • Fixes an error that occurred when importing Parquet files that did not contain statistics metadata.

  • Ensure query hint is propagated to subqueries.

  • Fix crash that could occur when LAG_IN_FRAME or LEAD_IN_FRAME were missing order-by or frame clause.

  • Fix bug where LAST_VALUE window function could return wrong results.

  • Fix issue where “Cannot use fast path for COUNT DISTINCT” could be reported from a count distinct operation.

  • Various bug fixes for support of VALUES() clause.

  • Improve handling of generic input expressions for window aggregate functions.

  • Fix bug where COUNT(*) and COUNT(1) over window frame could cause crash.

  • Fix wrong coordinate used for origin_y_bin in tf_raster_graph_shortest_slope_weighted_path.

  • Speed up table function binding in cases with no ColumnList arguments.

  • Support arrays of transient encoded strings into table functions.

Heavy Render - New Features and Improvements

Render queries no longer block the parallel execution queue for other queries.

Heavy Immerse - New Features and Improvements

  • The Immerse PostgreSQL connector is now generally available, and is joined by public betas of Redshift and Snowflake.

  • New chart types:

    • Contour chart. Contours can be applied to any geo point data, but are especially useful when applied to smoothly-varying pressure and elevation data. They can help reveal general patterns even in noisy primary data. Contours can be based on any point data, including that from regular raster grids like a temperature surface, or from sparse points like LiDAR data.

    • Cross-section chart. As the name suggests, this allows a new view on 2.5D or 3D datasets, where a selected data dimension is plotted on the vertical axis for a slice of geographic data. In addition to looking in profile at parts of the atmosphere in weather modeling, this can also be used to look at geological sections below terrain.

    • Representing vector force fields takes a step forward with the Wind barb plot. Wind barbs are multidimensional symbols which convey at a glance both strength and direction.

    • Skew-T is a highly specialized multidimensional chart used primarily by meteorologists. Skew-Ts are heavily used in weather modeling and can help predict, for example, where thunderstorms or dry lightning are likely to occur.

  • Initial support for window functions in Immerse, enabling time lag analysis in charts. For example, you can now plot month-over-month or quarter-over-quarter sales or web traffic volume.

  • For categorical data, in addition to supporting aggregations based on the number of unique values, MODE is now supported. This supports the creation of groups based on the most-common value.

Release 6.2.7 - November 1, 2022

HeavyDB - Fixed Issues

  • Fixed an issue where a restarted server can potentially deadlock if the first two queries are executed at the same time and use different executors.

Release 6.2.5 - October 26, 2022

HeavyDB - Fixed Issues

  • Fixed an issue where COUNT DISTINCT or APPROX_COUNT_DISTINCT, when run on a CASE statement that outputs literal strings, could cause a crash.

Release 6.2.4 - October 12, 2022

HeavyDB - Fixed Issues

  • Fixes a crash when using COUNT(*) or COUNT(1) as a window function; for example, COUNT(*) OVER (PARTITION BY x).

  • Fixes an incorrect result when using a date column as a partition key, like SUM(x) OVER (PARTITION BY DATE_COL).

  • Improves the performance of window functions when a literal expression is used as one of the input expressions of window functions like LAG(x, 1).

  • Improves query execution preparation phase by preventing redundant processing of the same nodes, especially when a complex input query is evaluated.

  • Fixes geometry type checking for range join operator that could cause a crash in some cases.

  • Resolves an issue where a query could return an incorrect result when it has many projection expressions (for example, more than 50 8-byte output expressions) and uses a window function expression.

  • Fixes an issue where the Resultset recycler ignores the server configuration size metrics.

  • Fixes a race condition where multiple catalogs could be created on initialization, resulting in possible deadlocks, server hangs, increased memory pressure, and slow performance.

Release 6.2.1 - September 27, 2022

HeavyDB - Fixed Issues

  • Fixes a crash encountered during some SQL queries when the read-only setting was enabled.

  • Fixes an issue in tf_raster_graph_shortest_slope_weighted_path table function that would lead some inputs to be incorrectly rejected.

Release 6.2.0 - September 23, 2022

In Release 6.2.0, Heavy Immerse adds animation and a control panel system. HeavyConnect now includes connectors for Redshift, Snowflake, and PostGIS. The SQL system is extended with support for casting and time-based window functions. GeoSQL gets direct LiDAR import, multipoints, and multilinestrings, as well as graph network algorithms. Other enhancements include performance improvements and reduced memory requirements across the product.

HeavyDB - New Features and Improvements

SQL Improvements

  • TRY_CAST support for string to numeric, timestamp, date, and time casts. (See the example after this list.)

  • Implicit and explicit CAST support for numeric, timestamp, date, and time to TEXT type.

  • CAST support from Timestamp(0|3|6|9) types to Time(0) type.

  • Concat (||) operator now supports multiple nonliteral inputs.

  • JSON_VALUE operator to extract fields from JSON string columns.

  • BASE64_ENCODE and BASE64_DECODE operators for BASE64 encoding/decoding of string columns.

  • POSITION operator to extract index of search string from strings.

  • Add hash-based count distinct operator to better handle case of sparse columns.
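
As a minimal sketch of the new casting support (the staging_orders table and its columns are hypothetical); TRY_CAST returns NULL instead of raising an error when a value cannot be converted:

SELECT TRY_CAST(quantity_str AS INTEGER) AS quantity,
       TRY_CAST(order_ts_str AS TIMESTAMP) AS order_ts,
       CAST(order_id AS TEXT) AS order_id_text
FROM staging_orders;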

Geospatial

  • Support MULTILINESTRING OGC geospatial type.

  • Support MULTIPOINT OGC geospatial type.

  • Support ST_NumGeometries.

  • Support ST_ConvexHull and ST_ConcaveHull.

  • Improved table reordering to maximize invocation of accelerated geo joins.

  • Support ST_POINT, ST_TRANSFORM and ST_SETSRID as expressions for probing columns in point-to-point distance joins.

  • Support accelerated overlaps hash join for ST_DWITHIN clause comparing two POINT columns.

  • Support for POLYGON to MULTIPOLYGON promotion in SQLImporter.

Window Functions

  • RANGE window function FRAME support for Time, Date, and Timestamp types.

  • Support LEAD_IN_FRAME / LAG_IN_FRAME window functions that compute LEAD / LAG in reference to a window frame.

Extension Functions

  • Add TextEncodingNone support for scalar UDF and extension functions.

  • Support array inputs and outputs to table functions.

  • Support literal interval types for UDTFs.

  • Add support for table functions range annotations for literal inputs

Performance and Control

  • Make max CPU threads configurable via a startup flag.

  • Support array types for Arrow/select_ipc endpoints.

  • Add support for query hint to control dynamic watchdog.

  • Add query hint to control Cuda block and grid size for query.

  • Adds an echo all option to heavysql that prints all executed commands and queries.

  • Improved decimal precision error messages during table creation.

HeavyConnect

  • Add support for file roll offs to HeavyConnect local and S3 file use cases.

  • Add HeavyConnect support for non-AWS S3-compatible endpoints.

Advanced Analytics

LiDAR

  • Add tf_point_cloud_metadata table function to read metadata from one or more LiDAR/point cloud files, optionally filtered by a bounding box.

  • Add tf_load_point_cloud table function to load data from one or more LiDAR/point cloud files, optionally filtered by bounding box and optionally cached in memory for subsequent queries.

Graph and Path Functions

  • Add tf_graph_shortest_path table function to compute shortest edge-weighted path between two points in a graph constructed from an input edge list

  • Add tf_graph_shortest_paths_distances table function to compute the shortest edge-weighted distances between a starting point and all other points in a graph constructed from an input edge list.

  • Add tf_grid_graph_shortest_slope_weighted_path table function to compute the shortest slope-weighted path between two points along rasterized data.

Enhanced Spatial Aggregations

  • Support configurable aggregation types for tf_geo_rasterize and tf_geo_rasterize_slope table functions, allowing for AVG, MIN, MAX, SUM, and COUNT aggregations.

  • Support two-pass gaussian blur aggregation post-processing for tf_geo_rasterize and tf_geo_rasterize_slope table functions.

RF Propagation Extension Improvements

  • Add dynamic ray splitting to tf_rf_prop_max_signal table function for improved performance and terrain coverage.

  • Add variant of tf_rf_prop_max_signal table function that takes per-RF source/tower transmission power (watts) and frequency (MHz).

  • Add variant of generate_series table function that generates series of timestamps between a start and end timestamp at specified time intervals.

Fixed Issues

  • ST_Centroid now automatically picks up SRID of underlying geometry.

  • Fixed a crash that occurred when ST_DISTANCE had an ST_POINT input for its hash table probe column.

  • Fixed an issue where a query hint would not propagate to a subquery.

  • Improved overloaded table function type deduction eliminates type mismatches when table function outputs are used downstream.

  • Properly handle cases of RF sources outside of terrain bounding box for tf_rf_prop_max_signal.

  • Fixed an issue where specification of unsupported GEOMETRY column type during table creation could lead to a crash.

  • Fixed a crash that could occur due to execution of concurrent create and drop table commands.

  • Fixed a crash that could occur when accessing the Dashboards system table.

  • Fixed a crash that could occur as a result of type mismatches in ITAS queries.

  • Fixed an issue that could occur due to band name sanitization during raster imports.

  • Fixed a memory leak that could occur when dropping temporary tables.

  • Fixed a crash that could occur due to concurrent execution of a select query and long-running write query on the same table.

Heavy Render - New Features and Improvements

  • Disables render group assignment by default.

  • Supports rendering of MULTILINESTRING geometries.

  • Memory footprint required for compositing renders on multi-GPU systems is significantly reduced. Any multi-GPU system will see improvements, but the change is most noticeable on systems with 4 or more GPUs. For example, rendering a 1400 x 1400 image saves ~450 MB of memory when using 8 GPUs for a query. Multi-GPU system configurations should be able to set the res-gpu-mem configuration flag lower as a result, freeing memory for other subsystems.

  • Adds INFO logging of peak render memory usage for the lifetime of the server process. The render memory logged is peak render query output buffer size (controlled with the render-mem-bytes configuration flag) and peak render buffer usage (controlled with the res-gpu-mem configuration flag). These peaks are logged in the INFO log on server shutdown, when GPU memory is cleared via clear_gpu_memory endpoint, or when a new peak is reached. These logged peaks can be useful to adjust the render-mem-bytes and res-gpu-mem configuration flags to improve memory utilization by avoiding reserving memory that might go unused. Examples of the log messages:

    • When a new peak render-mem-bytes is reached: New peak render buffer usage (render-mem-bytes):37206200 of 1000000000

    • When a new peak res-gpu-mem is reached: New peak render memory usage (res-gpu-mem): 166033024

    • Peaks logged on server shutdown or on clear_gpu_memory:

      Render memory peak utilization:
      Query result buffer (render-mem-bytes): 37206200 of 1000000000
      Images and buffers (res-gpu-mem): 660330240
      Total allocated: 1660330240

Heavy Render - Fixed Issues

  • Fixed an issue that occurred when trying to hit-test a multiline SQL expression.

Heavy Immerse - New Features and Improvements

  • Dashboard and chart image export

  • Crossfilter replay

  • Improved popup support in the base 3D chart

  • New Multilayer CPU rendered Geo charts: Pointmap, Linemap, and Choropleth (Beta)

  • Control Panel (Beta)

  • Redshift, Snowflake, and PostGIS HeavyConnect support (Beta)

  • Skew-T chart (Beta)

  • Support for limiting the number of charts in a dashboard through the ui/limit_charts_per_dashboard feature flag. The default value is 0 (no limit).

Heavy Immerse - Fixed Issues

  • Fixed duplicate column names importer error.

  • Various bug fixes and user-interface improvements.

Install NVIDIA Drivers and Vulkan on CentOS/RHEL

Install Prerequisites

Install the Extra Packages for Enterprise Linux (EPEL) repository and other packages before installing NVIDIA drivers.

For CentOS, use yum to install the epel-release package.

sudo yum install epel-release

Use the following install command for RHEL.

sudo yum install https://dl.fedoraproject.org/pub/epel/epel-release-latest-7.noarch.rpm
sudo yum upgrade kernel
sudo reboot now

Install Kernel Headers

Install kernel headers and development packages:

sudo yum install kernel-devel-$(uname -r) kernel-headers-$(uname -r)

If installing kernel headers does not work correctly, follow these steps instead:

  1. Identify the Linux kernel you are using by issuing the uname -r command.

  2. Use the name of the kernel (3.10.0-862.11.6.el7.x86_64 in the following code example) to install kernel headers and development packages:

sudo yum install \
kernel-devel-3.10.0-862.11.6.el7.x86_64 \
kernel-headers-3.10.0-862.11.6.el7.x86_64

Install the dependencies and extra packages:

sudo yum install kernel-devel kernel-headers pciutils dkms

Install NVIDIA Drivers and Vulkan

Although using the NVIDIA website is more time consuming and less automated, you are assured that the driver is certified for your GPU. Use this method if you are not sure which driver to install. If you prefer a more automated method and are confident that the driver is certified, you can use the package-manager method.

Install NVIDIA Drivers Using the NVIDIA Website

If you do not know the GPU model installed on your system, run this command:

lspci -v | egrep "3D|VGA*.NVIDIA" | awk -F '\[|\]' ' { print $2 } '

The output shows the product type, series, and model. In this example, the product type is Tesla, the series is T (as Turing), and the model is T4.

Tesla T4
  1. Select the product type shown after running the command above.

  2. Select the correct product series and model for your installation.

  3. In the Operating System dropdown list, select Linux 64-bit.

  4. In the CUDA Toolkit dropdown list, click a supported version (11.4 or higher).

  5. Click Search.

  6. On the resulting page, verify the download information and click Download.

Move the downloaded file to the server, change the permissions, and run the installation.

chmod +x NVIDIA-Linux-x86_64-*.run
sudo ./NVIDIA-Linux-x86_64-*.run

You might receive the following error during installation:

ERROR: The Nouveau kernel driver is currently in use by your system. This driver is incompatible with the NVIDIA driver, and must be disabled before proceeding. Please consult the NVIDIA driver README and your Linux distribution's documentation for details on how to correctly disable the Nouveau kernel driver.

If you receive this error, blacklist the Nouveau driver by editing the /etc/modprobe.d/blacklist-nouveau.conf file, adding the following lines at the end:

blacklist nouveau
blacklist lbm-nouveau
options nouveau modeset=0
alias nouveau off
alias lbm-nouveau off

Install NVIDIA Drivers Using Yum

Install a specific version of the driver for your GPU by installing the NVIDIA repository and using the yum package manager.

Add the NVIDIA network repository to your system.

sudo yum-config-manager --add-repo \
https://developer.download.nvidia.com/compute/cuda/repos/rhel7/x86_64/cuda-rhel7.repo

List the available driver versions for download:

yum list nvidia-driver-branch-[0-9][0-9][0-9].x86_64

Available Packages
nvidia-driver-branch-418.x86_64    3:418.226.00-1.el7     cuda-rhel7-x86_64 
nvidia-driver-branch-440.x86_64    3:440.118.02-1.el7     cuda-rhel7-x86_64 
nvidia-driver-branch-450.x86_64    3:450.191.01-1.el7     cuda-rhel7-x86_64 
nvidia-driver-branch-455.x86_64    3:455.45.01-1.el7      cuda-rhel7-x86_64 
nvidia-driver-branch-460.x86_64    3:460.106.00-1.el7     cuda-rhel7-x86_64 
nvidia-driver-branch-465.x86_64    3:465.19.01-1.el7      cuda-rhel7-x86_64 
nvidia-driver-branch-495.x86_64    3:495.29.05-1.el7      cuda-rhel7-x86_64 
nvidia-driver-branch-510.x86_64    3:510.73.08-1.el7      cuda-rhel7-x86_64 
nvidia-driver-branch-515.x86_64    3:515.43.04-1.el7      cuda-rhel7-x86_64 

Install the driver version needed with yum.

sudo yum install nvidia-driver-branch-<version>.x86_64

Reboot your system to ensure that the new version of the driver is loaded.

sudo reboot

Check NVIDIA Driver Installation

Run nvidia-smi to verify that your drivers are installed correctly and recognize the GPUs in your environment. The output should list each GPU in the system along with the installed driver and CUDA versions, confirming that your NVIDIA GPUs and drivers are present.

If you see an error like the following, the NVIDIA drivers are probably installed incorrectly:

NVIDIA-SMI has failed because it couldn't communicate with the NVIDIA driver. 
Make sure that the latest NVIDIA driver is installed and running.

Install Vulkan

To work correctly, the back-end renderer requires a Vulkan-enabled driver and the Vulkan library. Without these components, the database cannot start unless the back-end renderer is disabled.

Install the Vulkan library and its dependencies using yum on both CentOS and RHEL.

sudo yum install vulkan

If installing on RHEL, you must obtain and install the vulkan-filesystem package manually. Perform these additional steps:

  1. Download the rpm file

    wget http://mirror.centos.org/centos/7/os/x86_64/Packages/vulkan-filesystem-1.1.97.0-1.el7.noarch.rpm
  2. Install the rpm file

    sudo rpm --install vulkan-filesystem-1.1.97.0-1.el7.noarch.rpm

You might see a warning similar to the following:

warning: cuda-repo-rhel7-10.0.130-1.x86_64.rpm: Header V3 RSA/SHA512 Signature, key ID 7fa2af80: NOKEY

Install CUDA Toolkit ᴼᴾᵀᴵᴼᴺᴬᴸ

You must install the CUDA Toolkit if you use advanced features like C++ User-Defined Functions or User-Defined Table Functions to extend the database capabilities.

  1. Add the NVIDIA network repository to your system:

sudo yum-config-manager --add-repo \
https://developer.download.nvidia.com/compute/cuda/repos/rhel7/x86_64/cuda-rhel7.repo

2. List the available CUDA Toolkit versions:

yum list cuda-toolkit-* | egrep -v config

Available Packages
cuda-toolkit-10-0.x86_64                     10.0.130-1        cuda-rhel7-x86_64
cuda-toolkit-10-1.x86_64                     10.1.243-1        cuda-rhel7-x86_64
cuda-toolkit-10-2.x86_64                     10.2.89-1         cuda-rhel7-x86_64
cuda-toolkit-11-0.x86_64                     11.0.3-1          cuda-rhel7-x86_64
cuda-toolkit-11-1.x86_64                     11.1.1-1          cuda-rhel7-x86_64
cuda-toolkit-11-2.x86_64                     11.2.2-1          cuda-rhel7-x86_64
cuda-toolkit-11-3.x86_64                     11.3.1-1          cuda-rhel7-x86_64
cuda-toolkit-11-4.x86_64                     11.4.4-1          cuda-rhel7-x86_64
cuda-toolkit-11-5.x86_64                     11.5.2-1          cuda-rhel7-x86_64
cuda-toolkit-11-6.x86_64                     11.6.2-1          cuda-rhel7-x86_64

3. Install the CUDA Toolkit using yum:

sudo yum install cuda-toolkit-<version>.x86_64

4. Check that everything is working correctly:

nvcc --version

nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2020 NVIDIA Corporation
Built on Mon_Nov_30_19:08:53_PST_2020
Cuda compilation tools, release 11.2, V11.2.67
Build cuda_11.2.r11.2/compiler.29373293_0

HEAVY.AI Installation on CentOS/RHEL

This is an end-to-end recipe for installing HEAVY.AI on a CentOS/RHEL 7 machine using CPU and GPU devices.

The order of these instructions is significant. To avoid problems, install each component in the order presented.

Assumptions

These instructions assume the following:

  • You are installing on a “clean” CentOS/RHEL 7 host machine with only the operating system installed.

  • Your HEAVY.AI host only runs the daemons and services required to support HEAVY.AI.

  • Your HEAVY.AI host is connected to the Internet.

Preparation

Prepare your CentOS/RHEL machine by updating your system and optionally enabling or configuring a firewall.

Update and Reboot

Update the entire system and reboot the system if needed.

sudo yum update
sudo reboot

Install the utilities needed to create HEAVY.AI repositories and download archives

sudo yum install yum-utils curl

JDK

  1. Open a terminal on the host machine.

  2. Install the headless JDK using the following command:

sudo yum install java-1.8.0-openjdk-headless

Create the HEAVY.AI User

Create a group called heavyai and a user named heavyai, who will own HEAVY.AI software and data on the file system.

You can create the group, user, and home directory using the useradd command with the --user-group and --create-home switches:

sudo useradd --user-group --create-home --groups wheel heavyai

Set a password for the user:

sudo passwd heavyai

Log in with the newly created user:

sudo su - heavyai

Installation

Install HEAVY.AI using yum or a tarball.

Installation using the yum package manager is recommended for those who want a more automated install and upgrade procedure.

Install Nvidia Drivers ᴳᴾᵁ ᴼᴾᵀᴵᴼᴺ

If your system includes NVIDIA GPUs, but the drivers are not installed, install them now.

See Install NVIDIA Drivers and Vulkan on CentOS/RHEL for details.

Installing with Yum

Create a yum repository for the edition (Enterprise, Free, or Open Source) and execution device (GPU or CPU) you are going to use. Run only the command that matches your installation:

sudo yum-config-manager --add-repo \
https://releases.heavy.ai/ee/yum/stable/cuda
sudo yum-config-manager --add-repo \
https://releases.heavy.ai/ee/yum/stable/cpu
sudo yum-config-manager --add-repo \
https://releases.heavy.ai/os/yum/stable/cuda
sudo yum-config-manager --add-repo \
https://releases.heavy.ai/os/yum/stable/cpu

Add the GPG-key to the newly added repository.

sudo yum-config-manager --save \
--setopt="releases.heavy*.gpgkey=https://releases.heavy.ai/GPG-KEY-heavyai"

Use yum to install the latest version of HEAVY.AI.

sudo yum install heavyai.x86_64

If you need to install a specific version of HEAVY.AI, for example because you are upgrading from OmniSci, run the following commands:

hai_version="6.0.0"
sudo yum install heavyai-$(yum --showduplicates list heavyai.x86_64 | \
grep $hai_version | tr -s " " | cut -f 2 -d ' ').x86_64

Installing with a Tarball

First create the installation directory.

sudo mkdir /opt/heavyai && sudo chown $USER /opt/heavyai

Download the archive and install the latest version of the software. A different archive is used depending on the edition (Enterprise, Free, or Open Source) and the device used at runtime; run only the command that matches your installation.

curl \
https://releases.heavy.ai/ee/tar/heavyai-ee-latest-Linux-x86_64-render.tar.gz \
| sudo tar zxf - --strip-components=1 -C /opt/heavyai
curl \
https://releases.heavy.ai/ee/tar/heavyai-ee-latest-Linux-x86_64-cpu.tar.gz \
| sudo tar zxf - --strip-components=1 -C /opt/heavyai
curl \
https://releases.heavy.ai/os/tar/heavyai-os-latest-Linux-x86_64.tar.gz \
| sudo tar zxf - --strip-components=1 -C /opt/heavyai
curl \
https://releases.heavy.ai/os/tar/heavyai-os-latest-Linux-x86_64-cpu.tar.gz \
| sudo tar zxf - --strip-components=1 -C /opt/heavyai

Configuration

Follow these steps to prepare your HEAVY.AI environment.

Set Environment Variables

For your convenience, you can update .bashrc with these environment variables

echo "# HEAVY.AI variable and paths
export HEAVYAI_PATH=/opt/heavyai
export HEAVYAI_BASE=/var/lib/heavyai
export HEAVYAI_LOG=$HEAVYAI_BASE/storage/log
export PATH=$HEAVYAI_PATH/bin:$PATH" \
>> ~/.bashrc
source ~/.bashrc

Although this step is optional, you will find references to the HEAVYAI_BASE and HEAVYAI_PATH variables in this documentation. These variables contain, respectively, the path where configuration, license, and data files are stored and the path where the software is installed. Setting them is strongly recommended.

Initialization

Run the systemd installer to initialize the HEAVY.AI services and the database storage.

cd $HEAVYAI_PATH/systemd
./install_heavy_systemd.sh

Accept the default values provided or make changes as needed.

The script creates a data directory in $HEAVYAI_BASE/storage (typically /var/lib/heavyai) with the directories catalogs, data, export and log. The directory import is created when you insert data the first time. If you are a HeavyDB administrator, the log directory is of particular interest.

Activation

Note that Heavy Immerse is not available in the OS Edition, so if you are running the OS Edition, the systemctl commands for heavy_web_server have no effect.

Enable the automatic startup of the service at reboot and start the HEAVY.AI services.

sudo systemctl enable heavydb --now
sudo systemctl enable heavy_web_server --now

Configure the Firewall ᴼᴾᵀᴵᴼᴺᴬᴸ

If a firewall is not already installed and you want to harden your system, install and start firewalld.

sudo yum install firewalld
sudo systemctl start firewalld
sudo systemctl enable firewalld
sudo systemctl status firewalld

To use Heavy Immerse or other third-party tools, you must prepare your host machine to accept incoming HTTP(S) connections. Configure your firewall for external access:

sudo firewall-cmd --zone=public --add-port=6273-6274/tcp --add-port=6278/tcp --permanent
sudo firewall-cmd --reload

Most cloud providers use a different mechanism for firewall configuration. The commands above might not run in cloud deployments.

Licensing HEAVY.AI ᵉᵉ⁻ᶠʳᵉᵉ ᵒⁿˡʸ

  1. Open a terminal window.

  2. Enter cd ~/ to go to your home directory.

  3. Open .bashrc in a text editor. For example, vi .bashrc.

  4. Edit the .bashrc file. Add the following export commands under “User specific aliases and functions.”

  5. Save the .bashrc file. For example, in vi, press Esc and then enter :x!

  6. Open a new terminal window to use your changes.

  7. Connect to Heavy Immerse using a web browser connected to your host machine on port 6273. For example, http://heavyai.mycompany.com:6273.

  8. When prompted, paste your license key in the text box and click Apply.

  9. Log into Heavy Immerse by entering the default username (admin) and password (HyperInteractive), and then click Connect.

The $HEAVYAI_BASE directory must be dedicated to HEAVY.AI; do not set it to a directory shared by other packages.

Final Checks

Load Sample Data and Run a Simple Query

HEAVY.AI ships with sample datasets: airline flight information collected in 2008 (available in two sizes) and a 2015 census of New York City trees. To install sample data, run the following command.

cd $HEAVYAI_PATH
sudo ./insert_sample_data --data /var/lib/heavyai/storage
#     Enter dataset number to download, or 'q' to quit:
Dataset           Rows    Table Name          File Name
1)    Flights (2008)    7M      flights_2008_7M     flights_2008_7M.tar.gz
2)    Flights (2008)    10k     flights_2008_10k    flights_2008_10k.tar.gz
3)    NYC Tree Census (2015)    683k    nyc_trees_2015_683k    nyc_trees_2015_683k.tar.gz

Connect to HeavyDB by entering the following command in a terminal on the host machine (default password is HyperInteractive):

$HEAVYAI_PATH/bin/heavysql -p HyperInteractive

Enter a SQL query such as the following:

SELECT origin_city AS "Origin", 
dest_city AS "Destination", 
AVG(airtime) AS "Average Airtime" 
FROM flights_2008_10k WHERE distance < 175 
GROUP BY origin_city, dest_city;

The results should be similar to the results below.

Origin|Destination|Average Airtime
Austin|Houston|33.055556
Norfolk|Baltimore|36.071429
Ft. Myers|Orlando|28.666667
Orlando|Ft. Myers|32.583333
Houston|Austin|29.611111
Baltimore|Norfolk|31.714286

After installing Enterprise or Free Edition, check if Heavy Immerse is running as intended.

  1. Connect to Heavy Immerse using a web browser connected to your host machine on port 6273. For example, http://heavyai.mycompany.com:6273.

  2. Log into Heavy Immerse by entering the default username (admin) and password (HyperInteractive), and then click Connect.

Create a new dashboard and a Scatter Plot to verify that backend rendering is working.

  1. Click New Dashboard.

  2. Click Add Chart.

  3. Click SCATTER.

  4. Click Add Data Source.

  5. Choose the flights_2008_10k table as the data source.

  6. Click X Axis +Add Measure.

  7. Choose depdelay.

  8. Click Y Axis +Add Measure.

  9. Choose arrdelay.

  10. Click Size +Add Measure.

  11. Choose airtime.

  12. Click Color +Add Measure.

  13. Choose dest_state.

The resulting chart shows, unsurprisingly, that there is a correlation between departure delay and arrival delay.

Create a new dashboard and a Bubble chart to verify that Heavy Immerse is working.

  1. Click New Dashboard.

  2. Click Add Chart.

  3. Click Bubble.

  4. Click Select Data Source.

  5. Choose the flights_2008_10k table as the data source.

  6. Click Add Dimension.

  7. Choose carrier_name.

  8. Click Add Measure.

  9. Choose depdelay.

  10. Click Add Measure.

  11. Choose arrdelay.

  12. Click Add Measure.

  13. Choose #Records.

The resulting chart shows, unsurprisingly, that the average departure delay is also correlated with the average arrival delay, although there are notable differences between carriers.

¹ In the OS Edition, Heavy Immerse is unavailable.

² The OS Edition does not require a license key.

Free Version

HEAVY.AI Free is a full-featured version of the HEAVY.AI platform available at no cost for non-hosted commercial use.

To get started with HEAVY.AI Free:

  1. On the Get HEAVY.AI Free page, enter your email address and click I Agree.

  2. Open the HEAVY.AI Free Edition Activation Link email that you receive from HEAVY.AI, and click Click Here to view and download the free edition license. You will need this license to run HEAVY.AI after you install it. A copy of the license is also sent to your email.

Add Users

You can create additional HEAVY.AI users to collaborate with.

  1. Connect to Immerse using a web browser connected to your host machine on port 6273. For example, http://heavyai.mycompany.com:6273.

Install NVIDIA Drivers and Vulkan on Ubuntu

Installation Prerequisites

Upgrade the system and the kernel, then reboot the machine if needed.
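
A typical sequence, mirroring the CentOS/RHEL instructions earlier in this guide (adjust to your environment):

sudo apt update
sudo apt upgrade
sudo reboot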

Install Kernel Headers

Install kernel headers and development packages.

Install the extra packages.

Installing Vulkan Library

The rendering engine of HEAVY.AI (present in Enterprise Editions) requires a Vulkan-enabled driver and the Vulkan library. Without these components, the database itself may not be able to start.

Install the Vulkan library and its dependencies using apt.

Installing NVIDIA Drivers

Installing NVIDIA drivers with support for the CUDA platform is required to run GPU-enabled versions of HEAVY.AI.

You can install NVIDIA drivers in multiple ways; three options are outlined below. If you are unsure which to choose, we recommend Option 1.

Keep a record of the installation method you use; upgrading NVIDIA drivers later requires the same method.

What is CUDA? What is the CUDA toolkit?

The CUDA Toolkit from NVIDIA provides everything you need to develop GPU-accelerated applications. The CUDA Toolkit includes GPU-accelerated libraries, a compiler, development tools, and the CUDA runtime. The CUDA Toolkit is not required to run HEAVY.AI, but you must install it if you use advanced features like C++ User-Defined Functions or User-Defined Table Functions to extend the database capabilities.

Option 1: Install NVIDIA Drivers with CUDA Toolkit from NVIDIA Website

The minimum CUDA version supported by HEAVY.AI is 11.4. We recommend using a release that has been available for at least two months.

In the "Target Platform" section, follow these steps:

  1. For "Operating System" select Linux

  2. For "Architecture" select x86_64

  3. For "Distribution" select Ubuntu

  4. For "Version" select the version of your operating system (20.04)

  5. For "Installer Type" choose deb (network) **

  6. One by one, run the presented commands in the Installer Instructions section on your server.

** You may optionally use any of the "Installer Type" options available.

If you choose the .run file option, before running the installer you must manually install build-essential using apt and change the permissions of the downloaded .run file to allow execution.

Option 2: Install NVIDIA Drivers via .run file using the NVIDIA Website

If you don't know the exact GPU model in your system, run this command:

The output is in the format Product Type, Series, and Model.

In this example, the Product Type is Tesla, the Series is T (for Turing), and the Model is T4.

  1. Select the Product Type as the one you got with the command.

  2. Select the correct Product Series and Product for your installation.

  3. In the Operating System dropdown list, select Linux 64-bit.

  4. In the CUDA Toolkit dropdown list, click a supported version (11.4 or higher).

  5. Click Search.

  6. On the resulting page, verify the download information and click Download

  7. On the subsequent page, if you agree to the terms, right click on "Agree and Download" and select "Copy Link Address". You may also manually download and transfer to your server, skipping the next step.

  8. On your server, type wget and paste the URL you copied in the previous step. Press enter to download.

Install the tools needed for installation.

Change the permissions of the downloaded .run file to allow execution, and run the installation.

Option 3: Install NVIDIA drivers using APT

Install a specific version of the driver for your GPU by installing the NVIDIA repository and using the apt package manager.

Run the following command to get a list of the available driver versions:

Install the driver version needed with apt

NVIDIA Driver Post-Installation steps

Reboot your system to ensure the new version of the driver is loaded

Verify Successful NVIDIA driver installation

Run nvidia-smi to verify that your drivers are installed correctly and recognize the GPUs in your environment. Depending on your environment, you should see something like this to confirm that your NVIDIA GPUs and drivers are present.

If you see an error like the following, the NVIDIA drivers are probably installed incorrectly:

Review the installation instructions, specifically checking for completion of install prerequisites, and correct any errors.

Install Vulkan library

The rendering engine of HEAVY.AI requires a Vulkan-enabled driver and the Vulkan library. Without these components, the database cannot start unless the back-end renderer is disabled.

Install the Vulkan library and its dependencies using apt.

Advanced Installation

You must install the CUDA toolkit and Clang if you use advanced features like C++ User-Defined Functions or User-Defined Table Functions to extend the database capabilities.

Install CUDA Toolkit ᴼᴾᵀᴵᴼᴺᴬᴸ

Install the NVIDIA public repository GPG key.

Add the repository.

List the available Cuda toolkit versions.

Install the CUDA toolkit using apt.

Verification

Check that everything is working and the toolkit has been installed.

Install Clang ᴼᴾᵀᴵᴼᴺᴬᴸ

You must install Clang if you use advanced features like C++ User-Defined Functions or User-Defined Table Functions to extend the database capabilities. Install Clang and LLVM dependencies using apt.

Verification

Check that the software is installed and in the execution path.

HEAVY.AI Installation using Docker on Ubuntu

Follow these steps to install HEAVY.AI as a Docker container on a machine running on CPU only or with supported NVIDIA GPU cards, using Ubuntu as the host OS.

Preparation

Prepare your host by installing Docker and, if needed for your configuration, the NVIDIA drivers and NVIDIA container runtime.

Install Docker

Remove any existing Docker installs and, if on GPU, the legacy NVIDIA Docker runtime.

Use curl to add Docker's GPG key.

Add the Docker repository to your APT sources.

Update your repository.

Install Docker, the command line interface, and the container runtime.

Run the following usermod command so that docker commands do not require sudo privileges (recommended). Log out and log back in for the change to take effect.

Verify your Docker installation.

Install NVIDIA Drivers and NVIDIA Container ᴳᴾᵁ ᴼᴾᵀᴵᴼᴺ

Install NVIDIA Docker Runtime

Use curl to add NVIDIA's GPG key:

Update your sources list:

Update apt-get and install nvidia-container-runtime:

Edit /etc/docker/daemon.json to add the following, and save the changes:

Restart the Docker daemon:

Check Nvidia Drivers

Verify that docker and NVIDIA runtime work together.

If everything is working you should get the output of nvidia-smi command showing the installed GPUs in the system.

HEAVY.AI Installation

Create a directory to store data and configuration files

Then create a minimal configuration file for the Docker installation.

Ensure that you have sufficient storage on the drive you chose for your storage directory by running this command:

Download HEAVY.AI from DockerHub and Start HEAVY.AI in Docker. Select the tab depending on the Edition (Enterprise, Free, or Open Source) and execution Device (GPU or CPU) you are going to use.

Check that the container is up and running with a docker ps command:

You should see an output similar to the following.

Configure Firewall ᴼᴾᵀᴵᴼᴺᴬᴸ

If a firewall is not already installed and you want to harden your system, install ufw.

To use Heavy Immerse or other third-party tools, you must prepare your host machine to accept incoming HTTP(S) connections. Configure your firewall for external access.

Most cloud providers use a different mechanism for firewall configuration. The commands above might not run in cloud deployments.

Licensing HEAVY.AI ᵉᵉ⁻ᶠʳᵉᵉ ᵒⁿˡʸ

  1. Connect to Heavy Immerse using a web browser to your host on port 6273. For example, http://heavyai.mycompany.com:6273.

  2. When prompted, paste your license key in the text box and click Apply.

  3. Log into Heavy Immerse by entering the default username (admin) and password (HyperInteractive), and then click Connect.

Command-Line Access

You can access the command line in the Docker image to perform configuration and run HEAVY.AI utilities.

You need to know the container-id to access the command line. Use the command below to list the running containers.

You see output similar to the following.

Once you have your container ID, in the example 9e01e520c30c, you can access the command line using the Docker exec command. For example, here is the command to start a Bash session in the Docker instance listed above. The -it switch makes the session interactive.

You can end the Bash session with the exit command.

Final Checks

Load Sample Data and Run a Simple Query

HEAVY.AI ships with two sample datasets of airline flight information collected in 2008, and a census of New York City trees. To install sample data, run the following command.

Where <container-id> is the container in which HEAVY.AI is running.

When prompted, choose whether to insert dataset 1 (7,000,000 rows), dataset 2 (10,000 rows), or dataset 3 (683,000 rows). The examples below use dataset 2.

Connect to HeavyDB by entering the following command (you are prompted for a password; the default password is HyperInteractive):

Enter a SQL query such as the following:

The results should be similar to the results below.

After installing Enterprise or Free Edition, check that Heavy Immerse is running as intended.

  1. Connect to Heavy Immerse using a web browser connected to your host machine on port 6273. For example, http://heavyai.mycompany.com:6273.

  2. Log into Heavy Immerse by entering the default username (admin) and password (HyperInteractive), and then click Connect.

Create a new dashboard and a Scatter Plot to verify that backend rendering is working.

  1. Click New Dashboard.

  2. Click Add Chart.

  3. Click SCATTER.

  4. Click Add Data Source.

  5. Choose the flights_2008_10k table as the data source.

  6. Click X Axis +Add Measure.

  7. Choose depdelay.

  8. Click Y Axis +Add Measure.

  9. Choose arrdelay.

  10. Click Size +Add Measure.

  11. Choose airtime.

  12. Click Color +Add Measure.

  13. Choose dest_state.

The resulting chart shows, unsurprisingly, that there is a correlation between departure delay and arrival delay.

Create a new dashboard and a Bubble chart to verify that Heavy Immerse is working.

  1. Click New Dashboard.

  2. Click Add Chart.

  3. Click Bubble.

  4. Click Select Data Source.

  5. Choose the flights_2008_10k table as the data source.

  6. Click Add Dimension.

  7. Choose carrier_name.

  8. Click Add Measure.

  9. Choose depdelay.

  10. Click Add Measure.

  11. Choose arrdelay.

  12. Click Add Measure.

  13. Choose #Records.

The resulting chart shows, unsurprisingly, that average departure delay is correlated with average arrival delay, and that there are notable differences between carriers.

¹ In the OS Edition, Heavy Immerse Service is unavailable.

² The OS Edition does not require a license key.

Getting Started on AWS

Getting Started with AWS AMI

You can use the HEAVY.AI AWS AMI (Amazon Web Services Amazon Machine Image) to try HeavyDB and Heavy Immerse in the cloud. Perform visual analytics with the included New York Taxi database, or import and explore your own data.

Many options are available when deploying an AWS AMI. These instructions skip to the specific tasks you must perform to deploy a sample environment.

Prerequisite

You need a security key pair when you launch your HEAVY.AI instance. If you do not have one, create one before you continue.

  1. Go to the EC2 Dashboard.

  2. Select Key Pairs under Network & Security.

  3. Click Create Key Pair.

  4. Enter a name for your key pair. For example, MyKey.

  5. Click Create. The key pair PEM file downloads to your local machine. For example, you would find MyKey.pem in your Downloads directory.

Launching Your Instance

  1. Click Continue to Subscribe to subscribe.

  2. Read the Terms and Conditions, and then click Continue to Configuration.

  3. Select the Fulfillment Option, Software Version, and Region.

  4. Click Continue to Launch.

  5. On the Launch this software page, select Launch through EC2, and then click Launch.

  6. From the Choose an Instance Type page, select an available EC2 instance type, and click Review and Launch.

  7. Review the instance launch details, and click Launch.

  8. Select a key pair, or click Create a key pair to create a new key pair and download it, and then click Launch Instances.

  9. On the Launch Status page, click the instance name to see it on your EC2 Dashboard Instances page.

Using HEAVY.AI Immerse on Your AWS Instance

To connect to Heavy Immerse, you need your Public IP address and Instance ID for the instance you created. You can find these values on the Description tab for your instance.

To connect to Heavy Immerse:

  1. Point your Internet browser to the public IP address for your instance, on port 6273. For example, for public IP 54.83.211.182, you would use the URL https://54.83.211.182:6273.

    1. Enter the USERNAME (admin), PASSWORD ({Instance ID}), and DATABASE (heavyai). If you are using the BYOL version, enter your license key in the key field and click Apply.

  2. Click Connect.

  3. On the Dashboards page, click NYC Taxi Rides. Explore and filter the chart information on the NYC Taxis Dashboard.

Importing Your Own Data

Working with your own familiar dataset makes it easier to see the advantages of HEAVY.AI processing speed and data visualization.

To import your own data to Heavy Immerse:

  1. Export your data from your current datastore as a comma-separated value (CSV) or tab-separated value (TSV) file. HEAVY.AI supports Latin-1 ASCII format and UTF-8. If you want to load data with another encoding (for example, UTF-16), convert the data to UTF-8 before loading it to HEAVY.AI.

  2. Point your Internet browser to the public IP address for your instance, on port 6273. For example, for public IP 54.83.211.182, you would use the URL https://54.83.211.182:6273.

  3. Enter the USERNAME (admin) and PASSWORD ({Instance ID}). If you are using the BYOL version, enter your license key in the key field and click Apply.

  4. Click Connect.

  5. Click Data Manager, and then click Import Data.

  6. Drag your data file onto the table importer page, or use the directory selector.

  7. Click Import Files.

  8. Verify the column names and datatypes. Edit them if needed.

  9. Enter a Name for your table.

  10. Click Save Table.

  11. Click Connect to Table.

  12. On the New Dashboard page, click Add Chart.

  13. Choose a chart type.

  14. Add dimensions and measures as required.

  15. Click Apply.

  16. Enter a Name for your dashboard.

  17. Click Save.

Accessing Your HEAVY.AI Instance Using SSH

  1. Open a terminal window.

  2. Locate your private key file (for example, MyKey.pem). The wizard automatically detects the key you used to launch the instance.

  3. Your key must not be publicly viewable for SSH to work. Change its permissions if needed; an example command is shown after this list.

  4. Connect to your instance using its Public DNS. The default user name is centos or ubuntu, depending on the version you are using. For example:

  5. Use the following command to run the heavysql SQL command-line utility on HeavyDB. The default user is admin and the default password is { Instance ID }:
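For step 3, a typical way to restrict the key file permissions, assuming the key file is MyKey.pem as in the earlier example:

chmod 400 MyKey.pem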

Upgrading HEAVY.AI

This section provides a recipe for upgrading between fully compatible product versions.

As with any software upgrade, it is important that you back up your data before upgrading. Each release introduces efficiencies that are not necessarily compatible with earlier releases of the platform; HEAVY.AI is not guaranteed to be backward compatible.

Back up the contents of your $HEAVYAI_STORAGE directory.
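For example, a simple offline backup with tar while the HEAVY.AI services are stopped, assuming the default storage location /var/lib/heavyai and a /backup_dir destination:

tar zcvf /backup_dir/heavyai_storage_backup.tar.gz /var/lib/heavyai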

Upgrading from Omnisci

If you need to upgrade from Omnisci to HEAVY.AI 6.0 or later, please refer to the specific recipe.

Direct upgrades from OmniSci to HEAVY.AI versions later than 6.0 are not supported.

Upgrading Using Docker

To upgrade HEAVY.AI in place in Docker

In a terminal window, get the Docker container ID.

You should see output similar to the following. The first entry is the container ID. In this example, it is 9e01e520c30c:

Stop the HEAVY.AI Docker container. For example:

Optionally, remove the HEAVY.AI Docker container. This removes unused Docker containers on your system and saves disk space.

Back up the OmniSci data directory (typically /var/lib/omnisci).

Download the latest version of the HEAVY.AI Docker image for the Edition and device you are currently running. Select the tab depending on the Edition (Enterprise, Free, or Open Source) and execution device (GPU or CPU) you are upgrading.

If you don't want to upgrade to the latest version but to a specific version, replace the latest tag with the version needed.

For example, if the version needed is 6.0, use v6.0.0 as the version tag in the image name:

heavyai/heavyai-ee-cuda:v6.0.0

Check that the container is up and running with a docker ps command:

You should see an output similar to the following.

This runs both the HEAVY.AI database and Immerse in the same container.

You can optionally add --rm to the Docker run command so that the container is removed when it is stopped.

Upgrading HEAVY.AI Using Package Managers and Tarball

Use the following steps to upgrade an existing system installed with package managers or tarball. The commands upgrade HEAVY.AI in place without disturbing your configuration or stored data.

Stop the HEAVY.AI services.

Back up your $HEAVYAI_STORAGE directory (the default location is /var/lib/heavyai).

Run the appropriate set of commands depending on the method used to install the previous version of the software.

Make a backup of your actual installation

When the upgrade is complete, start the HEAVY.AI services.

Getting Started on Azure

Getting Started with HEAVY.AI on Microsoft Azure

Follow these instructions to get started with HEAVY.AI on Microsoft Azure.

Prerequisites

Configure Your HEAVY.AI Instance

To launch HEAVY.AI on Microsoft Azure, you configure a GPU-enabled instance.

1) Log in to your Microsoft Azure portal.

2) On the left side menu, create a Resource group, or use one that your organization has created.

3) On the left side menu, click Virtual machines, and then click Add.

4) Create your virtual machine:

  • On the Basics tab:

    • In Project Details, specify the Resource group.

    • Specify the Instance Details:

      • Virtual machine name

      • Region

      • Image (Ubuntu 16.04 or higher, or CentOS/RHEL 7.0 or higher)

      • Size. Click Change size and use the Family filter to filter on GPU, based on your use case and requirements. Not all GPU VM variants are available in all regions.

    • For Username, add any user name other than admin.

    • In Inbound Port Rules, click Allow selected ports and select one or more of the following:

      • HTTP (80)

      • HTTPS (443)

      • SSH (22)

  • On the Disks tab, select Premium or Standard SSD, depending on your needs.

  • For the rest of the tabs and sections, use the default values.

5) Click Review + create. Azure reviews your entries, creates the required services, deploys them, and starts the VM.

6) Once the VM is running, select the VM you just created and click the Networking tab.

7) Click the Add inbound button and configure security rules to allow any source, any destination, and destination port 6273 so you can access Heavy Immerse from a browser on that port. Consider renaming the rule to 6273-Immerse or something similar so that the default name makes sense.

8) Click Add and verify that your new rule appears.

Getting Started on GCP

Getting Started with HEAVY.AI on Google Cloud Platform

Follow these instructions to get started with HEAVY.AI on Google Cloud Platform (GCP).

Prerequisites

To launch HEAVY.AI on Google Cloud Platform, you select and configure an instance.

Launching Your HEAVY.AI Instance

On the solution Launcher Page, click Launch on Compute Engine to begin configuring your deployment.

To launch HEAVY.AI on Google Cloud Platform, you select and configure a GPU-enabled instance.

  1. On the solution Launcher Page, click Launch to begin configuring your deployment.

  2. On the new deployment page, configure the following:

    • Deployment name

    • Zone

    • Machine type - Click Customize and configure Cores and Memory, and select Extend memory if necessary.

    • GPU type. (Not applicable for CPU configurations.)

    • Boot disk type

    • Boot disk size in GB

    • Networking - Set the Network, Subnetwork, and External IP.

    • Firewall - Select the required ports to allow TCP-based connectivity to HEAVY.AI. Click More to set IP ranges for port traffic and IP forwarding.

  3. Accept the GCP Marketplace Terms of Service and click Deploy.

  4. In the Deployment Manager, click the instance that you deployed.

  5. Launch the Heavy Immerse client:

    • Record the Admin password (Temporary).

    • Click the Site address link to go to the Heavy Immerse login page. Enter the password you recorded, and click Connect.

    • Connect to Immerse using a web browser connected to your host machine on port 6273. For example, http://heavyai.mycompany.com:6273.

    • When prompted, paste your license key in the text box and click Apply.

    • Click Connect to start using HEAVY.AI.

    On successful login, you see a list of sample dashboards loaded into your instance.

Upgrading

In this section, you will find recipes to upgrade from the OmniSci to the HEAVY.AI platform and upgrade between versions of the HEAVY.AI platform.

Supported Upgrade Path

The following table shows the steps needed to move from one version to a later one.

Versions 5.x and 6.0.0 are no longer supported; use these only as needed to facilitate an upgrade to a supported version.

Example: if you are running an OmniSci version older than 5.5, you must first upgrade to 5.5, then upgrade to 6.0 and after that upgrade to 7.0. If you are running 6.0 - 6.4, you can upgrade directly to 7.0 in a single step.

Getting Started on Kubernetes (BETA)

Using HEAVY.AI's Helm Chart on Kubernetes

This documentation outlines how to use HEAVY.AI’s Helm Chart within a Kubernetes environment. It assumes the user is a network administrator within your organization and is an experienced Kubernetes administrator. This is not a beginner guide and does not instruct on Kubernetes installation or administration. It is quite possible you will require additional manifest files for your environment.

Overview

The HEAVY.AI Helm Chart is a template of how to configure deployment of the HEAVY.AI platform. The following files need to be updated/created to reflect the customer's deployment environment.

  • values.yml

  • <customer_created>-pv.yml

  • <customer_created>-pvc.yml

Once the files are updated/created, follow the installation instructions below to install the Helm Chart into your Kubernetes environment.

Where to get the Helm Chart?

What’s included?

How to install?

  1. Before installing, create a PV/PVC that the deployment will use. Save these files in the regular PVC/PV location used in the customer’s environment. Reference the README.pdf file found in the Helm Chart under templates and the example PV/PVC manifests in the misc folder in the helm chart. The PVC name is then provided to the helm install command.

  2. In your current directory, copy the values.yml file from the HEAVY.AI Helm Chart and customize it for your needs.

  3. Run the helm install command with the desired deployment name and Helm Chart.

    1. When using a values.yml file:

      $ helm install heavyai --values values.yml heavyaihelmchart-1.0.0.tgz

    2. When not using a values.yml file:

      If you only need to change a value or two from the default values.yml file you can use --set instead of a custom values.yml file.

      For example:

      $ helm install heavyai --set pvcName=MyPVCName heavyaihelmchart-1.0.0.tgz

How to uninstall?

To uninstall the helm installed HEAVY.AI instance:

$ helm uninstall heavyai

The PVC and PV space defined for the HEAVY.AI instance is not removed. The retained space must be manually deleted.
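For example, using the names from the example manifests below (heavyai-pvc and heavyai-pv in the heavyai namespace), the retained space could be removed with kubectl; adjust the names to match your environment:

kubectl delete pvc heavyai-pvc -n heavyai
kubectl delete pv heavyai-pv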

Example: values.yml

Example: example-heavyai-pvc.yml

Example: example-heavyai-pv.yml

HEAVY.AI is an analytics platform designed to handle very large datasets. It leverages the processing power of GPUs alongside traditional CPUs to achieve very high performance. HEAVY.AI combines an open-source SQL engine (HeavyDB), server-side rendering (HeavyRender), and web-based data visualization (Heavy Immerse) to provide a comprehensive platform for data analysis.

With native , HeavyDB returns query results hundreds of times faster than CPU-only analytical database platforms. Use your existing SQL knowledge to query data. You can use the standalone SQL engine with the command line, or the SQL editor that is part of the visual analytics interface. Your SQL query results can output to Heavy Immerse or to third-party software such as Birst, Power BI, Qlik, or Tableau.

HeavyDB is open source and encourages contribution and innovation from a global community of users. It is under the Apache 2.0 license, along with components like a Python interface (heavyai) and JavaScript infrastructure (mapd-connector, mapd-charting), making HEAVY.AI the leader in open-source analytics.

Complex server-side visualizations are specified using an adaptation of the Vega Visualization Grammar. Heavy Immerse generates Vega rendering specifications behind the scenes; however, you can also generate custom visualizations using the same API. This customizable visualization system combines the agility of a lightweight frontend with the power of a GPU engine.

Heavy Immerse is a web-based data visualization interface that uses HeavyDB and HeavyRender for visual interaction. Intuitive and easy to use, Heavy Immerse provides standard visualizations, such as line, bar, and pie charts, as well as complex data visualizations, such as geo point maps, geo heat maps, choropleths, and scatter plots. Heavy Immerse provides quick insights and makes them easy to recognize.

Use dashboards to create and organize your charts. Dashboards automatically cross-filter when interacting with data, and refresh with zero latency. You can create dashboards and interact with conventional charts and data tables, as well as scatterplots and geo charts created by HeavyRender. You can also create your own queries in the SQL editor.

Heavy Immerse lets you create a variety of different chart types. You can display pointmaps, heatmaps, and choropleths alongside non-geographic charts, graphs, and tables. When you zoom into any map, visualizations refresh immediately to show data filtered by that geographic context. Multiple sources of geographic data can be rendered as different layers on the same map, making it easy to find the spatial relationships between them.

You can download HEAVY.AI for your preferred platform from https://www.heavy.ai/platform/downloads/.

Use of HEAVY.AI is subject to the terms of the HEAVY.AI End User License Agreement (EULA).

Learn how to use Heavy Immerse to gain new insights to your data with fast, responsive graphics and SQL queries.

Learn how to install and configure your HEAVY.AI instance, then load data for analysis.

Learn how to extend HEAVY.AI with an integrated data science foundation and custom charts and interfaces. Contribute to the HEAVY.AI Core Open Source project.

For more complete release information, see the Release Notes.

SQL SHOW commands: SHOW TABLES, SHOW DATABASES, SHOW CREATE TABLE, and SHOW USER SESSIONS.

Completely overhauled SQL Editor, including query formatting, snippets, history, and more.

Initial support for TEMPORARY (that is, non-persistent) tables.

Pie chart now supports "All Others" and percentage labels.

Cohorts can now be built with aggregation-based filters.

Dashboard URLs now link to individual filter sets.

To see these new features in action, please watch this video from Converge 2019, where Rachel Wang demonstrates how you can use them.


Use of HEAVY.AI is subject to the terms of the HEAVY.AI End User License Agreement (EULA).

The latest release of HEAVY.AI is 7.2.4.

Release notes are available for: 7.2.4 | 7.2.3 | 7.2.2 | 7.2.1 | 7.2.0 | 7.1.2 | 7.1.1 | 7.1.0 | 7.0.2 | 7.0.1 | 7.0.0 | 6.4.4 | 6.4.3 | 6.4.2 | 6.4.1 | 6.4.0 | 6.2.7 | 6.2.5 | 6.2.4 | 6.2.1 | 6.2.0

For release notes for releases that are no longer supported, as well as links to documentation for those releases, see Archived Release Notes.

For assistance during the upgrade process, contact HEAVY.AI Support by logging a request through the HEAVY.AI Support Portal.

Adds the enable-foreign-table-scheduled-refresh HeavyDB server configuration parameter for enabling or disabling automated foreign table scheduled refreshes.

Enable Contour charts by default (feature flag: ).

Restrict export from Heavy Immerse by enabling trial mode (feature flag: ). Trial mode enables a super user to restrict export capabilities for users who have the immerse_trial_mode role.

RHEL-based distributions require Dynamic Kernel Module Support (DKMS) to build the GPU driver kernel modules. For more information, see . Upgrade the kernel and restart the machine.

CUDA is a parallel computing platform and application programming interface (API) model. It uses a CUDA-enabled graphics processing unit (GPU) for general-purpose processing. The CUDA platform provides direct access to the GPU virtual instruction set and parallel computation elements. For more information on CUDA unrelated to installing HEAVY.AI, see https://developer.nvidia.com/cuda-zone. You can install drivers in multiple ways. This section provides installation information using the NVIDIA website or using yum.

Install the CUDA package for your platform and operating system according to the instructions on the NVIDIA website (https://developer.nvidia.com/cuda-downloads).

Please check that the driver version you are downloading meets the HEAVY.AI minimum requirements.

When installing the driver, ensure that your GPU model is supported and meets the HEAVY.AI minimum requirements.

Review the section and correct any errors.

Ignore it now; you can verify NVIDIA driver installation .

For more information about troubleshooting Vulkan, see the Vulkan Renderer section.

Follow these instructions to install a headless JDK and configure an environment variable with a path to the library. The “headless” Java Development Kit does not provide support for keyboard, mouse, or display systems. It has fewer dependencies and is best suited for a server host. For more information, see .

Start and use HeavyDB and Heavy Immerse.

For more information, see .

If you are on Enterprise or Free Edition, you need to validate your HEAVY.AI instance with your license key. You can skip this section if you are using Open Source Edition.

Copy your license key from the registration email message. If you have not received your license key, contact your Sales Representative or register for your 30-day trial .

To verify that everything is working, load some sample data, perform a heavysql query, and generate a Pointmap using Heavy Immerse.

Create a Dashboard Using Heavy Immerse ᵉᵉ⁻ᶠʳᵉᵉ ᵒⁿˡʸ

Go to the , and in the HEAVY.AI Free section, click Get Free License.

In the What's Next section, click to select the best version of HEAVY.AI for your hardware and software configuration. Follow the instructions for the download or cloud version you choose.

, using the instructions for your platform.

Verify that OmniSci is working correctly by following the instructions in the Checkpoint section at the end of the installation instructions. For example, the Checkpoint instructions for the CentOS CPU with Tarball installation is .

Open the .

Use the CREATE USER command to create a new user. For information on syntax and options, see .

For more information about troubleshooting Vulkan, see the Vulkan Renderer section.

Option 1: Install NVIDIA drivers with CUDA toolkit from NVIDIA Website

Option 2: Install NVIDIA drivers via .run file using the NVIDIA Website

Option 3: Install NVIDIA drivers using APT package manager

CUDA is a parallel computing platform and application programming interface (API) model. It uses a CUDA-enabled graphics processing unit (GPU) for general-purpose processing. The CUDA platform provides direct access to the GPU virtual instruction set and parallel computation elements. For more information on CUDA unrelated to installing HEAVY.AI, see https://developer.nvidia.com/cuda-zone.

Open https://developer.nvidia.com/cuda-toolkit-archive and select the desired CUDA Toolkit version to install.

Install the CUDA package for your platform and operating system according to the instructions on the NVIDIA website ().

Please check that the driver version you are downloading meets the HEAVY.AI minimum requirements.

Be careful when choosing the driver version to install. Ensure that your GPU model is supported and meets the HEAVY.AI minimum requirements.

For more information about troubleshooting Vulkan, see the Vulkan Renderer section.

If you installed NVIDIA drivers using Option 1 above, the CUDA toolkit is already installed; you may proceed to the verification step below.

For more information, see C++ .

For more information on Docker installation, see the .

Install NVIDIA driver and Cuda Toolkit using

See also the note regarding the CUDA JIT Cache in Optimizing Performance.

For more information, see .

If you are on Enterprise or Free Edition, you need to validate your HEAVY.AI instance using your license key. You can skip this section if you are on Open Source Edition.

Copy your license key of Enterprise or Free Edition from the registration email message. If you don't have a license and you want to evaluate HEAVY.AI in an enterprise environment, contact your Sales Representative or register for your 30-day trial of Enterprise Edition . If you need a Free License you can get one .

To verify that everything is working, load some sample data, perform a heavysql query, and generate a Scatter Plot or a Bubble Chart using Heavy Immerse

Create a Dashboard Using Heavy Immerse ᵉᵉ⁻ᶠʳᵉᵉ ᵒⁿˡʸ

Go to the and select the version you want to use. You can get overview information about the product, see pricing, and get usage and support information.

If you receive an error message stating that the connection is not private, follow the prompts onscreen to click through to the unsecured website. To secure your site, see .

For more information on Heavy Immerse features, see .

For more information, see .

Follow these instructions to connect to your instance using SSH from MacOS or Linux. For information on connecting from Windows, see .

For more information, see .

See also the note regarding the CUDA JIT Cache in Optimizing Performance.

Download and Install the latest version following the install documentation for your Operative System and

You must have a Microsoft Azure account. If you do not have an account, go to the Microsoft Azure home page to sign up for one.

Azure-specific configuration is complete. Now, follow the standard HEAVY.AI installation instructions for your Linux distribution and installation method.

You must have a Google Cloud Platform account. If you do not have an account, follow to sign up for one.

Before deploying a solution with a GPU machine type, avoid potential deployment failure by checking your available quota for the project to make sure that you have not exceeded your limit.

Search for HEAVY.AI on the , and select a solution. HEAVY.AI has four instance types:

  • HEAVY.AI Enterprise Edition (BYOL)

  • HEAVY.AI Enterprise Edition for CPU (BYOL)

  • HEAVY.AI Open Source Edition

  • HEAVY.AI for CPU (Open Source)

Number of GPUs - (Not applicable for CPU configurations.) Select the number of GPUs; subject to quota and GPU type by region. For more information about GPU-equipped instances and associated resources, see .

Copy your license key from the registration email message. If you have not received your license key, contact your Sales Representative or register for your 30-day trial .

Initial Version
Final Version
Upgrade Steps

The Helm Chart is located in the HEAVY.AI GitHub repository. It can be found here: https://releases.heavy.ai/ee/helm/heavyai-1.0.0.tgz

File Name
Description
sudo apt update
sudo apt upgrade -y
sudo reboot
sudo apt install linux-headers-$(uname -r)
sudo apt install pciutils
sudo apt install libvulkan1
lspci -v | egrep "3D|VGA*.NVIDIA" | awk -F '\[|\]' ' { print $2 } '
Tesla T4
sudo apt install build-essential
chmod +x NVIDIA-Linux-x86_64-*.run
sudo ./NVIDIA-Linux-x86_64-*.run
apt list nvidia-driver-*
Listing... Done

nvidia-driver-450/bionic-updates,bionic-security 460.91.03-0ubuntu0.18.04.1 amd64
nvidia-driver-450-server/bionic-updates,bionic-security 450.172.01-0ubuntu0.18.04.1 amd64
nvidia-driver-455/bionic-updates,bionic-security 460.91.03-0ubuntu0.18.04.1 amd64
nvidia-driver-460/bionic-updates,bionic-security 470.103.01-0ubuntu0.18.04.1 amd64
nvidia-driver-465/bionic-updates,bionic-security 470.103.01-0ubuntu0.18.04.1 amd64
nvidia-driver-470/bionic-updates,bionic-security 470.103.01-0ubuntu0.18.04.1 amd64
nvidia-driver-470-server/bionic-updates,bionic-security 470.103.01-0ubuntu0.18.04.1 amd64
nvidia-driver-495/bionic-updates,bionic-security 510.60.02-0ubuntu0.18.04.1 amd64
nvidia-driver-510/bionic-updates,bionic-security 510.60.02-0ubuntu0.18.04.1 amd64
nvidia-driver-510-server/bionic-updates,bionic-security 510.47.03-0ubuntu0.18.04.1 amd64
sudo apt install nvidia-driver-<version>
sudo reboot
NVIDIA-SMI has failed because it couldn't communicate with the NVIDIA driver. 
Make sure that the latest NVIDIA driver is installed and running.
sudo apt install libvulkan1
distribution=$(. /etc/os-release;echo $ID$VERSION_ID | sed -e 's/\.//g')
sudo apt-key adv --fetch-keys \
https://developer.download.nvidia.com/compute/cuda/repos/$distribution/x86_64/3bf863cc.pub
echo "deb http://developer.download.nvidia.com/compute/cuda/repos/$distribution/x86_64 /" \
| sudo tee /etc/apt/sources.list.d/cuda.list
apt update
apt list cuda-toolkit-* | grep -v config

Listing...
cuda-toolkit-10-0/unknown 10.0.130-1 amd64
cuda-toolkit-10-1/unknown 10.1.243-1 amd64
cuda-toolkit-10-2/unknown 10.2.89-1 amd64
cuda-toolkit-11-0/unknown 11.0.3-1 amd64
cuda-toolkit-11-1/unknown 11.1.1-1 amd64
cuda-toolkit-11-2/unknown 11.2.2-1 amd64
cuda-toolkit-11-3/unknown 11.3.1-1 amd64
cuda-toolkit-11-4/unknown 11.4.4-1 amd64
cuda-toolkit-11-5/unknown 11.5.2-1 amd64
cuda-toolkit-11-6/unknown 11.6.2-1 amd64
cuda-toolkit-11-7/unknown 11.7.0-1 amd64
sudo apt install cuda-toolkit-<version>
/usr/local/cuda/bin/nvcc --version

nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2020 NVIDIA Corporation
Built on Mon_Nov_30_19:08:53_PST_2020
Cuda compilation tools, release 11.2, V11.2.67
Build cuda_11.2.r11.2/compiler.29373293_0
sudo apt install clang
clang --version
clang version 10.0.0-4ubuntu1 
Target: x86_64-pc-linux-gnu
Thread model: posix
InstalledDir: /usr/bin
sudo docker volume ls -q -f driver=nvidia-docker \
| xargs -r -I{} -n1 docker ps -q -a -f volume={} | xargs -r docker rm -f
sudo apt-get purge nvidia-docker
sudo apt-get remove docker docker-engine docker.io containerd runc
sudo curl -fsSL https://download.docker.com/linux/ubuntu/gpg \
| sudo apt-key add -
sudo add-apt-repository \
   "deb [arch=amd64] https://download.docker.com/linux/ubuntu \
   $(lsb_release -cs) \
   stable"
sudo apt update
sudo apt install docker-ce docker-ce-cli containerd.io
sudo usermod  --append --groups docker $USER
sudo docker run hello-world
curl --silent --location https://nvidia.github.io/nvidia-container-runtime/gpgkey | \
sudo apt-key add -
distribution=$(. /etc/os-release;echo $ID$VERSION_ID)
curl --silent --location https://nvidia.github.io/nvidia-container-runtime/$distribution/nvidia-container-runtime.list | \
  sudo tee /etc/apt/sources.list.d/nvidia-container-runtime.list
sudo apt-get update
sudo apt-get install -y nvidia-container-runtime
{
  "default-runtime": "nvidia",
  "runtimes": {
     "nvidia": {
         "path": "/usr/bin/nvidia-container-runtime",
         "runtimeArgs": []
     }
 }
}
sudo pkill -SIGHUP dockerd
sudo docker run --gpus=all \
--rm nvidia/cuda:11.0-runtime-ubuntu20.04 nvidia-smi
sudo mkdir -p /var/lib/heavyai && sudo chown $USER /var/lib/heavyai
echo "port = 6274
http-port = 6278
calcite-port = 6279
data = \"/var/lib/heavyai\"
null-div-by-zero = true

[web]
port = 6273
frontend = \"/opt/heavyai/frontend\"" \
>/var/lib/heavyai/heavy.conf
if test -d /var/lib/heavyai; then echo "There is $(df -kh /var/lib/heavyai --output="avail" | sed 1d) available space in your storage dir"; else echo "There was a problem with the creation of storage dir";  fi;
sudo docker run -d --gpus=all \
-v /var/lib/heavyai:/var/lib/heavyai \
-p 6273-6278:6273-6278 \
heavyai/heavyai-ee-cuda:latest
sudo docker run -d \
-v /var/lib/heavyai:/var/lib/heavyai \
-p 6273-6278:6273-6278 \
heavyai/heavyai-ee-cpu:latest
sudo docker run -d --gpus=all \
-v /var/lib/heavyai:/var/lib/heavyai \
-p 6273-6278:6273-6278 \
heavyai/core-os-cuda:latest
sudo docker run -d \
-v /var/lib/heavyai:/var/lib/heavyai \
-p 6273-6278:6273-6278 \
heavyai/core-os-cpu:latest
sudo docker container ps --format "{{.Image}} {{.Status}}" \
-f status=running | grep heavyai\/
heavyai/heavyai-ee-cuda Up 48 seconds ago 
sudo apt install ufw
sudo ufw allow ssh
sudo ufw disable
sudo ufw allow 6273:6278/tcp
sudo ufw enable
sudo docker container ps
CONTAINER ID        IMAGE                     COMMAND                     CREATED             STATUS              PORTS                                            NAMES
9e01e520c30c        heavyai/heavyai-ee-gpu    "/bin/sh -c '/heavyai..."   50 seconds ago      Up 48 seconds ago   0.0.0.0:6273-6280->6273-6280/tcp                 confident_neumann
sudo docker exec -it 9e01e520c30c bash
sudo docker exec -it <container-id> \
./insert_sample_data --data /var/lib/heavyai/storage
Enter dataset number to download, or 'q' to quit:
#     Dataset                   Rows    Table Name             File Name
1)    Flights (2008)            7M      flights_2008_7M        flights_2008_7M.tar.gz
2)    Flights (2008)            10k     flights_2008_10k       flights_2008_10k.tar.gz
3)    NYC Tree Census (2015)    683k    nyc_trees_2015_683k    nyc_trees_2015_683k.tar.gz
sudo docker exec -it <container-id> bin/heavysql 
SELECT origin_city AS "Origin", 
dest_city AS "Destination", 
ROUND(AVG(airtime),1) AS "Average Airtime" 
FROM flights_2008_10k 
WHERE distance < 175 GROUP BY origin_city,
dest_city;
Origin|Destination|Average Airtime
West Palm Beach|Tampa|33.8
Norfolk|Baltimore|36.1
Ft. Myers|Orlando|28.7
Indianapolis|Chicago|39.5
Tampa|West Palm Beach|33.3
Orlando|Ft. Myers|32.6
Austin|Houston|33.1
Chicago|Indianapolis|32.7
Baltimore|Norfolk|31.7
Houston|Austin|29.6
ssh -i MyKey.pem centos@ec2-12-345-678-901.us-west-2.compute.amazonaws.com
$HEAVYAI_PATH/bin/heavysql
sudo docker container ps --format "{{.Id}} {{.Image}}" \
-f status=running | grep omnisci\/
9e01e520c30c omnisci/omnisci-ee-gpu
docker container stop 9e01e520c30c
docker container rm 9e01e520c30c
tar zcvf /backup_dir/omnisci_storage_backup.tar.gz /var/lib/omnisci
sudo docker run -d --gpus=all \
  -v /var/lib/heavyai:/var/lib/heavyai \
  -p 6273-6278:6273-6278 \
  heavyai/heavyai-ee-cuda:latest
sudo docker run -d -v \
/var/lib/heavyai:/var/lib/heavyai \
-p 6273-6278:6273-6278 \
heavyai/heavyai-ee-cpu:latest
sudo docker run -d --gpus=all \
  -v /var/lib/heavyai:/var/lib/heavyai \
  -p 6273-6278:6273-6278 \
  heavyai/core-os-cuda:latest
sudo docker run -d -v \
/var/lib/heavyai:/var/lib/heavyai \
-p 6273-6278:6273-6278 \
heavyai/core-os-cpu:latest
sudo docker container ps --format "{{.Image}} {{.Status}}" \
-f status=running | grep heavyai\/
heavyai/heavyai-ee-cuda Up 48 seconds ago 
sudo systemctl stop heavydb heavy_web_server
sudo yum update heavyai.x86_64
sudo apt update
sudo apt upgrade heavyai
sudo mv /opt/heavyai /opt/heavyai_backup
sudo systemctl start heavydb heavy_web_server
     Helm-workspace
          ↳heavyai
               ↳Chart.yml
               ↳values.yml
	       ↳templates
	            ↳README.pdf
                    ↳deployment.yml
          ↳misc
               ↳example-heavyai-pv.yml
               ↳example-heavyai-pvc.yml

Chart.yml

HEAVY.AI Helm Chart. Contains version and contact information.

values.yml

Copy this file and edit values specific to your HEAVY.AI deployment. This is where to note the PVC name. This file is annotated to identify typical customizations and is pre-populated with default values.

README.pdf

These instructions.

deployment.yml

HEAVY.AI platform deployment template. DO NOT EDIT

example-heavyai-pv.yml

Example PV file.

example-heavyai-pvc.yml

Example PVC file.

# Default values for heavyai.
# This is a YAML-formatted file.
# Declare variables to be passed into your templates.
#
# Version of heavyai to install in the format 'v7.0.0' or 'latest' for the latest version released.
version: v7.0.0
# Persistent volume claim name to use with heavyai.
pvcName: heavyai-pvc
# Namespace to install heavyai in.
nameSpace: heavyai
# Number of GPUs to assign to heavyai, or 0 to run the CPU version of heavyai.
gpuNumber: 1
# NodeName to install heavyai on, if you wish to let Kubernetes schedule a host, leave it blank.
nodeName: heavyai-node
# Immerse port redirect of 6273.
hostPortImmerse: 9273
# TCP port redirect of 6274.
hostPortTCP: 9274
# HTTP port redirect of 6278.
hostPortHTTP: 9278
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
 name: heavyai-pvc
 namespace: heavyai
spec:
 volumeMode: Filesystem
 accessModes:
   - ReadWriteOnce
 resources:
   requests:
     storage: 100Gi
 storageClassName: heavyai
apiVersion: v1
kind: PersistentVolume
metadata:
 name: heavyai-pv
spec:
 capacity:
   storage: 100Gi
 volumeMode: Filesystem
 accessModes:
   - ReadWriteOnce
 persistentVolumeReclaimPolicy: Retain
 storageClassName: heavyai
 mountOptions:
   - hard
   - nfsvers=4.1
 nfs:
   path: {your nfs path goes here }
   server: { your nfs server name goes here }
FAQ

Ports

HEAVY.AI uses the following ports.

Port

Service

Use

6273

heavy_web_server

Used to access Heavy Immerse.

6274

heavydb tcp

Used by connectors (heavyai, omnisql, odbc, and jdbc) to access the more efficient Thrift API.

6276

heavy_web_server

Used to access the HTTP/JSON thrift API.

6278

heavydb http

Used to directly access the HTTP/binary thrift API, without having to proxy through heavy_web_server. Recommended for debugging use only.

Services and Utilities

Uninstalling

This is a recipe to permanently remove HEAVY.AI Software, services, and data from your system.

Uninstalling HEAVY.AI from Docker

To uninstall HEAVY.AI in Docker, stop and delete the current Docker container.

In a terminal window, get the Docker container ID:

sudo docker container ps --format "{{.Id}} {{.Image}}" \
-f status=running | grep heavyai\/

You should see an output similar to the following. The first entry is the container ID. In this example, it is 9e01e520c30c:

9e01e520c30c omnisci/omnisci-ee-gpu

To see all containers, both running and stopped, use the following command:

sudo docker container ps -a

Stop the HEAVY.AI Docker container. For example:

sudo docker container stop 9e01e520c30c

Remove the HEAVY.AI Docker container to save disk space. For example:

sudo docker container rm 9e01e520c30c

Uninstalling HEAVY.AI on Redhat and Ubuntu

To uninstall an existing system installed with Yum, Apt, or Tarball, connect as the user that runs the platform, typically heavyai.

Disable and stop all HEAVY.AI services.

sudo systemctl disable heavy_web_server --now
sudo systemctl disable heavydb --now

Remove the HEAVY.AI Installation files. (the $HEAVYAI_PATH defaults to /opt/heavyai)

sudo yum remove heavyai.x86_64
sudo apt remove heavyai
sudo rm -r $(readlink $HEAVYAI_PATH) $HEAVYAI_PATH

Delete the configuration files and the storage by removing the $HEAVYAI_BASE directory (defaults to /var/lib/heavyai).

sudo rm  -r $HEAVYAI_BASE

Permanently remove the service configuration files.

sudo rm /lib/systemd/system/heavydb*.service
sudo rm /lib/systemd/system/heavy_web_server*.service
sudo systemctl daemon-reload
sudo systemctl reset-failed

Configuration Parameters

Overview

HEAVY.AI has minimal configuration requirements with a number of additional configuration options. This topic describes the required and optional configuration changes you can use in your HEAVY.AI instance.

In release 4.5.0 and higher, HEAVY.AI requires that all configuration flags used at startup match a flag on the HEAVY.AI server. If any flag is misspelled or invalid, the server does not start. This helps ensure that all settings are intentional and will not have an unexpected impact on performance or data integrity.

Storage Directory

Before starting the HEAVY.AI server, you must initialize the persistent storage directory. To do so, create an empty directory at the desired path, such as /var/lib/heavyai.

  1. Create the environment variable $HEAVYAI_BASE.

export HEAVYAI_BASE=/var/lib/heavyai

2. Then, change the owner of the directory to the user that the server will run as ($HEAVYAI_USER):

sudo mkdir -p $HEAVYAI_BASE
sudo chown -R $HEAVYAI_USER $HEAVYAI_BASE

where $HEAVYAI_USER is the system user account that the server runs as, such as heavyai, and $HEAVYAI_BASE is the path to the parent of the HEAVY.AI server storage directory.

3. Run $HEAVYAI_PATH/bin/initheavy with the storage directory path as the argument:

$HEAVYAI_PATH/bin/initheavy $HEAVYAI_BASE/storage

Configuring a Custom Heavy Immerse Subdirectory

Immerse serves the application from the root path (/) by default. To serve the application from a sub-path, you must modify the $HEAVYAI_PATH/frontend/app-config.js file to change the IMMERSE_PATH_PREFIX value. The Heavy Immerse path must start with a forward slash (/).
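For example, to serve Heavy Immerse from /immerse instead of the root path, the IMMERSE_PATH_PREFIX entry would be changed to something like the following (a sketch only; the exact structure of app-config.js in your installation may differ):

IMMERSE_PATH_PREFIX: "/immerse",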

Configuration File

The configuration file stores runtime options for your HEAVY.AI servers. You can use the file to change the default behavior.

The heavy.conf file is stored in the $HEAVYAI_BASE directory. The configuration settings are picked up automatically by the sudo systemctl start heavydb and sudo systemctl start heavy_web_server commands.

Set the flags in the configuration file using the format <flag> = <value>. Strings must be enclosed in quotes.

The following is a sample configuration file. The entry for data path is a string and must be in quotes. The last entry in the first section, for null-div-by-zero, is the Boolean value true and does not require quotes.

port = 6274 
http-port = 6278
data = "/var/lib/heavyai/storage"
null-div-by-zero = true

[web]
port = 6273
frontend = "/opt/heavyai/frontend"
servers-json = "/var/lib/heavyai/servers.json"
enable-https = true

To comment out a line in heavy.conf, prepend the line with the pound sign (#) character.
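For example, to disable the null-div-by-zero setting shown above without deleting it:

# null-div-by-zero = true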

For encrypted backend connections, if you do not use a configuration file to start the database, Calcite expects passwords to be supplied through the command line, and calcite passwords will be visible in the processes table. If a configuration file is supplied, then passwords must be supplied in the file. If they are not, Calcite will fail.

Executor Resource Manager

Overview

To enable concurrent execution of queries, we introduce the concept of an Executor Resource Manager (ERM). This keeps track of compute and memory resources to gate query execution and ensures that compute resources are not over-subscribed. As of version 7.0, ERM is enabled by default.

The ERM evaluates several kinds of resources required by a query. Currently this includes CPU cores, GPUs, buffer and result set memory. It will leverage all available resources unless policy limits have been established, such as for maximum memory use or query time. It determines both the ideal/maximum amount of resources desirable for optimal performance and the minimum required. For example, a CPU query scanning 8 fragments could run with up to 8 threads, but could execute with as little as a single CPU thread with correspondingly less memory if needed.

The ERM establishes a request queue. On every new request, as well as every time an existing request is completed, it checks available resources and picks the next resource request to grant. It currently always gives preference to earlier requests if resources permit launching them (first in, first out, or “FIFO”).

If the system-level multi-executor flag is enabled, the ERM will allow multiple queries to execute at once so long as resources are available. Currently, multiple execution is allowed for CPU queries (and multiple CPU queries and a single GPU query). This supports significant throughput gains by allowing inter-query-kernel concurrency, in addition to the major win of not having a long-running CPU query block the queue for other CPU queries or interactive GPU queries. The number of queries that can be run in parallel is limited by the number of executors

Use of CPU and GPU

By default, if HeavyDB is compiled to run on GPUs and if GPUs are available, query steps/kernels will execute on GPU UNLESS:

  1. Some operations in the query step cannot run on GPU. Operators like MODE, APPROX_MEDIAN/PERCENTILE, and certain string functions are examples.

  2. Update and delete queries currently run on CPU.

  3. The query step requires more memory than available on GPU, but less than available on CPU.

  4. A user explicitly requests their query run on CPU, either via setting a session flag or via a query hint.

At the instance level, this behavior can be configured with system flags on startup. For example, a system with GPUs can be configured to use only CPU with the cpu-only flag, and the system's use of CPU RAM can be controlled with cpu-buffer-mem-bytes. Execution can also be routed to different device types with query hints such as “SELECT /*+ cpu_mode */ …”. These controls do not require the ERM and are platform-wide.
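As a sketch, an instance could be forced to CPU execution and given an explicit CPU memory cap in heavy.conf; the flags are those described above, and the values are illustrative only:

# Run all queries on CPU and cap CPU buffer memory at 64 GB
cpu-only = true
cpu-buffer-mem-bytes = 68719476736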

Example Use Cases

Example 1: (no tuning required)

In a scenario where the system does not have enough memory available for the CPU cache, or the cache itself is too fragmented to accommodate all of the columns’ chunks, the ERM, instead of failing the query with an out-of-memory (OOM) error, will:

  1. run the query reading a single chunk at a time, moving data to the GPU caches for GPU execution.

  2. if there is not enough GPU memory, run the query chunk by chunk in CPU mode. In this case the query runs slower, but this frees up the GPU executor for less memory-demanding queries.

Example 2: (minimal tuning required)

You are deploying a new dashboard or chart which doesn’t require big data or high performance, and so you prefer to run it just on CPU. This way it doesn’t interfere with other performance-critical dashboards or charts.

  1. Set the dashboard chart execution to CPU using query hints. Instead of referencing data directly, set a new “custom data source.” For example, if your data is in a table called ‘mydata’, add the CPU query hint in the custom source immediately after the SELECT keyword, as shown in the sketch after this list. You can repeat this for a data source supporting any number of charts, including all charts.

  2. Bump up the number of executors (default 4) to 6-8. With more executors free, the dashboard will perform better, without impacting the performance of the other dashboards.
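A minimal sketch of the custom data source described in step 1, using the example table name mydata:

SELECT /*+ cpu_mode */ * FROM mydata;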

Example 3: (some tuning required)

Improving performance of memory-intensive operations like high cardinality aggregates.

A user conducting exact “count distinct” operations on large, high-cardinality datasets, which are likely to run on CPU, on a server with many CPU cores might employ the following strategy:

  1. Increase the number of executors (default 4) to 8-16. --num-executors=16

  2. Limit total CPU memory use with --cpu-buffer-mem-bytes (default: 80% of system memory) to make some room for large result sets, which are now limited by the executor-cpu-result-mem-ratio.

Queries with sparse values and high cardinality that use a wide count distinct buffer are pushed to CPU execution. Lower the executor-per-query-max-cpu-threads-ratio parameter to reduce the number of cores that run a single query; the group-by buffers are then built faster, lowering the memory footprint and speeding up the query runtime.
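As a sketch, the strategy above might translate into startup settings such as the following in heavy.conf; the parameters are those named above, and the values are illustrative and depend on your hardware:

# More executors for concurrent CPU-heavy queries
num-executors = 16
# Leave CPU RAM headroom for large result sets (bytes)
cpu-buffer-mem-bytes = 137438953472
# Use fewer CPU threads per query to reduce its memory footprint
executor-per-query-max-cpu-threads-ratio = 0.5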

Installation Recipes
Upgrading OmniSci
Configuration Flags and Runtime Settings
Loading Data
Using OmniSci Immerse
Vega Tutorials
Try Vega
Heavy Immerse Chart Types
Try HEAVY.AI Cloud
Getting Started with AWS AMI
Getting Started with Microsoft Azure
Getting Started on Google Cloud Platform
Vega Rendering API Overview
omnisql
Thrift
JDBC
ODBC
Vega
RJDBC
pyomnisci
Release Notes
Known Issues, Limitations, and Changes to Default Behavior
Get Started with HEAVY.AI
See Install Options
Install HEAVY.AI
here
SQL Editor
Vulkan Renderer
https://developer.nvidia.com/cuda-zone
https://developer.nvidia.com/cuda-toolkit-archive
https://www.nvidia.com/download/index.aspx
minimum requirements
minimum requirements
Vulkan Renderer
Docker Installation Guide
Install NVIDIA Drivers and Vulkan on Ubuntu
CUDA JIT Cache
https://help.ubuntu.com/lts/serverguide/firewall.html
here
here
AWS Marketplace page for HEAVY.AI
Tips for Securing Your EC2 Instance
Introduction to Heavy Immerse
Loading Data
Connecting to Your Linux Instance from Windows Using PuTTY
heavysql
Upgrading from Omnisci to HEAVY.AI 6.0
CUDA JIT Cache
the Micrsoft Azure home page
HEAVY.AI installation instructions
these instructions
checking your available quota for a project
heavyai-launcher-public project on Google Cloud Platform
HEAVY.AI Enterprise Edition (BYOL)
HEAVY.AI Enterprise Edition for CPU (BYOL)
HEAVY.AI Open Source Edition
HEAVY.AI for CPU (Open Source)
GPU Models for Compute Engine
here
https://releases.heavy.ai/ee/helm/heavyai-1.0.0.tgz
Option 1
Option 2
Option 3
Option 1
²
¹
¹


CUDA Compatibility Drivers

This procedure is considered experimental.

Installing the Drivers

Use the following commands to install the CUDA 11 compatibility drivers on Ubuntu:

wget https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2004/x86_64/cuda-ubuntu2004.pin

mv cuda-ubuntu2004.pin /etc/apt/preferences.d/cuda-repository-pin-600

apt-key adv --fetch-keys https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2004/x86_64/7fa2af80.pub

add-apt-repository "deb https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2004/x86_64/ /"

apt update

nvidia-smi

apt install cuda-compat-11-0

nvidia-smi

export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/usr/local/cuda-11.0/compat/

nvidia-smi

After the last nvidia-smi, ensure that CUDA shows the correct version.

The driver version will still show as the old version.

Updating systemd Files

After installing the drivers, update the systemd service file /lib/systemd/system/heavydb.service.

In the [Service] section, add or update the Environment property:

Environment=LD_LIBRARY_PATH=/usr/local/cuda-11.0/compat:$LD_LIBRARY_PATH

The file should look like this:

[Unit] 
Description=HEAVY.AI database server 
After=network.target remote-fs.target

[Service] 
Environment=LD_LIBRARY_PATH=/usr/local/cuda-11.0/compat:$LD_LIBRARY_PATH
User=heavyai 
Group=heavyai 
WorkingDirectory=/opt/heavyai
ExecStart=/opt/heavyai/bin/heavydb --config /var/lib/heavyai/heavy.conf 
KillMode=control-group 
SuccessExitStatus=143 
LimitNOFILE=65536 
Restart=always

[Install] 
WantedBy=multi-user.target

Then force systemd to reload its configuration:

sudo systemctl daemon-reload

Upgrading from Omnisci to HEAVY.AI 6.0

This section provides a recipe for upgrading from the Omnisci platform 5.5+ to HEAVY.AI 6.0.

Considerations when Upgrading from Omnisci to HEAVY.AI Platform

If you are upgrading from Omnisci to HEAVY.AI, there are several additional steps compared to a simple sub-version upgrade.

Before Upgrading to Release 6.0

IMPORTANT - Before you begin, stop all running services / Docker images of your Omnisci installation and create a backup of the $OMNISCI_STORAGE folder (typically /var/lib/omnisci). A backup is essential for recoverability; do not proceed with the upgrade without confirming that a full and consistent backup is available and ready to be restored.

The omnisci database is not automatically renamed to the new default name heavyai. This must be done manually, as documented in the upgrade steps.

Dumps created with the DUMP command on Omnisci cannot be restored after the database is upgraded to this version.

Essential Changes for release 6.0 of HEAVY.AI compared to Omnisci

The following table describes the changes to environment variables, storage locations, and filenames in Release 6.0 compared to Release 5.x. Except where noted, revised storage subfolders, symlinks for old folder names, and filenames are created automatically on server start.

Change descriptions in bold require user intervention.

| Description | Omnisci 5.x | HEAVY.AI 6.0 |
| --- | --- | --- |
| Environment variable for storage location | $OMNISCI_STORAGE | $HEAVYAI_BASE |
| Default location for $HEAVYAI_BASE / $OMNISCI_STORAGE | /var/lib/omnisci | /var/lib/heavyai |
| Fixed location for Docker $HEAVYAI_BASE / $OMNISCI_STORAGE | /omnisci-storage | /var/lib/heavyai |
| The folder containing catalogs for $HEAVYAI_BASE / $OMNISCI_STORAGE | data/ | storage/ |
| Storage subfolder - data | data/mapd_data | storage/data |
| Storage subfolder - catalog | data/mapd_catalogs | storage/catalogs |
| Storage subfolder - import | data/mapd_import | storage/import |
| Storage subfolder - export | data/mapd_export | storage/export |
| Storage subfolder - logs | data/mapd_log | storage/log |
| Server INFO logs | omnisci_server.INFO | heavydb.INFO |
| Server ERROR logs | omnisci_server.ERROR | heavydb.ERROR |
| Server WARNING logs | omnisci_server.WARNING | heavydb.WARNING |
| Web Server ACCESS logs | omnisci_web_server.ACCESS | heavy_web_server.ACCESS |
| Web Server ALL logs | omnisci_web_server.ALL | heavy_web_server.ALL |
| Install directory | /omnisci (Docker), /opt/omnisci (bare metal) | /opt/heavyai/ (Docker and bare metal) |
| Binary file - core server (located in install directory) | bin/omnisci_server | bin/heavydb |
| Binary file - web server (located in install directory) | bin/omnisci_web_server | bin/heavy_web_server |
| Binary file - command-line SQL utility | bin/omnisql | bin/heavysql |
| Binary file - JDBC jar | bin/omnisci-jdbc-5.10.2-SNAPSHOT.jar | bin/heavydb-jdbc-6.0.0-SNAPSHOT.jar |
| Binary file - Utilities (SqlImporter) jar | bin/omnisci-utility-5.10.2-SNAPSHOT.jar | bin/heavydb-utility-6.0.0-SNAPSHOT.jar |
| HEAVY.AI Server service (for bare metal install) | omnisci_server | heavydb |
| HEAVY.AI Web Server service (for bare metal install) | omnisci_web_server | heavy_web_server |
| Default configuration file | omnisci.conf | heavy.conf |

Upgrade Instructions

The order of these instructions is significant. To avoid problems, follow the order of the instructions provided and do not skip any steps.

Assumptions

This upgrade procedure assumes that you are using the default storage locations for both Omnisci and HEAVY.AI.

| $OMNISCI_STORAGE | $HEAVYAI_BASE |
| --- | --- |
| /var/lib/omnisci | /var/lib/heavyai |

Upgrading Using Docker

Stop all containers running Omnisci services.

In a terminal window, get the Docker container IDs:

sudo docker container ps --format "{{.Id}} {{.Image}}" \
-f status=running | grep omnisci\/

You should see output similar to the following. The first entry is the container ID. In this example, it is 9e01e520c30c:

9e01e520c30c omnisci/omnisci-ee-gpu

Stop the Omnisci Docker container. For example:

sudo docker container stop 9e01e520c3

Back up the Omnisci data directory (typically /var/lib/omnisci).

tar zcvf /backup_dir/omnisci_storage_backup.tar.gz /var/lib/omnisci

Rename the Omnisci data directory to reflect the HEAVY.AI naming scheme.

sudo mv /var/lib/omnisci /var/lib/heavyai
sudo mv /var/lib/heavyai/data /var/lib/heavyai/storage

Create a new configuration file for heavydb, changing the data parameter to point to the renamed data directory:

cat /var/lib/heavyai/omnisci.conf | \
sed "s/^\(data.*=.*\)/#\1\\ndata = \"\/var\/lib\/heavyai\/storage\"/" | \
sed "s/^\(frontend.*=.*\)/#\1\\nfrontend = \"\/opt\/heavyai\/frontend\"/" \
>/var/lib/heavyai/heavy.conf
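Assuming the original omnisci.conf contained data and frontend entries (the commented-out values and the port lines below are illustrative, not taken from your file), the resulting heavy.conf looks similar to this, with all other settings carried over unchanged:

port = 6274
http-port = 6278
#data = "/var/lib/omnisci/data"
data = "/var/lib/heavyai/storage"
#frontend = "/omnisci/frontend"
frontend = "/opt/heavyai/frontend"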

Rename the Omnisci license file (EE and FREE only).

mv /var/lib/heavyai/storage/omnisci.license \
/var/lib/heavyai/storage/heavyai.license

Download and run the 6.0 version of the HEAVY.AI Docker image.

Run the command below that matches the Edition (Enterprise, Free, or Open Source) and execution device (GPU or CPU) you are upgrading.

sudo docker run -d --gpus=all \
-v /var/lib/heavyai:/var/lib/heavyai \
-p 6273-6278:6273-6278 \
heavyai/heavyai-ee-cuda:v6.0.0
sudo docker run -d \
-v /var/lib/heavyai:/var/lib/heavyai \
-p 6273-6278:6273-6278 \
heavyai/heavyai-ee-cpu:v6.0.0
sudo docker run -d --gpus=all \
-v /var/lib/heavyai:/var/lib/heavyai \
-p 6273-6278:6273-6278 \
heavyai/core-os-cuda:v6.0.0
sudo docker run -d \
-v /var/lib/heavyai:/var/lib/heavyai \
-p 6273-6278:6273-6278 \
heavyai/core-os-cpu:v6.0.0

Check that Docker is up and running using a docker ps command:

sudo docker container ps --format "{{.Id}} {{.Image}} {{.Status}}" \
-f status=running | grep heavyai\/

You should see output similar to the following:

9e01e520c30c heavyai/heavyai-ee-cuda Up 48 seconds ago 

Using the new container ID, rename the default omnisci database to heavyai:

sudo docker exec -it 9e01e520c30c bash -c \
'echo "alter database omnisci rename to heavyai;" | bin/heavysql omnisci'

Check that everything is running as expected.

Upgrading to HEAVY.AI Using Package Managers or Tarball

Use the following steps to upgrade an existing system installed with package managers or a tarball. The commands upgrade HEAVY.AI in place without disturbing your configuration or stored data.

Back up the Omnisci Database

Stop the Omnisci services.

sudo systemctl stop omnisci_web_server omnisci_server

Back up the Omnisci data directory (typically /var/lib/omnisci).

tar zcvf /backup_dir/omnisci_storage_backup.tar.gz /var/lib/omnisci

Create a user named heavyai, who will be the owner of the HEAVY.AI software and data on the filesystem. Use the first command on distributions that use the wheel group (for example, CentOS/RHEL) and the second on distributions that use the sudo group (for example, Ubuntu).

sudo useradd --shell /bin/bash --user-group --create-home --group wheel heavyai
sudo useradd --shell /bin/bash --user-group --create-home --group sudo heavyai

Set a password for the user. It is needed when using sudo.

sudo passwd heavyai

Log in as the newly created user:

sudo su - heavyai

Rename the Omnisci data directory to reflect the HEAVY.AI naming scheme and change its ownership to the heavyai user.

sudo chown -R heavyai:heavyai /var/lib/omnisci
sudo mv /var/lib/omnisci /var/lib/heavyai
mv /var/lib/heavyai/data /var/lib/heavyai/storage

Create the "semaphore" catalog directory; we'll have to remove it later "

mkdir /var/lib/heavyai/storage/catalogs

Check that everything is in order and that the "semaphore" directory has been created:

ls -la /var/lib/heavyai/storage/

All directories must belong to the heavyai user, and the catalogs directory must be present:

total 32
drwxr-xr-x  8 heavyai heavyai 4096 lug 15 16:03 .
drwxr-xr-x  4 heavyai heavyai 4096 lug 15 16:02 ..
drwxrwxr-x  2 heavyai heavyai 4096 lug 15 16:03 catalogs
drwxr-xr-x  2 heavyai heavyai 4096 lug 15 15:54 mapd_catalogs
drwxr-xr-x 52 heavyai heavyai 4096 lug 15 15:54 mapd_data
drwxr-xr-x  2 heavyai heavyai 4096 lug 15 15:54 mapd_export
drwxr-xr-x  2 heavyai heavyai 4096 lug 15 15:54 mapd_log
drwxr-xr-x  2 heavyai heavyai 4096 lug 15 15:54 omnisci_disk_cache
-rw-r--r--  1 heavyai heavyai 1229 lug 15 16:07 omnisci-licence

Rename the license file. (EE and FREE only)

mv /var/lib/heavyai/storage/omnisci.license \
/var/lib/heavyai/storage/heavyai.license

Install the HEAVY.AI Software

Please follow all the installation and configuration steps until the Initialization step.

Update the configuration file and rename the default database

Log in with the heavyai user and ensure the heavyai services are stopped.

sudo systemctl stop heavy_web_server heavydb

Create a new configuration file for heavydb, changing the data parameter to point to the /var/lib/heavyai/storage directory and the frontend to the new install directory.

cat /var/lib/heavyai/omnisci.conf | \
sed "s/^\(data.*=.*\)/#\1\\ndata = \"\/var\/lib\/heavyai\/storage\"/" | \
sed "s/^\(frontend.*=.*\)/#\1\\nfrontend = \"\/opt\/heavyai\/frontend\"/" \
>/var/lib/heavyai/heavy.conf

All the settings of the upgraded database will be moved to the new configuration file.

Now complete the database migration.

Remove the "semaphore" directory created previously. (This is a fundamental step for the Omnisci-to-HeavyDB upgrade.)

rmdir /var/lib/heavyai/storage/catalogs

To complete the upgrade, start the HEAVY.AI servers.

sudo systemctl start heavydb heavy_web_server

Check that the database migrated by running this command and looking for the Rebrand migration complete message.

sudo systemctl status heavydb

Rename the default omnisci database to heavyai. Run the command as an administrative user (typically admin) with that user's password (default HyperInteractive).

echo "alter database omnisci rename to heavyai;" \
| /opt/heavyai/bin/heavysql -p HyperInteractive -u admin omnisci 

Restart the database service and check that everything is running as expected.

Remove Omnisci Software from the System

After all checks confirm that the upgraded system is stable, clean up the system by removing the Omnisci installation and its related system configuration. Permanently remove the service configuration files.

sudo rm /lib/systemd/system/omnisci_server*.service
sudo rm /lib/systemd/system/omnisci_web_server*.service
sudo systemctl daemon-reload
sudo systemctl reset-failed

Remove the installed software.

sudo rm -Rf /opt/omnisci

Delete the YUM or APT repositories.

sudo rm /etc/yum.repos.d/omnisci.repo
sudo rm /etc/apt/sources.list.d/omnisci.list

Using Services

HEAVY.AI features two system services: heavydb and heavy_web_server. You can start these services individually using systemd.

Starting and Stopping HeavyDB Using systemd

For permanent installations of HeavyDB, HEAVY.AI recommends that you use systemd to manage HeavyDB services. systemd automatically handles tasks such as log management, starting the services on restart, and restarting the services if there is a problem.

Initial Setup

You use the install_heavy_systemd.sh script to prepare systemd to run HEAVY.AI services. The script asks questions about your environment, then installs the systemd service files in the correct location. You must run the script as the root user so that the script can perform tasks such as creating directories and changing ownership.

cd $HEAVYAI_PATH/systemd
sudo ./install_heavy_systemd.sh

The install_heavy_systemd.sh script asks for the information described in the following table.

| Variable | Use | Default | Notes |
| --- | --- | --- | --- |
| HEAVYAI_PATH | Path to HeavyDB installation directory | Current install directory | HEAVY.AI recommends heavyai as the install directory. |
| HEAVYAI_BASE | Path to the storage directory for HeavyDB data and configuration files | heavyai | Must be dedicated to HEAVY.AI. The installation script creates the directory $HEAVYAI_STORAGE/data, generates an appropriate configuration file, and saves the file as $HEAVYAI_STORAGE/heavy.conf. |
| HEAVYAI_USER | User HeavyDB is run as | Current user | User must exist before you run the script. |
| HEAVYAI_GROUP | Group HeavyDB is run as | Current user's primary group | Group must exist before you run the script. |

Starting HeavyDB Using systemd

To manually start HeavyDB using systemd, run:

sudo systemctl start heavydb
sudo systemctl start heavy_web_server

Restarting HeavyDB Using systemd

You can use systemd to restart HeavyDB — for example, after making configuration changes:

sudo systemctl restart heavydb
sudo systemctl restart heavy_web_server

Stopping HeavyDB Using systemd

To manually stop HeavyDB using systemd, run:

sudo systemctl stop heavydb
sudo systemctl stop heavy_web_server

Enabling HeavyDB on Startup

To enable the HeavyDB services to start on restart, run:

sudo systemctl enable heavydb
sudo systemctl enable heavy_web_server

Using Configuration Parameters

Configuration Parameters for HeavyDB

Following are the parameters for runtime settings on HeavyDB. The parameter syntax provides both the implied value and the default value as appropriate. Optional arguments are in square brackets, while implied and default values are in parentheses.

For example, consider allow-loop-joins [=arg(=1)] (=0).

  • If you do not use this flag, loop joins are not allowed by default.

  • If you provide no arguments, the implied value is 1 (true) (allow-loop-joins).

  • If you provide the argument 0, that is the same as the default (allow-loop-joins=0).

  • If you provide the argument 1, that is the same as the implied value (allow-loop-joins=1).
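As a short sketch, the same flag can be supplied on the command line or in heavy.conf; the paths below assume a default bare-metal installation:

# command line: implied value (allow-loop-joins=1)
/opt/heavyai/bin/heavydb --config /var/lib/heavyai/heavy.conf --allow-loop-joins

# equivalent heavy.conf entry (explicit value)
allow-loop-joins = true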

Flag

Description

Default Value

allow-cpu-retry [=arg]

Allow the queries that failed on GPU to retry on CPU, even when watchdog is enabled. When watchdog is enabled, most queries that run on GPU and throw a watchdog exception fail. Turn this on to allow queries that fail the watchdog on GPU to retry on CPU. The default behavior is for queries that run out of memory on GPU to throw an error if watchdog is enabled. Watchdog is enabled by default.

TRUE[1]

allow-cpu-kernel-concurrency

Allow for multiple queries to run execution kernels concurrently on CPU.

Example: In a system with 4 executors (controlled by the num-executors parameter), 3+1 queries can run concurrently on CPU (the +1 depends on allow-cpu-gpu-kernel-concurrency).

DEFAULT: ON

allow-cpu-gpu-kernel-concurrency

Allow multiple queries to run execution kernels concurrently on CPU while a GPU query is executing.

Example: In a system with 4 executors (controlled by the num-executors parameter), one of the 4 slots can be used to run a GPU query while the other 3 run on CPU.

DEFAULT: ON

allow-local-auth-fallback [=arg(=1)] (=0)

If SAML or LDAP logins are enabled, and the logins fail, this setting enables authentication based on internally stored login credentials. Command-line tools or other tools that do not support SAML might reject those users from logging in unless this feature is enabled. This allows a user to log in using credentials on the local database.

FALSE[0]

allow-loop-joins [=arg(=1)] (=0)

FALSE[0]

allowed-export-paths = ["root_path_1", "root_path_2", ...]

Specify a list of allowed root paths that can be used in export operations, such as the COPY TO command. Helps prevent exploitation of security vulnerabilities and prevent server crashes, data breaches, and full remote control of the host machine. For example:

allowed-export-paths = ["/heavyai-storage/data/heavyai_export", "/home/centos"] The list of paths must be on the same line as the configuration parameter.

Allowed file paths are enforced by default. The default export path (<data directory>/heavyai_export) is allowed by default, and all child paths of that path are allowed.

When using commands with other paths, the provided paths must be under an allowed root path. If you try to use a nonallowed path in a COPY TO command, an error response is returned.

N/A

allow-s3-server-privileges

Allow S3 server privileges if IAM user credentials are not provided. Credentials can be specified with environment variables (such as AWS_ACCESS_KEY_ID, AWS_SECRET_ACCESS_KEY, and so on), an AWS credentials file, or when running on an EC2 instance, with an IAM role that is attached to the instance.

FALSE[0]

allowed-import-paths = ["root_path_1", "root_path_2", ...]

Specify a list of allowed root paths that can be used in import operations, such as the COPY FROM command. Helps prevent exploitation of security vulnerabilities and prevent server crashes, data breaches, and full remote control of the host machine.

For example:

allowed-import-paths = ["/heavyai-storage/data/heavyai_import", "/home/centos"] The list of paths must be on the same line as the configuration parameter.

Allowed file paths are enforced by default. The default import path (<data directory>/heavyai_import) is allowed by default, and all child paths of that allowed path are allowed.

When using commands with other paths, the provided paths must be under an allowed root path. If you try to use a nonallowed path in a COPY FROM command, an error response is returned.

N/A

approx_quantile_buffer arg

Size of a temporary buffer that is used to copy in the data for the APPROX_MEDIAN calculation. When full, it is sorted before being merged into the internal distribution buffer configured by approx_quantile_centroids.

1000

approx_quantile_centroids arg

Size of the internal buffer used to approximate the distribution of the data for which the APPROX_MEDIAN calculation is taken. The larger the value, the greater the accuracy of the answer.

300

auth-cookie-name arg

Configure the authentication cookie name. If not explicitly set, the default name is oat.

oat

bigint-count [=arg]

Use 64-bit count. Disabled by default because 64-bit integer atomics are slow on GPUs. Enable this setting if you see negative values for a count, indicating overflow. In addition, if your data set has more than 4 billion records, you likely need to enable this setting.

FALSE[0]

bitmap-memory-limit arg

Set the maximum amount of memory (in GB) allocated for APPROX_COUNT_DISTINCT bitmaps per execution kernel (thread or GPU).

8

calcite-max-mem arg

Max memory available to calcite JVM. Change if Calcite reports out-of-memory errors.

1024

calcite-port arg

Calcite port number. Change to avoid collisions with ports already in use.

6279

calcite-service-timeout

Service timeout value, in milliseconds, for communications with Calcite. On databases with large numbers of tables, large numbers of concurrent queries, or many parallel updates and deletes, Calcite might return less quickly. Increasing the timeout value can prevent THRIFT_EAGAIN timeout errors.

5000

columnar-large-projections [=arg]

Sets automatic use of columnar output, instead of row-wise output, for large projections.

TRUE

columnar-large-projections-threshold arg

Set the row-number threshold size for columnar output instead of row-wise output.

1000000

config arg

Path to heavy.conf. Change for testing and debugging.

$HEAVYAI_STORAGE/heavy.conf

cpu-only

Run in CPU-only mode. Set this flag to force HeavyDB to run in CPU mode, even when GPUs are available. Useful for debugging and on shared-tenancy systems where the current HeavyDB instance does not need to run on GPUs.

FALSE

cpu-buffer-mem-bytes arg

Size of memory reserved for CPU buffers [bytes]. Change to restrict the amount of CPU/system memory HeavyDB can consume. A default value of 0 indicates no limit on CPU memory use. (HEAVY.AI Server uses all available CPU memory on the system.)

0

cuda-block-size arg

Size of block to use on GPU. GPU performance tuning: Number of threads per block. Default of 0 means use all threads per block.

0

cuda-grid-size arg

Size of grid to use on GPU. GPU performance tuning: Number of blocks per device. Default of 0 means use all available blocks per device.

0

data arg

Directory path to HEAVY.AI catalogs. Change for testing and debugging.

$HEAVYAI_STORAGE

db-query-list arg

N/A

dynamic-watchdog-time-limit [=arg]

Dynamic watchdog time limit, in milliseconds. Change if dynamic watchdog is stopping queries expected to take longer than this limit.

100000

enable-auto-clear-render-mem [=arg]

Enable/disable clear render gpu memory on out-of-memory errors during rendering. If an out-of-gpu-memory exception is thrown while rendering, many users respond by running \clear_gpu via the heavysql command-line interface to refresh/defrag the memory heap. This process can be automated with this flag enabled. At present, only GPU memory in the renderer is cleared automatically.

TRUE[1]

enable-auto-metadata-update [=arg]

Enable automatic metadata updates on UPDATE queries. Automatic metadata updates are turned on by default. Disabling may result in stale metadata and reductions in query performance.

TRUE[1]

enable-columnar-output [=arg]

Allows HEAVY.AI Core to directly materialize intermediate projections and the final ResultSet in Columnar format where appropriate. Columnar output is an internal performance enhancement that projects the results of an intermediate processing step in columnar format. Consider disabling this feature if you see unexpected performance regressions in your queries.

TRUE[1]

enable-data-recycler [=arg]

Set to TRUE to enable the data recycler. Enabling the recycler enables the following:

  • Hashtable recycler, which is the cache storage.

  • Hashing scheme recycler, which preserves a hashtable layout (such as perfect hashing and keyed hashing).

  • Overlaps hashtable tuning parameter recycler. Each overlap hashtable has its own parameters used during hashtable building.

TRUE[1]

enable-debug-timer [=arg]

Enable fine-grained query execution timers for debug. For debugging, logs verbose timing information for query execution (time to load data, time to compile code, and so on).

FALSE[0]

enable-direct-columnarization [=arg(=1)](=0)

Columnarization organizes intermediate results in a multi-step query in the most efficient way for the next step in the process. If you see an unexpected performance regression, you can try setting this value to false, enabling the earlier HEAVY.AI columnarization behavior.

TRUE[1]

enable-dynamic-watchdog [=arg]

Enable dynamic watchdog.

FALSE[0]

enable-executor-resource-mgr [=arg]

Enable the executor resource manager. Set to FALSE[0] to disable it.

TRUE[1]

enable-filter-push-down [=arg(=1)] (=0)

FALSE[0]

enable-foreign-table-scheduled-refresh [=arg]

Enable scheduled refreshes of foreign tables. Enables automated refresh of foreign tables with "REFRESH_TIMING_TYPE" option of "SCHEDULED" based on the specified refresh schedule.

TRUE[1]

enable-geo-ops-on-uncompressed-coords [=arg(=1)] (=0)

Allow geospatial operations ST_Contains and ST_Intersects to process uncompressed coordinates where possible to increase execution speed. Provides control over the selection of ST_Contains and ST_Intersects implementations. By default, for certain combinations of compressed geospatial arguments, such as ST_Contains(POLYGON, POINT), the implementation can process uncompressed coordinate values. This can result in much faster execution but could decrease precision. Disabling this option enables full decompression, which is slower but more precise.

TRUE[1]

enable-logs-system-tables [=arg(=1)] (=0)

Enable use of logs system tables. Also enables the Request Logs and Monitoring system dashboard (Enterprise Edition only).

FALSE[0]

enable-overlaps-hashjoin [=arg(=1)] (=0)

Enable the overlaps hash join framework allowing for range join (for example, spatial overlaps) computation using a hash table.

TRUE[1]

enable-runtime-query-interrupt [=arg(=1)] (=0)

FALSE[0]

enable-runtime-udf

Enable runtime user defined function registration. Enables runtime registration of user defined functions. This functionality is turned off unless you specifically request it, to prevent unintentional inclusion of nonstandard code. This setting is a precursor to more advanced object permissions planned in future releases.

FALSE[0]

enable-string-dict-hash-cache [=arg(=1)] (=0)

When importing a large table with low cardinality, set the flag to TRUE and leave it on to assist with bulk queries. If using String Dictionary Server, set the flag to FALSE if the String Dictionary server uses more memory than the physical system can support.

TRUE[1]

enable-thrift-logs [=arg(=1)] (=0)

Enable writing messages directly from Thrift to stdout/stderr. Change to enable verbose Thrift messages on the console.

FALSE[0]

enable-watchdog [arg]

Enable watchdog.

TRUE[1]

executor-cpu-result-mem-ratio

Set the executor resource manager's reserved memory for query result sets as a ratio greater than 0 of the system memory that is not allocated to the CPU buffer pool. Values of 1.0 are permitted to allow over-subscription when warranted, but too high a value can cause out-of-memory errors.

Example: In a system with 256 GB of RAM, the default CPU buffer size is 204.8 GB, so the ratio is applied to the remaining 51.2 GB; the default ratio of 0.8 limits the maximum result-set memory for a single query to about 41 GB.

executor-cpu-result-mem-bytes

Set executor resource manager reserved memory for query result sets in bytes. This overrides the default reservation of 80% of the system memory that is not allocated for the CPU buffer pool. Use 0 for auto.

DEFAULT: None (result memory size is controlled via the ratio setting above)

executor-per-query-max-cpu-threads-ratio

Set max fraction of executor resource manager total CPU slots/threads that can be allocated for a single query.

Note that executor-per-query-max-cpu-threads-ratio can have values greater than 1 to allow over-subscription of threads when warranted, because the estimate of kernel core occupation can be overly pessimistic for some classes of queries. Take care not to set this value too high, because thrashing and thread starvation can result. Example: on a physical server with 24 logical CPUs, or in a VM with 24 vCPUs, the executor thread count is doubled to 48, so a value of 0.9 uses up to 43 threads for a single query. Lower this value to reduce the memory requirements of individual queries.

DEFAULT: 0.9

executor-per-query-max-cpu-result-mem-ratio

Set max fraction of executor resource manager total CPU result memory reservation that can be allocated for a single query.

Note that executor-per-query-max-cpu-result-mem-ratio can have values greater than 1 to allow over-subscription of memory when warranted, but be careful: too high a value can cause out-of-memory errors.

Default: 0.8

filter-push-down-low-frac

Higher threshold for selectivity of filters which are pushed down. Filters with selectivity lower than this threshold are considered for a push down.

filter-push-down-passing-row-ubound

Upper bound on the number of rows that should pass the filter if the selectivity is less than the high fraction threshold.

flush-log [arg]

Immediately flush logs to disk. Set to FALSE if this is a performance bottleneck.

TRUE[1]

from-table-reordering [=arg(=1)] (=1)

Enable automatic table reordering in FROM clause. Reorders the sequence of a join to place large tables on the inside of the join clause and smaller tables on the outside. HEAVY.AI also reorders tables between join clauses to prefer hash joins over loop joins. Change this value only in consultation with an HEAVY.AI engineer.

TRUE[1]

gpu-buffer-mem-bytes [=arg]

Size of memory reserved for GPU buffers in bytes per GPU. Change to restrict the amount of GPU memory HeavyDB can consume per GPU. A default value of 0 indicates no limit on GPU memory use (HeavyDB uses all available GPU memory across all active GPUs on the system).

0

Maximum amount of memory in bytes that can be used for the GPU code cache.

134217728 (128MB)

gpu-input-mem-limit arg

Force query to CPU when input data memory usage exceeds this percentage of available GPU memory. HeavyDB loads data to GPU incrementally until data exceeds GPU memory, at which point the system retries on CPU. Loading data to GPU evicts any resident data already loaded or any query results that are cached. Use this limit to avoid attempting to load datasets to GPU when they obviously will not fit, preserving cached data on GPU and increasing query performance. If watchdog is enabled and allow-cpu-retry is not enabled, the query fails instead of re-running on CPU.

0.9

hashtable-cache-total-bytes [=arg]

The total size of the cache storage for hashtable recycler, in bytes. Increase the cache size to store more hashtables. Must be larger than or equal to the value defined in max-cacheable-hashtable-size-bytes.

4294967296 (4GB)

hll-precision-bits [=arg]

Number of bits used from the hash value used to specify the bucket number. Change to increase or decrease approx_count_distinct() precision. Increased precision decreases performance.

11

http-port arg

HTTP port number. Change to avoid collisions with ports already in use.

6278

idle-session-duration arg

Maximum duration of an idle session, in minutes. Change to increase or decrease duration of an idle session before timeout.

60

inner-join-fragment-skipping [=arg(=1)] (=0)

Enable or disable inner join fragment skipping. Enables skipping fragments for improved performance during inner join operations.

FALSE[0]

license arg

Path to the file containing the license key. Change if your license file is in a different location or has a different name.

log-auto-flush

Flush logging buffer to file after each message. Changing to false can improve performance, but log lines might not appear in the log for a very long time. HEAVY.AI does not recommend changing this setting.

TRUE[1]

log-directory arg

Path to the log directory. Can be either a relative path to the $HEAVYAI_STORAGE/data directory or an absolute path. Use this flag to control the location of your HEAVY.AI log files. If the directory does not exist, HEAVY.AI creates the top level directory. For example, a/b/c/logdir is created only if the directory path a/b/c already exists.

/var/lib/heavyai/data/heavyai_log

log-file-name

Boilerplate for the name of the HEAVY.AI log files. You can customize the name of your HEAVY.AI log files. {SEVERITY} is the only braced token recognized. It allows you to create separate files for each type of error message greater than or equal to the log-severity configuration option.

heavydb.{SEVERITY}.%Y%m%d-%H%M%S.log

log-max-files

Maximum number of log files to keep. When the number of log files exceeds this number, HEAVY.AI automatically deletes the oldest files.

100

log-min-free-space

Minimum number of bytes left on device before oldest log files are deleted. This is a safety feature to be sure the disk drive of the log directory does not fill up, and guarantees that at least this many bytes are free.

20971520

log-rotation-size

Maximum file size in bytes before new log files are started. Change to increase/decrease size of files. If log files fill quickly, you might want to increase this number so that there are fewer log files.

10485760

log-rotate-daily

Start new log files at midnight. Set to false to write to log files until they are full, rather than restarting each day.

TRUE[1]

log-severity

Log to file severity levels:

DEBUG4

DEBUG3

DEBUG2

DEBUG1

INFO

WARNING

ERROR

FATAL

All levels after your chosen base severity level are listed. For example, if you set the severity level to WARNING, HEAVY.AI only logs WARNING, ERROR, and FATAL messages.

INFO

log-severity-clog

Log to console severity level: INFO WARNING ERROR FATAL. Output chosen severity messages to STDERR from running process.

WARNING

log-symlink

heavydb.{SEVERITY}.log

log-user-id

Log internal numeric user IDs instead of textual user names.

log-user-origin

Look up the origin of inbound connections by IP address and DNS name and print this information as part of stdlog. Some systems throttle DNS requests or have other network constraints that preclude timely return of user origin information. Set to FALSE to improve performance on those networks or when large numbers of users from different locations make rapid connect/disconnect requests to the server.

TRUE[1]

logs-system-tables-max-files-count [=arg]

Maximum number of log files that can be processed by each logs system table.

100

max-cacheable-hashtable-size-bytes [=arg]

Maximum size of the hashtable that the hashtable recycler can store. Limiting the size can enable more hashtables to be stored. Must be less than or equal to the value defined in hashtable-cache-total-bytes.

2147483648 (2GB)

max-session-duration arg

Maximum duration of the active session, in minutes. Change to increase or decrease session duration before timeout.

43200 (30 days)

null-div-by-zero [=arg]

Allows processing to complete when the dataset would cause a divide-by-zero error. Set to TRUE to return null when dividing by zero, or FALSE to throw an exception.

FALSE[0]

num-executors arg

Beta functionality in Release 5.7. Set the number of executors.

num-gpus arg

-1

num-reader-threads arg

Number of reader threads to use. Drop the number of reader threads to prevent imports from using all available CPU power. Default is to use all threads.

0

overlaps-bucket-threshold arg

The minimum size of a bucket corresponding to a given inner table range for the overlaps hash join.

-p | port int

HeavyDB server port. Change to avoid collisions with other services if 6274 is already in use.

6274

pending-query-interrupt-freq=arg

Frequency with which to check the interrupt status of pending queries, in milliseconds. Values larger than 0 are valid. If you set pending-query-interrupt-freq=100, each session's interrupt status is checked every 100 ms.

For example, assume you have three sessions (S1, S2, and S3) in your queue, and assume S1 contains a running query, and S2 and S3 hold pending queries. If you set pending-query-interrupt-freq=1000, both S2 and S3 are interrupted every 1000 ms (1 sec). See running-query-interrupt-freq for information about interrupting running queries. Decreasing the value increases the speed with which pending queries are removed, but also increases resource usage.

1000 (1 sec)

pki-db-client-auth [=arg]

Attempt authentication of users through a PKI certificate. Set to TRUE for the server to attempt PKI authentication.

FALSE[0]

read-only [=arg(=1)]

Enable read-only mode. Prevents changes to the dataset.

FALSE[0]

render-mem-bytes arg

Specifies the size of a per-GPU buffer that render query results are written to; allocated at the first rendering call. Persists while the server is running unless you run \clear_gpu_memory. Increase if rendering a large number of points or symbols and you get the following out-of-memory exception: Not enough OpenGL memory to render the query results.

Default is 500 MB.

500000000

render-oom-retry-threshold = arg

A render execution time limit in milliseconds to retry a render request if an out-of-gpu-memory error is thrown. Requires enable-auto-clear-render-mem = true. If enable-auto-clear-render-mem = true, a retry of the render request can be performed after an out-of-gpu-memory exception. A retry only occurs if the first run took less than the threshold set here (in milliseconds). The retry is attempted after the render GPU memory is automatically cleared. If an OOM exception occurs, clearing the memory might allow the request to succeed. Providing a reasonable threshold can give more stability to memory-constrained servers with rendering enabled. Only a single retry is attempted. A value of 0 disables retries.

rendering [=arg]

Enable or disable backend rendering. Disable rendering when not in use, freeing up memory reserved by render-mem-bytes. To reenable rendering, you must restart HEAVY.AI Server.

TRUE[1]

res-gpu-mem =arg

Reserved memory for GPU. Reserves extra memory for your system (for example, if the GPU is also driving your display, such as on a laptop or single-card desktop). HEAVY.AI uses all the memory on the GPU except for render-mem-bytes + res-gpu-mem. Also useful if other processes, such as a machine-learning pipeline, share the GPU with HEAVY.AI. In advanced rendering scenarios or distributed setups, increase to free up additional memory for the renderer, or for aggregating results for the renderer from multiple leaf nodes. HEAVY.AI recommends always setting res-gpu-mem when using backend rendering.

134217728

running-query-interrupt-freq arg

Controls the frequency of interruption status checking for running queries. Range: 0.0 (less frequently) to 1.0 (more frequently).

For example, if you have 10 threads that evaluate a query of a table that has 1000 rows, then each thread advances its thread index up to 10 times. In this case, if you set the flag close to 1.0, you check a session's interrupt status for every increment of the thread index.

If you set the flag value close to 0.0, the session's interrupt status is checked only when the index increment is close to 10. The default value for running-query interrupt checking is close to half of the maximum increment of the thread index.

Frequent interrupt status checking reduces latency for the interrupt but also can decrease query performance.

seek-kafka-commit = <N>

Set the offset of the last Kafka message to be committed from a Kafka data stream so that Kafka does not resend those messages. After the Kafka server commits messages through message N, it resends messages starting at message N+1. This is particularly useful when you want to create a replica of the HEAVY.AI server from an existing data directory.

N/A

ssl-cert path

Path to the server's public PKI certificate (.crt file). Define the path to the .crt file. Used to establish an encrypted binary connection.

ssl-keystore path

Path to the server keystore. Used for an encrypted binary connection. The path to Java trust store containing the server's public PKI key. Used by HeavyDB to connect to the encrypted Calcite server port.

ssl-keystore-password password

The password for the SSL keystore. Used to create a binary encrypted connection to the Calcite server.

ssl-private-key path

Path to the server's private PKI key. Define the path to the HEAVY.AI server PKI key. Used to establish an encrypted binary connection.

ssl-trust-ca path

Enable use of CA-signed certificates presented by Calcite. Defines the file that contains trusted CA certificates. This information enables the server to validate the TCP/IP Thrift connections it makes as a client to the Calcite server. The certificate presented by the Calcite server is the same as the certificate used to identify the database server to its clients.

ssl-trust-ca-server path

ssl-trust-password password

The password for the SSL trust store. Password to the SSL trust store containing the server's public PKI key. Used to establish an encrypted binary connection.

ssl-trust-store path

The path to the Java trustStore containing the server's public PKI key. Used by the Calcite server to connect to the encrypted HeavyDB server port, to establish an encrypted binary connection.

start-gpu arg

FALSE[0]

trivial-loop-join-threshold [=arg]

The maximum number of rows in the inner table of a loop join considered to be trivially small.

1000

use-hashtable-cache

Set to TRUE to enable the hashtable recycler. Supports complex scenarios, such as hashtable recycling for queries that have subqueries.

TRUE[1]

vacuum-min-selectivity [=arg]

Specify the percentage (with a value of 0 implying 0% and a value of 1 implying 100%) of deleted rows in a fragment at which to perform automatic vacuuming.

Automatic vacuuming occurs when deletes or updates on variable-length columns result in a percentage of deleted rows in a fragment exceeding the specified threshold. The default threshold is 10% of deleted rows in a fragment.

When changing this value, consider the most common types of queries run on the system. In general, if you have infrequent updates and deletes, set vacuum-min-selectivity to a low value. Set it higher if you have frequent updates and deletes, because vacuuming adds overhead to affected UPDATE and DELETE queries.

watchdog-none-encoded-string-translation-limit [=arg]

The number of strings that can be cast using the ENCODED_TEXT string operator.

1,000,000

window-function-frame-aggregation-tree-fanout [=arg]

Fan-out of the aggregation tree used to compute aggregations over the window frame.

8

Additional Enterprise Edition Parameters

Following are additional parameters for runtime settings for the Enterprise Edition of HeavyDB. The parameter syntax provides both the implied value and the default value as appropriate. Optional arguments are in square brackets, while implied and default values are in parentheses.

Flag

Description

Default Value

cluster arg

Path to data leaves list JSON file. Indicates that the HEAVY.AI server instance is an aggregator node, and where to find the rest of its cluster. Change for testing and debugging.

$HEAVYAI_BASE

compression-limit-bytes [=arg(=536870912)] (=536870912)

Compress result sets that are transferred between leaves. Minimum length of payload above which data is compressed.

536870912

compressor arg (=lz4hc)

lz4hc

ldap-dn arg

LDAP Distinguished Name.

ldap-role-query-regex arg

RegEx to use to extract role from role query result.

ldap-role-query-url arg

LDAP query role URL.

ldap-superuser-role arg

The role name to identify a superuser.

ldap-uri arg

LDAP server URI.

leaf-conn-timeout [=arg]

Leaf connect timeout, in milliseconds. Increase or decrease to fail Thrift connections between HeavyDB instances more or less quickly if a connection cannot be established.

20000

leaf-recv-timeout [=arg]

Leaf receive timeout, in milliseconds. Increase or decrease to fail Thrift connections between HeavyDB instances more or less quickly if data is not received in the time allotted.

300000

leaf-send-timeout [=arg]

Leaf send timeout, in milliseconds. Increase or decrease to fail Thrift connections between HeavyDB instances more or less quickly if data is not sent in the time allotted.

300000

saml-metadata-file arg

Path to identity provider metadata file.

Required for running SAML. An identity provider (like Okta) supplies a metadata file. From this file, HEAVY.AI uses:

  1. Public key of the identity provider to verify that the SAML response comes from it and not from somewhere else.

  2. URL of the SSO login page used to obtain a SAML token.

saml-sp-target-url arg

URL of the service provider for which SAML assertions should be generated. Required for running SAML. Used to verify that a SAML token was issued for HEAVY.AI and not for some other service.

saml-sync-roles arg (=0)

Enable mapping of SAML groups to HEAVY.AI roles. The SAML Identity provider (for example, Okta) automatically creates users at login and assigns them roles they already have as groups in SAML.

saml-sync-roles [=0]

string-servers arg

Path to string servers list JSON file. Indicates that HeavyDB is running in distributed mode and is required to designate a leaf server when running in distributed mode.

Configuration Parameters for HEAVY.AI Web Server

Following are the parameters for runtime settings on HeavyAI Web Server. The parameter syntax provides both the implied value and the default value as appropriate. Optional arguments are in square brackets, while implied and default values are in parentheses.

Flag

Description

Default

additional-file-upload-extensions <string>

Denote additional file extensions for uploads. Has no effect if --enable-upload-extension-check is not set.

allow-any-origin

Allows a CORS exception to the same-origin policy. Must be set to true if Immerse is hosted on a different domain or subdomain than the one hosting heavy_web_server and heavydb.

Allowing any origin is less secure than the heavy_web_server default.

--allow-any-origin = false

-b | backend-url <string>

URL to http-port on heavydb. Change to avoid collisions with other services.

http://localhost:6278

-B | binary-backend-url <string>

URL to http-binary-port on heavydb.

http://localhost:6276

cert string

Certificate file for HTTPS. Change for testing and debugging.

cert.pem

-c | config <string>

Path to HeavyDB configuration file. Change for testing and debugging.

-d | data <string>

Path to HeavyDB data directory. Change for testing and debugging.

data

data-catalog <string>

Path to data catalog directory.

n/a

docs string

Path to documentation directory. Change if you move your documentation files to another directory.

docs

enable-binary-thrift

Use the binary thrift protocol.

TRUE[1]

enable-browser-logs [=arg]

Enable access to current log files via web browser. Only super users (while logged in) can access log files.

Log files are available at http[s]://host:port/logs/log_name.

The web server log files: ACCESS - http[s]://host:port/logs/access ALL - http[s]://host:port/logs/all

HeavyDB log files: INFO - http[s]://host:port/logs/info WARNING - http[s]://host:port/logs/warning ERROR - http[s]://host:port/logs/

FALSE[0]

enable-cert-verification

TLS certificate verification is a security measure that can be disabled when TLS certificates are not issued by a trusted certificate authority. If you use a locally or unofficially generated TLS certificate to secure the connection between heavydb and heavy_web_server, set this parameter to false. By default, heavy_web_server expects a certificate from a trusted certificate authority.

--enable-cert-verification = true

enable-cross-domain [=arg]

Enable frontend cross-domain authentication. Cross-domain session cookies require the SameSite = None; Secure headers. Can only be used with HTTPS domains; requires enable-https to be true.

FALSE[0]

enable-https

Enable HTTPS support. Change to enable secure HTTP.

enable-https-authentication

Enable PKI authentication.

enable-https-redirect [=arg]

FALSE[0]

enable-non-kernel-time-query-interrupt

Enable non-kernel-time query interrupt.

TRUE[1]

enable-runtime-query-interrupt

Enable runtime query interrupt.

TRUE[1]

enable-upload-extension-check

Enables the restrictive file-extension check for uploaded files.

encryption-key-file-path <string>

Path to the file containing the credential payload cipher key. Key must be 256 bits in length.

-f | frontend string

Path to frontend directory. Change if you move the location of your frontend UI files.

frontend

http-to-https-redirect-port = arg

6280

idle-session-duration = arg

Idle session default, in minutes.

60

jupyter-prefix-string <string>

Jupyter Hub base_url for Jupyter integration.

/jupyter

jupyter-url-string <string>

URL for Jupyter integration.

-j |jwt-key-file

Path to a key file for client session encryption.

The file is expected to be a PEM-formatted (.pem) certificate file containing the unencrypted private key in PKCS #1, PKCS #8, or ASN.1 DER form.

Example PEM file creation using OpenSSL.

Required only if using a high-availability server configuration or another server configuration that requires an instance of Immerse to talk to multiple heavy_web_server instances.

Each heavy_web_server instance needs to use the same encryption key to encrypt and decrypt client session information which is used for session persistence ("sessionization") in Immerse.

key <string>

Key file for HTTPS. Change for testing and debugging.

key.pem

max-tls-version

Refers to the version of TLS encryption used to secure web protocol connections. Specifies a maximum TLS version.

min-tls-version

Refers to the version of TLS encryption used to secure web protocol connections. Specifies a minimum TLS version.

--min-tls-version = VersionTLS12

peer-cert <string>

Peer CA certificate PKI authentication.

peercert.pem

-p | port int

Frontend server port. Change to avoid collisions with other services.

6273

-r | read-only

Enable read-only mode. Prevent changes to the data.

secure-acao-uri

If set, ensures that all Access-Control-Allow-Origin headers are set to the value provided.

servers-json <string>

Path to servers.json. Change for testing and debugging.

session-id-header <string>

Session ID header.

immersesid

ssl-cert <string>

SSL validated public certificate.

sslcert.pem

ssl-private-key <string>

SSL private key file.

sslprivate.key

strip-x-headers <strings>

List of custom X http request headers to be removed from incoming requests. Use --strip-x-headers="" to allow all X headers through.

[X-HeavyDB-Username]

timeout duration

Maximum request duration in #h#m#s format. For example 0h30m0s represents a duration of 30 minutes. Controls the maximum duration of individual HTTP requests. Used to manage resource exhaustion caused by improperly closed connections. This also limits the execution time of queries made over the Thrift HTTP transport. Increase the duration if queries are expected to take longer than the default duration of one hour; for example, if you COPY FROM a large file when using heavysql with the HTTP transport.

1h0m0s

tls-cipher-suites <strings>

Refers to the combination of algorithms used in TLS encryption to secure web protocol connections.

All available TLS cipher suites compatible with HTTP/2:

  • TLS_RSA_WITH_RC4_128_SHA

  • TLS_RSA_WITH_AES_128_CBC_SHA

  • TLS_ECDHE_RSA_WITH_AES_128_ GCM_SHA256

  • TLS_ECDHE_ECDSA_WITH_AES_128_ GCM_SHA256

  • TLS_ECDHE_RSA_WITH_AES_256_ GCM_SHA384

  • TLS_ECDHE_ECDSA_WITH_AES_256_ GCM_SHA384

  • TLS_ECDHE_RSA_WITH_CHACHA20_ POLY1305

  • TLS_ECDHE_ECDSA_WITH_CHACHA20_ POLY1305

  • TLS_AES_128_GCM_SHA256

  • TLS_AES_256_GCM_SHA384

  • TLS_CHACHA20_POLY1305_SHA256

  • TLS_FALLBACK_SCSV


    Limit security vulnerabilities by specifying the allowed TLS ciphers in the encryption used to secure web protocol connections.

The following cipher suites are accepted by default:

  • TLS_ECDHE_RSA_WITH_AES_128_ GCM_SHA256

  • TLS_ECDHE_ECDSA_WITH_AES_128_ GCM_SHA256

  • TLS_ECDHE_RSA_WITH_AES_256_ GCM_SHA384

  • TLS_RSA_WITH_AES_256_GCM_ SHA384

tls-curves <strings>

Refers to the types of Elliptic Curve Cryptography (ECC) used in TLS encryption to secure web protocol connections.

All available TLS elliptic Curve IDs:

  • secp256r1 (Curve ID P256)

  • CurveP256 (Curve ID P256)

  • secp384r1 (Curve ID P384)

  • CurveP384 (Curve ID P384)

  • secp521r1 (Curve ID P521)

  • CurveP521 (Curve ID P521)

  • x25519 (Curve ID X25519)

  • X25519 (Curve ID X25519)

    Limit security vulnerabilities by specifying the allowed TLS curves in the encryption used to secure web protocol connections.

The following TLS curves are accepted by default:

  • CurveP521

  • CurveP384

  • CurveP256

tmpdir string

Path for temporary file storage. Used as a staging location for file uploads. Consider locating this directory on the same file system as the HEAVY.AI data directory. If not specified on the command line, heavy_web_server recognizes the standard TMPDIR environment variable as well as a specific HEAVYAI_TMPDIR environment variable, the latter of which takes precedence. If you use neither the command-line argument nor one of the environment variables, the default /tmp is used.

/tmp

ultra-secure-mode

Enables secure mode that sets Access-Control-Allow-Origin headers to --secure-acao-uri and sets security headers like X-Frame-Options, Content-Security-Policy, and Strict-Transport-Security.

-v | verbose

Enable verbose logging. Adds log messages for debugging purposes.

version

Return version.

Encrypted Credentials in Custom Applications

HEAVY.AI can accept a set of encrypted credentials for secure authentication of a custom application. This topic provides a method for providing an encryption key to generate encrypted credentials and configuration options for enabling decryption of those encrypted credentials.

Generating an Encryption Key
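The key-generation code sample is not included in this excerpt. As a sketch, a 256-bit key can be produced with OpenSSL; the file path and raw-binary key encoding shown here are assumptions:

openssl rand -out /var/lib/heavyai/credential_key.bin 32
chmod 600 /var/lib/heavyai/credential_key.bin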

Configuring the Web Server

Set the encryption-key-file-path web server parameter in heavyai.conf to the path of the encryption key file:

Alternatively, you can set the path using the --encryption-key-file-path=path/to/file command-line argument.
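For example, the entry in the [web] section might look like this (the key-file path is an assumption):

[web]
encryption-key-file-path = "/var/lib/heavyai/credential_key.bin"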

Generating Encrypted Credentials

Connecting Using SAML

Security Assertion Markup Language (SAML) is used for exchanging authentication and authorization data between security domains. SAML uses security tokens containing assertions (statements that service providers use to make decisions about access control) to pass information about a principal (usually an end user) between a SAML authority, named an Identity Provider (IdP), and a SAML consumer, named a Service Provider (SP). SAML enables web-based, cross-domain, single sign-on (SSO), which helps reduce the administrative overhead of sending multiple authentication tokens to the user.

If you use SAML for authentication to HEAVY.AI and SAML login fails, HEAVY.AI automatically falls back to LDAP login if LDAP is configured.

If both SAML and LDAP authentication fail, you are authenticated against a locally stored password, but only if the allow-local-auth-fallback flag is set.

  1. A user uses a login page to connect to HEAVY.AI.

  2. The HEAVY.AI login page redirects the user to the Okta login page.

  3. The user signs in using an Okta account. (This step is skipped if the user is already logged in to Okta.)

  4. Okta returns a base64-encoded SAML Response to the user, which contains a SAML Assertion that the user is allowed to use HEAVY.AI. If configured, it also returns a list of SAML Groups assigned to the user.

  5. Okta redirects the user to the HEAVY.AI login page together with the SAML response (a token).

  6. HEAVY.AI verifies the token, and retrieves the user name and groups. Authentication and authorization is complete.

In addition to Okta, the following SAML providers are also supported:

Registering Your SAML Application in Okta

1) Log into your Okta account and click the Admin button.

2) From the Applications menu, select Applications.

3) Click the Add Application button.

4) On the Add Application screen, click Create New App.

5) On the Create a New Application Integration page, set the following details:

  • Platform: Web

  • Sign on Method: SAML 2.0

    And then, click Create.

6) On the Create SAML Integration page, in the App name field, type Heavyai and click Next.

7) In the SAML Settings page, enter the following information:

  • Audience URI (SP Entity ID): Your Heavy Immerse web URL with the suffix saml-post.

  • Default RelayState: Forward slash (/).

  • Application username: HEAVY.AI recommends using the email address you used to log in to Okta.

Leave other settings at their default values, or change as required for your specific installation.

After making your selections, click Next.

8) In the Help Okta Support... page, click I'm an Okta customer adding an internal app. All other questions on this page are optional.

After making your selections, click Finish.

Your application is now registered and displayed, and the Sign On tab is selected.

Configuring SAML for Your HEAVY.AI Application

Before configuring SAML, make sure that HTTPS is enabled on your web server.

On the Sign On tab, configure SAML settings for your application:

1) On the Settings page, click View Setup Instructions.

2) On the How to Configure SAML 2.0 for HEAVY.AI Application page, scroll to the bottom, copy the XML fragment in the Provide the following IDP metadata to your SP provider box, and save it as a raw text file called idp.xml.

3) Upload idp.xml to your HEAVY.AI server in $HEAVYAI_STORAGE.

4) Edit heavy.conf and add the following configuration parameters:

  • saml-metadata-file: Path to the idp.xml file you created.

  • saml-sp-target-url: Web URL to your Heavy Immerse saml-post endpoint.

  • saml-signed-assertion: Boolean value that determines whether Okta signs the assertion; true by default.

  • saml-signed-response: Boolean value that determines whether Okta signs the response; true by default.

    For example:

  • In the web section, add the full physical path to the servers.json file; for example:

5) On the How to Configure SAML 2.0 for HEAVY.AI Application page, copy the Identity Provider Single Sign-On URL, which looks similar to this:

6) If the servers.json file you identified in the [web] section of heavy.conf does not exist, create it. In servers.json, include the SAMLurl property, using the same value you copied in Identity Provider Single Sign-On URL. For example:

7) Restart the heavyai_server and heavyai_web_server services.

Auto-Creating Users with SAML

Users can be automatically created in HEAVY.AI based on group membership:

1) Go to the Application Configuration page for the HEAVY.AI application in Okta.

2) On the General tab, scroll to the SAML Settings section and click the Edit button.

3) Click the Next button, and then in the Group Attribute Statements section, set the following:

  • Name: Groups

  • Filter: Set to the desired filter type to determine the set of groups delivered to HEAVY.AI through the SAML response. In the text box next to the Filter type drop-down box, enter the text that defines the filter.

  • Click Next, and then click Finish.

Any group that requires access to HEAVY.AI must be created in HEAVY.AI before users can log in.
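
For example, a sketch that pre-creates a role matching a hypothetical Okta group named MyCompany_Analysts and grants it read access to a hypothetical flights table:

CREATE ROLE MyCompany_Analysts;
GRANT SELECT ON TABLE flights TO MyCompany_Analysts;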

  1. Modify your heavyai.conf file by adding the following parameter:

    The heavyai.conf entries now look like this:

  2. Restart the heavyai_server and heavyai_web_server processes.

Users whose group membership in Okta contains a group name that exists in HeavyDB can log in and have the privileges assigned to their groups.

Creating Users Manually

1) On the Okta website, on the Assignments tab, click Assign > Assign to People.

2) On the Assign HEAVY.AI to People panel, click the Assign button next to users that you want to provide access to HEAVY.AI.

3) Click Save and Go Back to assign HEAVY.AI to the user.

4) Repeat steps 2 and 3 for all users to whom you want to grant access. Click Done when you are finished.

Verifying SAML Configuration

Verify that SAML is configured correctly by opening your Heavy Immerse login page. You should be automatically redirected to the Okta login page, and then back to Immerse, without entering credentials.

When you log out of Immerse, you see the following screen:

Logging out of Immerse does not log you out of Okta. If you log back in to Immerse and are still logged in to Okta, you do not need to reauthenticate.

If authentication fails, you see this error message when you attempt to log in through Okta:

To resolve the authentication error:

  1. Add the license information by either:

    • Adding heavyai.license to your HEAVY.AI data directory.

    • Logging in to HeavyDB and running the following command:

  2. Reattempt login through Okta.

Information about authentication errors can be found in the log files.

Implementing a Secure Binary Interface

Follow these instructions to start a HEAVY.AI server with an encrypted main port.

Required PKI Components

You need the following PKI (Public Key Infrastructure) components to implement a Secure Binary Interface.

  • A CRT (short for certificate) file containing the server's PKI certificate. This file must be shared with the clients that connect using encrypted communications. Ideally, this file is signed by a recognized certificate issuing agency.

  • A key file containing the server's private key. Keep this file secret and secure.

  • A Java TrustStore containing the server's PKI certificate. The password for the trust store is also required.

Although in this instance the trust store contains only information that can be shared, the Java TrustStore program requires it to be password protected.

  • A Java KeyStore and password.

  • In a distributed system, add the configuration parameters to the heavyai.conf file on the aggregator and all leaf nodes in your HeavyDB cluster.

Demonstration Script to Create "Mock/Test" PKI Components

You can use OpenSSL utilities to create the various PKI elements. The server certificate in this instance is self-signed and should not be used in a production system.

  1. Generate a new private key.

  2. Use the private key to generate a certificate signing request.

  3. Self-sign the certificate signing request to create a public certificate.

  4. Use the Java tools to create a key store from the public certificate.

To generate a keystore file from your server key:

  1. Copy server.key to server.txt. Concatenate it with server.crt.

  2. Use server.txt to create a PKCS12 file.

  3. Use server.p12 to create a keystore.

Start the Server in Encrypted Mode with PKI Client Authentication

Start the server using the following options.

Example

Configuring heavyai.conf for Encrypted Connection

Alternatively, you can add the following configuration parameters to heavyai.conf to establish a Secure Binary Interface. The following configuration flags implement the same encryption shown in the runtime example above:

Passwords for the SSL truststore and keystore can be enclosed in single (') or double (") quotes.

Why Use Both server.crt and a Java TrustStore?

The server.crt file and the Java truststore contain the same public key information in different formats. Both are required by the server to establish secure client communication with the various interfaces and with its Calcite server. At startup, the Java truststore is passed to the Calcite server for authentication and to encrypt its traffic with the HEAVY.AI server.

LDAP Integration

HEAVY.AI supports LDAP authentication using an IPA Server or Microsoft Active Directory.

You can configure HEAVY.AI Enterprise edition to map LDAP roles 1-to-1 to HEAVY.AI roles. When you enable this mapping, LDAP becomes the main authority controlling user roles in HEAVY.AI.

LDAP mapping is available only in HEAVY.AI Enterprise edition.

HEAVY.AI supports five configuration settings that allow you to integrate with your LDAP server.

Obtaining Credential Information

To find the ldap-role-query-url and ldap-role-query-regex to use, query your user roles. For example, if there is a user named kiran on the IPA LDAP server ldap://myldapserver.mycompany.com, you could use the following curl command to get the role information:

When successful, it returns information similar to the following:

  • ldap-dn matches the DN, which is uid=kiran,cn=users,cn=accounts,dc=mycompany,dc=com.

  • ldap-role-query-url includes the LDAP URI + the DN + the LDAP attribute that represents the role/group the member belongs to, such as memberOf.

  • ldap-role-query-regex is a regular expression that matches the role names. The matching role names are used to grant and revoke privileges in HEAVY.AI. For example, if we created some roles on an IPA LDAP server where the role names begin with MyCompany_ (for example, MyCompany_Engineering, MyCompany_Sales, MyCompany_SuperUser), the regular expression can filter the role names using MyCompany_.

  • ldap-superuser-role is the role/group name for HEAVY.AI users who are superusers once they log on to the HEAVY.AI database. In this example, the superuser role name is MyCompany_SuperUser.

Make sure that LDAP configuration appears before the [web] section of heavy.conf.

Double quotes are not required for LDAP properties in heavy.conf. For example, both of the following are valid:

ldap-uri = "ldap://myldapserver.mycompany.com" ldap-uri = ldap://myldapserver.mycompany.com

Setting Up LDAP with HEAVY.AI

To integrate LDAP with HEAVY.AI, you need the following:

  • A functional LDAP server, with all users/roles/groups created, and the values to be used by HEAVY.AI for ldap-uri, ldap-dn, ldap-role-query-url, ldap-role-query-regex, and ldap-superuser-role. You can use the curl command to test and find the filters.

  • A functional HEAVY.AI server, version 4.1 or higher.

Once you have your server information, you can configure HEAVY.AI to use LDAP authentication.

  1. Locate the heavy.conf file and edit it to include the LDAP parameter. For example:

  2. Restart the HEAVY.AI server:

  3. Log on to heavysql as a MyCompany user, or as any user who belongs to one of the roles/groups that match the filter.

When you use LDAP authentication, the default admin user and password HyperInteractive do not work unless you create the admin user with the same password on the LDAP server.

If your login fails, inspect $HEAVYAI_STORAGE/mapd_log/heavyai_server.INFO to check for any obvious errors about LDAP authentication.

Once you log in, you can create a new role name in heavysql, and then apply GRANT/REVOKE privileges to the role. Log in as another user with that role and confirm that GRANT/REVOKE works.

If you refresh the browser window, you are required to log in and reauthenticate.

Using LDAPS

To use LDAPS, HEAVY.AI must trust the LDAP server's SSL certificate. To achieve this, you must have the CA for the server's certificate, or the server certificate itself. Install the certificate as a trusted certificate.

IPA on CentOS

To use IPA as your LDAP server with HEAVY.AI running on CentOS 7:

  1. Copy the IPA server CA certificate to your local machine.

  2. Update the PKI certificates.

  3. Edit /etc/openldap/ldap.conf to add the following line.

  4. Locate the heavy.conf file and edit it to include the LDAP parameter. For example:

  5. Restart the HEAVY.AI server:

IPA on Ubuntu

To use IPA as your LDAP server with HEAVY.AI running on Ubuntu:

  1. Copy the IPA server CA certificate to your local machine.

  2. Rename ipa-ca.pem to ipa-ca.crt so that the certificates bundle update script can find it:

  3. Update the PKI certificates:

  4. Edit /etc/openldap/ldap.conf to add the following line:

  5. Locate the heavy.conf file and edit it to include the LDAP parameter. For example:

  6. Restart the HEAVY.AI server:

Active Directory

1. Locate the heavy.conf file and edit it to include the LDAP parameter.

Example 1:

Example 2:

2. Restart the HEAVY.AI server:

Other LDAP user authentication attributes, such as userPrincipalName, are not currently supported.

Kafka

Creating a Topic

Create a sample topic for your Kafka producer.

  1. Run the kafka-topics.sh script with the following arguments:

  2. Create a file named myfile that consists of comma-separated data. For example:

  3. Use heavysql to create a table to store the stream.

Using the Producer

Load your file into the Kafka producer.

  1. Create and start a producer using the following command.

Using the Consumer

Load the data to HeavyDB using the Kafka console consumer and the KafkaImporter program.

  1. Pull the data from Kafka into the KafkaImporter program.

  2. Verify that the data arrived using heavysql.

Distributed Configuration

When installing a distributed cluster, you must run initdb --skip-geo to avoid the automatic creation of the sample geospatial data table. Otherwise, metadata across the cluster falls out of synchronization and can put the server in an unusable state.

HEAVY.AI supports distributed configuration, which allows single queries to span more than one physical host when the scale of the data is too large to fit on a single machine.

In addition to increased capacity, distributed configuration has other advantages:

  • Writes to the database can be distributed across the nodes, thereby speeding up import.

  • Reads from disk are accelerated.

  • Additional GPUs in a distributed cluster can significantly increase read performance in many usage scenarios. Performance scales linearly, or near linearly, with the number of GPUs, for simple queries requiring little communication between servers.

  • Multiple GPUs across the cluster query data on their local hosts. This allows processing of larger datasets, distributed across multiple servers.

HEAVY.AI Distributed Cluster Components

A HEAVY.AI distributed database consists of three components:

  • An aggregator, which is a specialized HeavyDB instance for managing the cluster

  • One or more leaf nodes, each being a complete HeavyDB instance for storing and querying data

  • A String Dictionary Server, which is a centralized repository for all dictionary-encoded items

Conceptually, a HEAVY.AI distributed database is horizontally sharded across n leaf nodes. Each leaf node holds one nth of the total dataset. Sharding currently is round-robin only. Queries and responses are orchestrated by a HEAVY.AI Aggregator server.

The HEAVY.AI Aggregator

Clients interact with the aggregator. The aggregator orchestrates execution of a query across the appropriate leaf nodes. The aggregator composes the steps of the query execution plan to send to each leaf node, and manages their results. The full query execution might require multiple iterations between the aggregator and leaf nodes before returning a result to the client.

A core feature of the HeavyDB is back-end, GPU-based rendering for data-rich charts such as point maps. When running as a distributed cluster, the backend rendering is distributed across all leaf nodes, and the aggregator composes the final image.

String Dictionary Server

The String Dictionary Server manages and allocates IDs for dictionary-encoded fields, ensuring that these IDs are consistent across the entire cluster.

The server creates a new ID for each new encoded value. For queries returning results from encoded fields, the IDs are automatically converted to the original values by the aggregator. Leaf nodes use the string dictionary for processing joins on encoded columns.

For moderately sized configurations, the String Dictionary Server can share a host with a leaf node. For larger clusters, this service can be configured to run on a small, separate CPU-only server.

Replicated Tables

By default, each leaf node holds 1/nth of a table's complete dataset. When you create a table used to provide dimension information, you can improve performance by replicating its contents onto every leaf node using the partitions property. For example:

This reduces the distribution overhead during query execution in cases where sharding is not possible or appropriate. This is most useful for relatively small, heavily used dimension tables.

Data Loading

You can load data to a HEAVY.AI distributed cluster using a COPY FROM statement to load data to the aggregator, exactly as with HEAVY.AI single-node processing. The aggregator distributes data evenly across the leaf nodes.
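
For example, a minimal sketch of loading a delimited file through the aggregator (the flights table and file path are hypothetical):

COPY flights FROM '/heavyai-storage/data/flights_2023.csv' WITH (header='true');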

Data Compression

Records transferred between systems in a HEAVY.AI cluster are compressed to improve performance. HEAVY.AI uses the LZ4_HC compressor by default. It is the fastest compressor, but has the lowest compression rate of the available algorithms. The time required to compress each buffer is directly proportional to the final compressed size of the data. A better compression rate will likely require more time to process.

You can specify another compressor on server startup using the runtime flag compressor. Compressor choices include:

  • blosclz

  • lz4

  • lz4hc

  • snappy

  • zlib

  • zstd

For more information on the compressors used with HEAVY.AI, see also:

  • http://blosc.org/pages/synthetic-benchmarks/

  • https://quixdb.github.io/squash-benchmark/

  • https://lz4.github.io/lz4/

HEAVY.AI does not compress the payload until it reaches a certain size. The default size limit is 512MB. You can change the size using the runtime flag compression-limit-bytes.
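
For example, a sketch that switches the compressor and lowers the compression threshold to 256 MB, assuming these runtime flags are set in heavy.conf like other server options (the values shown are illustrative, not recommendations):

compressor = "zstd"
compression-limit-bytes = 268435456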

HEAVY.AI Distributed Cluster Example

This example uses four GPU-based machines, each with a combination of one or more CPUs and GPUs.

Install HEAVY.AI server on each node. For larger deployments, you can place the installation on a shared drive.

Set up the configuration file for the entire cluster. This file is the same for all nodes.

In the cluster.conf file, the location of each leaf node is identified as well as the location of the String Dictionary server.

Here, dbleaf is a leaf node, and string is the String Dictionary Server. The port each node is listening on is also identified. These ports must match the ports configured on the individual server.

Each leaf node requires a heavy.conf configuration file.

The parameter string-servers identifies the file containing the cluster configuration, to tell the leaf node where the String Dictionary Server is.

The aggregator node requires a slightly different heavy.conf. The file is named heavy-agg.conf in this example.

heavy-agg.conf

The parameter cluster tells the HeavyDB instance that it is an aggregator node, and where to find the rest of its cluster.

If your aggregator node is sharing a machine with a leaf node, there might be a conflict on the calcite-port. Consider changing the port number of the aggregator node to another that is not in use.

Implementing a HEAVY.AI Distributed Cluster

Contact HEAVY.AI support for assistance with HEAVY.AI Distributed Cluster implementation.

Using Heavy Immerse Data Manager

Heavy Immerse supports file upload for .csv, .tsv, and .txt files, and supports comma, tab, and pipe delimiters.

Heavy Immerse also supports upload of compressed delimited files in TAR, ZIP, 7-ZIP, RAR, GZIP, BZIP2, or TGZ format.

You can import data to HeavyDB using the Immerse import wizard. You can upload data from a local delimited file, from an Amazon S3 data source, or from the Data Catalog.

  • If a source file uses a reserved word, Heavy Immerse automatically adds an underscore at the end of the reserved word. For example, year is converted to year_.

  • If you click the Back button (or accidentally two-finger swipe your mousepad) before your data load is complete, HeavyDB stops the data load and any records that had transferred are invalidated.

Importing Non-Geospatial Data from a Local File

Follow these steps to import your data:

  1. Click DATA MANAGER.

  2. Click Import Data.

  3. Click Import data from a local file.

  4. Either click the plus sign (+) or drag your file(s) for upload. If you are uploading multiple files, the column names and data types must match. HEAVY.AI supports only delimiter-separated formats such as CSV and TSV. HEAVY.AI supports Latin-1 ASCII format and UTF-8. If you want to load data with another encoding (for example, UTF-16), convert the data to UTF-8 before loading it to HEAVY.AI. In addition to CSV, TSV, and TXT files, you can import compressed delimited files in TAR, ZIP, 7-ZIP, RAR, GZIP, BZIP2, or TGZ format.

  5. Choose Import Settings:

    • Null string: If, instead of using a blank for null cells in your upload document, you have substituted strings such as NULL, enter that string in the Null String field. The values are treated as null values on upload.

    • Delimiter Type: Delimiters are detected automatically. You can choose a specific delimiter, such as a comma, tab, or pipe.

    • Quoted String: Indicate whether your string fields are enclosed by quotes. Delimiter characters inside quotes are ignored.

    • Includes Header Row: HEAVY.AI tries to infer whether the first row contains headers or data (for example, if the first row has only strings and the rest of the table contains number values, the first row is inferred to be headers). If HEAVY.AI infers incorrectly, you have the option of manually indicating whether or not the first row contains headers.

  6. Click Import Files.

  7. The Table Preview screen presents sample rows of imported data. The importer assigns a data type based on sampling, but you should examine and modify the selections as appropriate. Assign the correct data type to ensure optimal performance. Immerse defaults to second precision for all timestamp columns. You can reset the precision to second, millisecond, nanosecond, or microsecond. If your column headers contain SQL reserved words, reserved characters (for example, year, /, or #), or spaces, the importer alters the characters to make them safe and notifies you of the changes. You can also change the column labels.

  8. Name the table, and click Save Table.

Importing Data from Amazon S3

To import data from your Amazon S3 instance, you need:

  • The Region and Path for the file in your S3 bucket, or the direct URL to the file (S3 Link).

  • If importing private data, your Access Key and Secret Key for your personal IAM account in S3.

Locating the Data File S3 Region, Path, and URL

In an S3 bucket, the Region is in the upper-right corner of the screen – US West (N. California) in this case:

Click the file you want to import. To load your S3 file to HEAVY.AI using the steps for S3 Region | Bucket | Path, below, click Copy path to copy to your clipboard the path to your file within your S3 bucket. Alternatively, you can copy the link to your file. The Link in this example is https://s3-us-west-1.amazonaws.com/my-company-bucket/trip_data.7z.

Obtaining Your S3 Access Key and Secret Key

If the data you want to copy is publicly available, you do not need to provide an Access Key and Secret Key.

You can import any file you can see using your IAM account with your Access Key and Secret Key.

Your Secret Key is created with your Access Key, and cannot be retrieved afterward. If you lose your Secret Key, you must create a new Access Key and Secret Key.

Loading Your S3 Data to HEAVY.AI

Follow these steps to import your S3 data:

  1. Click DATA MANAGER.

  2. Click Import Data.

  3. Click Import data from Amazon S3.

  4. Choose whether to import using the S3 Region | Bucket | Path or a direct full link URL to the file (S3 Link).

    1. To import data using S3 Region | Bucket | Path:

      1. Select your Region from the pop-up menu.

      2. Enter the unique name of your S3 Bucket.

      3. Enter or paste the Path to the file stored in your S3 bucket.

    2. To import data using S3 link:

      1. Copy the Link URL from the file Overview in your S3 bucket.

      2. Paste the link in the Full Link URL field of the HEAVY.AI Table Importer.

  5. If the data is publicly available, you can disable the Private Data checkbox. If you are importing Private Data, enter your credentials:

    1. Enable the Private Data checkbox.

    2. Enter your S3 Access Key.

    3. Enter your S3 Secret Key.

  6. Choose the appropriate Import Settings. HEAVY.AI supports only delimiter-separated formats such as CSV and TSV.

    1. Null string: If you have substituted a string such as NULL for null values in your upload document, enter that string in the Null String field. The values are treated as null values on upload.

    2. Delimiter Type: Delimiters are detected automatically. You can choose a specific delimiter, such as a comma or pipe.

    3. Includes Header Row: HEAVY.AI tries to infer whether the first row contains headers or data (for example, if the first row has only strings and the rest of the table contains number values, the first row is inferred to be headers). If HEAVY.AI infers incorrectly, you have the option of manually indicating whether or not the first row contains headers.

    4. Quoted String: Indicate whether your string fields are enclosed by quotes. Delimiter characters inside quotes are ignored.

  7. Click Import Files.

  8. The Table Preview screen presents sample rows of imported data. The importer assigns a data type based on sampling, but you should examine and modify the selections as appropriate. Assign the correct data type to ensure optimal performance. If your column headers contain SQL reserved words, reserved characters (for example, year, /, or #), or spaces, the importer alters the characters to make them safe and notifies you of the changes. You can also change the column labels.

  9. Name the table, and click Save Table.

Importing from the Data Catalog

The Data Catalog provides access to sample datasets you can use to exercise data visualization features in Heavy Immerse. The selection of datasets continually changes, independent of product releases.

To import from the data catalog:

  1. Open the Data Manager.

  2. Click Data Catalog.

  3. Use the Search box to locate a specific data set, or scroll to find the dataset you want to use. The Contains Geo toggle filters for data sets that contain Geographical information.

  4. Click the Import button beneath the dataset you want to use.

  5. Verify the table and column names in the Data Preview screen.

  6. Click Import Data.

Appending Data to a Table

You can append additional data to an existing table.

To append data to a table:

  1. Open Data Manager.

  2. Select the table you want to append.

  3. Click Append Data.

  4. Click Import data from a local file.

  5. Either click the plus sign (+) or drag your file(s) for upload. The column names and data types of the files you select must match the existing table. HEAVY.AI supports only delimiter-separated formats such as CSV and TSV. HEAVY.AI supports Latin-1 ASCII format and UTF-8. If you want to load data with another encoding (for example, UTF-16), convert the data to UTF-8 before loading it to HEAVY.AI. In addition to CSV, TSV, and TXT files, you can import compressed delimited files in TAR, ZIP, 7-ZIP, RAR, GZIP, BZIP2, or TGZ format.

  6. Click Preview.

  7. Click Import Settings

  8. Choose Import Settings:

    • Null string: If, instead of using a blank for null cells in your upload document, you have substituted strings such as NULL, enter that string in the Null String field. The values are treated as null values on upload.

    • Delimiter Type: Delimiters are detected automatically. You can choose a specific delimiter, such as a comma, tab, or pipe.

    • Quoted String: Indicate whether your string fields are enclosed by quotes. Delimiter characters inside quotes are ignored.

    • Includes Header Row: HEAVY.AI tries to infer whether the first row contains headers or data (for example, if the first row has only strings and the rest of the table contains number values, the first row is inferred to be headers). If HEAVY.AI infers incorrectly, you have the option of manually indicating whether or not the first row contains headers.

  9. Close Import Settings.

  10. The Data Preview screen presents sample rows of imported data. The importer assigns a data type based on sampling, but you should examine and modify the selections as appropriate. Assign the correct data type to ensure optimal performance.

    If your data contains column headers, verify they match the existing headers.

  11. Click Import Data.

Truncating a Table

Sometimes you might want to remove or replace the data in a table without losing the table definition itself.

To remove all data from a table:

  1. Open Data Manager.

  2. Select the table you want to truncate.

  3. Click Delete All Rows.

  4. A very scary red dialog box reminds you that the operation cannot be undone. Click DELETE TABLE ROWS.

    Immerse displays the table information with a row count of 0.

Deleting a Table

You can drop a table entirely using Data Manager.

To delete a table:

  1. Open Data Manager.

  2. Select the table you want to delete.

  3. Click DELETE TABLE.

  4. A very scary red dialog box reminds you that the operation cannot be undone. Click DELETE TABLE.

    Immerse deletes the table and returns you to the Data Manager TABLES list.

Upgrade paths:

  • OmniSci releases earlier than 5.5 to HEAVY.AI 7.0: upgrade to 5.5 first, then upgrade to 7.0.

  • OmniSci 5.5 - 5.10 to HEAVY.AI 7.0: upgrade directly to 7.0.

  • HEAVY.AI 6.0 to HEAVY.AI 7.0: upgrade directly to 7.0.

In some situations, you might not be able to upgrade NVIDIA CUDA drivers on a regular basis. To work around this issue, NVIDIA provides compatibility drivers that allow users to use newer features without requiring a full upgrade. For information about compatibility drivers, see https://docs.nvidia.com/deploy/cuda-compatibility/index.html.

If the version of OmniSci is older than 5.5, an intermediate upgrade step to the 5.5 version is needed. Check the docs on how to do the upgrade.

Install the HEAVY.AI software, following all the instructions for your operating system.

In addition, systemd manages the open-file limit in Linux. Some cloud providers and distributions set this limit too low, which can result in errors as your HEAVY.AI environment and usage grow. For more information about adjusting the limits on open files, see Why am I seeing the error "Too many open files...erno24" in the Troubleshooting and Monitoring Solutions section of our knowledgebase.

You can customize the behavior of your HEAVY.AI servers by modifying your heavy.conf configuration file. See Configuration Parameters.

Enables all join queries to fall back to the loop join implementation. During a loop join, queries loop over all rows from all tables involved in the join, and evaluate the join condition. By default, loop joins are only allowed if the number of rows in the inner table is fewer than the trivial-loop-join-threshold, since loop joins are computationally expensive and run for an extended period. Modifying the trivial-loop-join-threshold is a safer alternative to globally enabling loop joins. You might choose to globally enable loop joins when you have many small tables for which loop join performance has been determined to be acceptable but modifying the trivial join loop threshold would be tedious.

Path to file containing HEAVY.AI queries. Use a query list to autoload data to GPU memory on startup to speed performance. See Preloading Data.

Enable filter push-down through joins. Evaluates filters in the query expression for selectivity and pushes down highly selective filters into the join according to selectivity parameters. See also What is Predicate Pushdown?

Enables runtime query interrupt. Setting to TRUE can reduce performance slightly. Use with runtime-query-interrupt-frequency to set the interrupt frequency.

Symbolic link to the active log. Creates a symbolic link for every severity greater than or equal to the log-severity configuration option.

Number of GPUs to use. In a shared environment, you can assign the number of GPUs to a particular application. The default, -1, uses all available GPUs. Use in conjunction with start-gpu.

Path to the file containing trusted CA certificates; for PKI authentication. Used to validate certificates submitted by clients. If the certificate provided by the client (in the password field of the connect command) was not signed by one of the certificates in the trusted file, then the connection fails. PKI authentication works only if the server is configured to encrypt connections via TLS. The common name extracted from the client certificate is used as the name of the user to connect. If this name does not already exist, the connection fails. If LDAP or SAML are also enabled, the servers fall back to these authentication methods if PKI authentication fails. Currently works only with JDBC clients. To allow connection from other clients, set allow-local-auth-fallback or add LDAP/SAML authentication.

First GPU to use. Used in shared environments in which the first assigned GPU is not GPU 0. Use in conjunction with num-gpus.

Compressor algorithm to be used by the server to compress data being transferred between servers. See Data Compression for compression algorithm options.

Enable a new port that heavy_web_server listens on for incoming HTTP requests. When received, it returns a redirect response to the HTTPS port and protocol, so that browsers are immediately and transparently redirected. Use this to provide a HEAVY.AI front end that can run on both the HTTP protocol (http://my-heavyai-frontend.com) on default HTTP port 80, and on the primary HTTPS protocol (https://my-heavyai-frontend.com) on default HTTPS port 443, and have requests to the HTTP protocol automatically redirected to HTTPS. Without this, requests to HTTP fail. Assuming heavy_web_server can attach to ports below 1024, the configuration would be: enable-https-redirect = TRUE and http-to-https-redirect-port = 80.

Configures the HTTP (incoming) port used by enable-https-redirect. The port option specifies the redirect port number. Use this to provide a HEAVY.AI front end that can run on both the HTTP protocol (http://my-heavyai-frontend.com) on default HTTP port 80, and on the primary HTTPS protocol (https://my-heavyai-frontend.com) on default HTTPS port 443, and have requests to the HTTP protocol automatically redirected to HTTPS. Without this, requests to HTTP fail. Assuming heavy_web_server can attach to ports below 1024, the configuration would be: enable-https-redirect = TRUE and http-to-https-redirect-port = 80.

Generate a 128- or 256-bit encryption key and save it to a file. You can use https://acte.ltd/utils/randomkeygen to generate a suitable encryption key.

Generate encrypted credentials for a custom application by running the following Go program, replacing the example key and credentials strings with an actual key and actual credentials. You can also run the program in a web browser at https://play.golang.org/p/nNBsZ8dhqr0.

These instructions use Okta as the IdP and HEAVY.AI as the SP in an SP-initiated workflow, similar to the following:

Begin by adding your SAML application in Okta. If you do not have an Okta account, you can sign up on the Okta web page.

Single sign on URL: Your Heavy Immerse web URL with the suffix saml-post; for example, https://tonysingle.com:6273/saml-post. Select the Use this for Recipient URL and Destination URL checkbox.

User accounts assigned to the HEAVY.AI application in Okta must exist in HEAVY.AI before a user can log in. To have users created automatically based on their group membership, see Auto-Creating Users with SAML.

Apache Kafka is a distributed streaming platform. It allows you to create publishers, which create data streams, and consumers, which subscribe to and ingest the data streams produced by publishers.

You can use the HeavyDB KafkaImporter C++ program to consume a topic created by running Kafka shell scripts from the command line. Follow the procedure below to use a Kafka producer to send data, and a Kafka consumer to store the data, in HeavyDB.

This example assumes you have already installed and configured Apache Kafka. See the Kafka website.

For methods specific to geospatial data, see also Importing Geospatial Data Using Immerse.

If there is a potential for duplicate entries, and you prefer to avoid loading duplicate rows, see How can I avoid creating duplicate rows?

Replicate Table: If you are importing non-geospatial data to a distributed database with more than one node, select this checkbox to replicate the table to all nodes in the cluster. This effectively adds the PARTITIONS='REPLICATED' option to the create table statement. See Replicated Tables.

You can also import locally stored shape files in a variety of formats. See Importing Geospatial Data Using Immerse.

For information on opening and reviewing items in your S3 instance, see https://docs.aws.amazon.com/AmazonS3/latest/gsg/OpeningAnObject.html.

To learn about creating your S3 Access Key and Secret Key, see https://docs.aws.amazon.com/IAM/latest/UserGuide/id_credentials_access-keys.html#Using_CreateAccessKey.

Replicate Table: If you are importing non-geospatial data to a distributed database with more than one node, select this checkbox to replicate the table to all nodes in the cluster. This effectively adds the PARTITIONS='REPLICATED' option to the create table statement. See Replicated Tables.

To append data from AWS, click Append Data, then follow the instructions for Loading S3 Data to HEAVY.AI.

[web]
encryption-key-file-path = "path/to/file"
package main

import (
    "crypto/aes"
    "crypto/cipher"
    "crypto/rand"
    
    "fmt"
    "io")
    
// 1. Replace example key with encryption string
var key = "v9y$B&E(H+MbQeThWmZq4t7w!z%C*F-J"

// 2. Replace strings "username", "password", "dbName" with credentials
var stringsToBeEncrypted = []string{
    "username",
    "password",
    "dbName",
}

// 3. Run program to see encrypted credentials in console
func main() {
    for i := range stringsToBeEncrypted {
        encrypted, err := EncryptString(stringsToBeEncrypted[i])
        if err != nil {
            panic(err)
        }
        fmt.Printf("%s => %s\n", stringsToBeEncrypted[i],encrypted)
    }
}

func EncryptString(str string) (encrypted string,err error) {
    keyBytes := []byte(key)
    
    block, err := aes.NewCipher(keyBytes)
    if err != nil {
        panic(err.Error())
    }
    aesGCM, err := cipher.NewGCM(block)
    if err != nil {
        panic(err.Error())
    }
    nonce := make([]byte, aesGCM.NonceSize())
    if _, err = io.ReadFull(rand.Reader, nonce); err!= nil {
        panic(err.Error())
    }
    strBytes := []byte(str)
    
    cipherBytes := aesGCM.Seal(nonce, nonce, strBytes,nil)
    
    return fmt.Sprintf("%x", cipherBytes), err
}
saml-metadata-file = "/heavyai-storage/idp.xml"
saml-sp-target-url = "https://tonysingle.com:6273/saml-post"
saml-signed-assertion = true
saml-signed-response = true
[web]
enable-https = true
cert = "/heavyai-storage/ssl/server.crt"
key = "/heavyai-storage/ssl/server.key"
servers-json = "/heavyai-storage/servers.json"
https://heavyai-tony.okta.com/app/heavyaiorg969324_heavyai_2/exk1p0m4blWiBsFiU357/sso/saml
 [
  {
    "enableJupyter": true,
     "url": "tonysingle.com",
     "port": "6273",
    "SAMLurl":"https://heavyai-tony.okta.com/app/heavyaiorg969324_heavyai_2/exk1p0m4blWiBsFiU357/sso/saml"
  }
]
saml-sync-roles = true
saml-metadata-file = "/heavyai-storage/idp.xml"
saml-sp-target-url = "https://tonysingle.com:6273/saml-post"
saml-sync-roles = true
heavysql> \set_license
openssl genrsa -out server.key 2048
openssl req -new -key server.key -out server.csr
openssl x509 -req -days 365 -in server.csr -signkey server.key -out server.crt
keytool -importcert  -file server.crt -keystore server.jks
cp server.key server.txt
cat server.crt >> server.txt
openssl pkcs12 -export -in server.txt -out server.p12
keytool -importkeystore -v -srckeystore server.p12  -srcstoretype PKCS12 -destkeystore keystore.jks -deststoretype pkcs12
--pki-db-client-auth true
--ssl-cert 
--ssl-private-key 
--ssl-trust-store 
--ssl-trust-password 
--ssl-keystore 
--ssl-keystore-password 
--ssl-trust-ca 
--ssl-trust-ca-server 
sudo start heavyai_server --port 6274 --data /data --pki-db-client-auth true  
--ssl-cert /tls_certs/self_signed_server.example.com_self_signed/self_signed_server.example.com.pem 
--ssl-private-key /tls_certs/self_signed_server.example.com_self_signed/private/self_signed_server.example.com_key.pem 
--ssl-trust-store /tls_certs/self_signed_server.example.com_self_signed/trust_store_self_signed_server.example.com.jks 
--ssl-trust-password truststore_password 
--ssl-keystore /tls_certs/self_signed_server.example.com_self_signed/key_store_self_signed_server.example.com.jks 
--ssl-keystore-password keystore_password 
--ssl-trust-ca /tls_certs/self_signed_server.example.com_self_signed/self_signed_server.example.com.pem 
--ssl-trust-ca-server /tls_certs/ca_primary/ca_primary_cert.pem
# Start pki authentication 
pki-db-client-auth = true 
ssl-cert = "/tls_certs/self_signed_server.example.com_self_signed/self_signed_server.example.com.pem" 
ssl-private-key = "/tls_certs/self_signed_server.example.com_self_signed/private/self_signed_server.example.com_key.pem" 
ssl-trust-store = "/tls_certs/self_signed_server.example.com_self_signed/trust_store_self_signed_server.example.com.jks" 
ssl-trust-password = "truststore_password"  
ssl-keystore = "/tls_certs/self_signed_server.example.com_self_signed/key_store_self_signed_server.example.com.jks" 
ssl-keystore-password = "keystore_password" 
ssl-trust-ca = "/tls_certs/self_signed_server.example.com_self_signed/self_signed_server.example.com.pem" 
ssl-trust-ca-server = "/tls_certs/ca_primary/ca_primary_cert.pem" 

Parameter: ldap-uri
Description: LDAP server host or server URI.
Example: ldap://myLdapServer.myCompany.com

Parameter: ldap-dn
Description: LDAP distinguished name (DN).
Example: uid=$USERNAME,cn=users,cn=accounts,dc=myCompany,dc=com

Parameter: ldap-role-query-url
Description: Returns the role names a user belongs to in the LDAP.
Example: ldap://myServer.myCompany.com/uid=$USERNAME,cn=users,cn=accounts,dc=myCompany,dc=com?memberOf

Parameter: ldap-role-query-regex
Description: Applies a regex filter to find matching roles from the roles in the LDAP server.
Example: (MyCompany_.*?),

Parameter: ldap-superuser-role
Description: Identifies one of the filtered roles as a superuser role. If a user has this filtered LDAP role, the user is marked as a superuser.
Example: MyCompany_SuperUser

$ curl --user "uid=kiran,cn=users,cn=accounts,dc=mycompany,dc=com" 
"ldap://myldapserver.mycompany.com/uid=kiran,cn=users,cn=accounts,dc=mycompany,dc=com?memberOf"
DN: uid=kiran,cn=users,cn=accounts,dc=mycompany,dc=com
memberOf: cn=ipausers,cn=groups,cn=accounts,dc=mycompany,dc=com
memberOf: cn=MyCompany_SuperUser,cn=roles,cn=accounts,dc=mycompany,dc=com
memberOf: cn=test,cn=groups,cn=accounts,dc=mycompany,dc=com
ldap-uri = "ldap://myldapserver.mycompany.com"
ldap-dn = "uid=$USERNAME,cn=users,cn=accounts,dc=mycompany,dc=com"
ldap-role-query-url = "ldap://myldapserver.mycompany.com/uid=$USERNAME,cn=users,cn=accounts,dc=mycompany,dc=com?memberOf"
ldap-role-query-regex = "(MyCompany_.*?),"
ldap-superuser-role = "MyCompany_SuperUser"
sudo systemctl restart heavyai_server
sudo systemctl restart heavyai_web_server
scp root@myldapserver:/etc/ipa/ca.crt /etc/pki/ca-trust/source/anchors/ipa-ca.pem
update-ca-trust
TLS_CACERT      /etc/pki/tls/certs/ca-bundle.crt
ldap-uri = "ldaps://myldapserver.mycompany.com"
ldap-dn = "uid=$USERNAME,cn=users,cn=accounts,dc=mycompany,dc=com"
ldap-role-query-url = "ldaps://myldapserver.mycompany.com/uid=$USERNAME,cn=users,cn=accounts,dc=mycompany,dc=com?memberOf"
ldap-role-query-regex = "(MyCompany_.*?),"
ldap-superuser-role = "MyCompany_SuperUser"
sudo systemctl restart heavyaidb
sudo systemctl restart heavyai_web_server
mkdir /usr/local/share/ca-certificates/ipa
scp root@myldapserver:/etc/ipa/ca.crt /usr/local/share/ca-certificates/ipa/ipa-ca.pem
mv /usr/local/share/ca-certificates/ipa/ipa-ca.pem /usr/local/share/ca-certificates/ipa/ipa-ca.crt
update-ca-certificates
TLS_CACERT      /etc/ssl/certs/ca-certificates.crt
ldap-uri = "ldaps://myldapserver.mycompany.com"
ldap-dn = "uid=$USERNAME,cn=users,cn=accounts,dc=mycompany,dc=com"
ldap-role-query-url = "ldaps://myldapserver.mycompany.com/uid=$USERNAME,cn=users,cn=accounts,dc=mycompany,dc=com?memberOf"
ldap-role-query-regex = "(MyCompany_.*?),"
ldap-superuser-role = "MyCompany_SuperUser"
sudo systemctl restart heavydb
sudo systemctl restart heavyai_web_server
ldap-uri = "ldap://myldapserver.mycompany.com"
ldap-dn = "cn=$USERNAME,cn=users,dc=qa-mycompany,dc=com"
ldap-role-query-url = "ldap:///myldapserver.mycompany.com/cn=$USERNAME,cn=users,dc=qa-mycompany,dc=com?memberOf"
ldap-role-query-regex = "(HEAVYAI_.*?),"
ldap-superuser-role = "HEAVYAI_SuperUser"
ldap-uri = "ldap://myldapserver.mycompany.com"
ldap-dn = "$USERNAME@mycompany.com"
ldap-role-query-url = "ldap:///myldapserver.mycompany.com/OU=MyCompany Users,dc=MyCompany,DC=com?memberOf?sub?(sAMAccountName=$USERNAME)"
ldap-role-query-regex = "(HEAVYAI_.*?),"
ldap-superuser-role = "HEAVYAI_SuperUser"
sudo systemctl restart heavyai_server
sudo systemctl restart heavyai_web_server
bin/kafka-topics.sh --create --zookeeper localhost:2181 --replication-factor 1
--partitions 1 --topic matstream
michael,1
andrew,2
ralph,3
sandhya,4
create table stream1(name text, id int);
cat myfile | bin/kafka-console-producer.sh --broker-list localhost:9097
--topic matstream
/home/heavyai/build/bin/KafkaImporter stream1 heavyai -p HyperInteractive -u heavyai --port 6274 --batch 1 --brokers localhost:6283  
--topic matstream --group-id 1

Field Delimiter: ,
Line Delimiter: \n
Null String: \N
Insert Batch Size: 1
1 Rows Inserted, 0 rows skipped.
2 Rows Inserted, 0 rows skipped.
3 Rows Inserted, 0 rows skipped.
4 Rows Inserted, 0 rows skipped.
heavysql> select * from stream1;
name|id
michael|1
andrew|2
ralph|3
sandhya|4
CREATE TABLE flights … WITH (PARTITIONS='REPLICATED')

Hostname    IP            Role(s)
Node1       10.10.10.1    Leaf, Aggregator
Node2       10.10.10.2    Leaf, String Dictionary Server
Node3       10.10.10.3    Leaf
Node4       10.10.10.4    Leaf

[
  {
    "host": "node1",
    "port": 16274,
    "role": "dbleaf"
  },
  {
    "host": "node2",
    "port": 16274,
    "role": "dbleaf"
  },
 {
    "host": "node3",
    "port": 16274,
    "role": "dbleaf"
  },
  {
    "host": "node4",
    "port": 16274,
    "role": "dbleaf"
  },

  {
    "host": "node2",
    "port": 6277,
    "role": "string"
  }
]
port = 16274
http-port = 16278
calcite-port = 16279
data = "<location>/heavyai-storage/nodeLocal/data"
read-only = false
string-servers = "<location>/heavyai-storage/cluster.conf"
port = 6274
http-port = 6278
calcite-port = 6279
data = "<location>/heavyai-storage/nodeLocalAggregator/data"
read-only = false
num-gpus = 1
cluster = "<location>/heavyai-storage/cluster.conf"

[web]
port = 6273
frontend = "<location>/prod/heavyai/frontend"

Data Definition (DDL)

SQL Capabilities

Policies

You can use policies to provide row-level security (RLS) in HEAVY.AI.

CREATE POLICY

CREATE POLICY ON COLUMN table.column TO <name> VALUES ('string', 123, ...);

Create an RLS policy for a user or role (<name>); admin rights are required. All queries on the table for the user or role are automatically filtered to include only rows where the column contains any one of the values from the VALUES clause.

RLS filtering works as if a WHERE column = value clause were appended to every query or subquery on the table. If policies on multiple columns in the same table are defined for a user or role, then a row is visible to that user or role if any one or more of the policies matches that row.
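
For example, the following sketch (assuming a payroll table with a region column and an existing user or role named analyst) restricts analyst to rows whose region value is 'EMEA' or 'APAC':

CREATE POLICY ON COLUMN payroll.region TO analyst VALUES ('EMEA', 'APAC');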

DROP POLICY

DROP POLICY ON COLUMN table.column FROM <name>;

Drop an RLS policy for a user or role (<name>); admin rights are required. All values specified for the column by the policy are dropped. Effective values from another policy on an inherited role are not dropped.
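
For example, to drop the policy on payroll.region sketched above for the analyst user or role:

DROP POLICY ON COLUMN payroll.region FROM analyst;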

SHOW POLICIES

SHOW [EFFECTIVE] POLICIES <name>;

Displays a list of all RLS policies that exist for a user or role. If EFFECTIVE is used, the list also includes any policies that exist for all roles that apply to the requested user or role.
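
For example, for the hypothetical analyst user:

SHOW POLICIES analyst;
SHOW EFFECTIVE POLICIES analyst;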

ALTER SYSTEM CLEAR

Clear CPU, GPU, or RENDER memory. Available to super users only.

ALTER SYSTEM CLEAR (CPU|GPU|RENDER) MEMORY

Examples

ALTER SYSTEM CLEAR CPU MEMORY
ALTER SYSTEM CLEAR GPU MEMORY
ALTER SYSTEM CLEAR RENDER MEMORY

Generally, the server handles memory management, and you do not need to use this command. If you are having unexpected memory issues, try clearing the memory to see if performance improves.

DELETE

Deletes rows that satisfy the WHERE clause from the specified table. If the WHERE clause is absent, all rows in the table are deleted, resulting in a valid but empty table.

DELETE FROM table_name [ * ] [ [ AS ] alias ]
[ WHERE condition ]

Cross-Database Queries

In Release 6.4 and higher, you can run DELETE queries across tables in different databases on the same HEAVY.AI cluster without having to first connect to those databases.

To execute queries against another database, you must have ACCESS privilege on that database, as well as DELETE privilege.

Example

Delete rows from a table in the my_other_db database:

DELETE FROM my_other_db.customers WHERE id > 100;

Data Manipulation (DML)

ALTER SESSION SET

Change a parameter value for the current session.

ALTER SESSION SET <parameter_name>=<parameter_value>
Parameter name
Values

EXECUTOR_DEVICE

CPU - Set the session to CPU execution mode:

ALTER SESSION SET EXECUTOR_DEVICE='CPU';

GPU - Set the session to GPU execution mode:

ALTER SESSION SET EXECUTOR_DEVICE='GPU';

NOTE: These parameter values have the same effect as the \cpu and \gpu commands in heavysql, but can be used with any tool capable of running SQL commands.

CURRENT_DATABASE

Can be set to any string value.

If the value is a valid database name, and the current user has access to it, the session switches to the new database. If the user does not have access or the database does not exist, an error is returned and the session will fall back to the starting database.

Alter Session Examples

CURRENT_DATABASE

Switch to another database without needing to log in again.

ALTER SESSION SET CURRENT_DATABASE='owned_database'; 

Your session will silently switch to the requested database.

The database exists, but the user does not have access to it:

ALTER SESSION SET CURRENT_DATABASE='information_schema';
TException - service has thrown: TDBException(error_msg=Unauthorized access: 
user test is not allowed to access database information_schema.)

The database does not exist:

ALTER SESSION SET CURRENT_DATABASE='not_existent_db'; 
TException - service has thrown: TDBException(error_msg=Database name 
not_existent_db does not exist.)

EXECUTOR_DEVICE

Force the session to run the subsequent SQL commands in CPU mode:

ALTER SESSION SET EXECUTOR_DEVICE='CPU';

Switch the session back to GPU execution mode:

ALTER SESSION SET EXECUTOR_DEVICE='GPU';
CREATE USER

Roles and Privileges

HEAVY.AI supports data security using a set of database object access privileges granted to users or roles.

Users and Privileges

When you create a database, the admin superuser is created by default. The admin superuser is granted all privileges on all database objects. Superusers can create new users that, by default, have no database object privileges.

Superusers can grant users selective access privileges on multiple database objects using two mechanisms: role-based privileges and user-based privileges.

Role-based Privileges

  1. Grant roles access privileges on database objects.

  2. Grant roles to users.

  3. Grant roles to other roles.

User-based Privileges

When a user has privilege requirements that differ from role privileges, you can grant privileges directly to the user. Together, these mechanisms provide data security for many users and classes of users accessing the database.

You have the following options for granting privileges:

  • Each object privilege can be granted to one or many roles, or to one or many users.

  • A role and/or user can be granted privileges on one or many objects.

  • A role can be granted to one or many users or other roles.

  • A user can be granted one or many roles.

This supports the following many-to-many relationships:

  • Objects and roles

  • Objects and users

  • Roles and users

These relationships provide flexibility and convenience when granting/revoking privileges to and from users.

Granting object privileges to roles and users, and granting roles to users, has a cumulative effect. The result of several grant commands is a combination of all individual grant commands. This applies to all database object types and to privileges inherited by objects. For example, object privileges granted on a database-type object are propagated to all table-type objects of that database.

Who Can Grant Object Privileges?

Only a superuser or an object owner can grant privileges on an object.

  • A superuser has all privileges on all database objects.

  • A non-superuser user has only those privileges on a database object that are granted by a superuser.

  • A non-superuser user has ALL privileges on a table created by that user.

Roles and Privileges Persistence

  • Roles can be created and dropped at any time.

  • Object privileges and roles can be granted or revoked at any time, and the action takes effect immediately.

  • Privilege state is persistent and restored if the HEAVY.AI session is interrupted.

Database Object Privileges

There are five database object types, each with its own privileges.

Database privileges:

ACCESS - Connect to the database. The ACCESS privilege is a prerequisite for all other privileges at the database level. Without the ACCESS privilege, a user or role cannot perform tasks on any other database objects.

ALL - Allow all privileges on this database except issuing grants and dropping the database.

SELECT, INSERT, TRUNCATE, UPDATE, DELETE - Allow these operations on any table in the database.

ALTER SERVER - Alter servers in the current database.

CREATE SERVER - Create servers in the current database.

CREATE TABLE - Create a table in the current database. (Also CREATE.)

CREATE VIEW - Create a view for the current database.

CREATE DASHBOARD - Create a dashboard for the current database.

DELETE DASHBOARD - Delete a dashboard for this database.

DROP SERVER - Drop servers from the current database.

DROP - Drop a table from the database.

DROP VIEW - Drop a view for this database.

EDIT DASHBOARD - Edit a dashboard for this database.

SELECT VIEW - Select a view for this database.

SERVER USAGE - Use servers (through foreign tables) in the current database.

VIEW DASHBOARD - View a dashboard for this database.

VIEW SQL EDITOR - Access the SQL Editor in Immerse for this database.


Table privileges:

SELECT, INSERT, TRUNCATE, UPDATE, DELETE - Allow these SQL statements on this table.

DROP - Drop this table.


View privileges:

SELECT - Select from this view. Users do not need privileges on objects referenced by this view.

DROP - Drop this view.

Users with SELECT privilege on views do not require SELECT privilege on underlying tables referenced by the view to retrieve the data queried by the view. View queries work without error whether or not users have direct access to referenced tables. This also applies to views that query tables in other databases.

To create views, users must have SELECT privilege on queried tables in addition to the CREATE VIEW privilege.

Dashboard privileges:

VIEW - View this dashboard.

EDIT - Edit this dashboard.

DELETE - Delete this dashboard.

Server privileges:

DROP - Drop this server from the current database.

ALTER - Alter this server in the current database.

USAGE - Use this server (through foreign tables) in the current database.

Privileges granted on a database-type object are inherited by all tables of that database.

Privilege Commands

The CREATE ROLE, DROP ROLE, GRANT, and REVOKE commands described in this section support the following operations:

  • Create role.

  • Drop role.

  • Grant role to user or to another role.

  • Revoke role from user or from another role.

  • Grant role privilege(s) on a database table to a role or user.

  • Revoke role privilege(s) on a database table from a role or user.

  • Grant role privilege(s) on a database view to a role or user.

  • Revoke role privilege(s) on a database view from a role or user.

  • Grant role privilege(s) on a database to a role or user.

  • Revoke role privilege(s) on a database from a role or user.

  • Grant role privilege(s) on a server to a role or user.

  • Revoke role privilege(s) on a server from a role or user.

  • Grant role privilege(s) on a dashboard to a role or user.

  • Revoke role privilege(s) on a dashboard from a role or user.

Example

The following example shows a valid sequence for granting access privileges to non-superuser user1 by granting a role to user1 and by directly granting a privilege. This example presumes that table1 and user1 already exist, and that user1 has ACCESS privileges on the database where table1 exists.

  1. Create the r_select role.

    CREATE ROLE r_select;
  2. Grant the SELECT privilege on table1 to the r_select role. Any user granted the r_select role gains the SELECT privilege.

    GRANT SELECT ON TABLE table1 TO r_select;
  3. Grant the r_select role to user1, giving user1 the SELECT privilege on table1.

    GRANT r_select TO user1;
  4. Directly grant user1 the INSERT privilege on table1.

    GRANT INSERT ON TABLE table1 TO user1;

CREATE ROLE

Create a role. Roles are granted to users for role-based database object access.

This clause requires superuser privilege and <roleName> must not exist.

Synopsis

CREATE ROLE <roleName>;

Parameters

<roleName>

Name of the role to create.

Example

Create a payroll department role called payrollDept.

CREATE ROLE payrollDept;


DROP ROLE

Remove a role.

This clause requires superuser privilege and <roleName> must exist.

Synopsis

DROP ROLE [IF EXISTS] <roleName>;

Parameters

<roleName>

Name of the role to drop.

Example

Remove the payrollDept role.

DROP ROLE payrollDept;


GRANT

Grant role privileges to users and to other roles.

The ACCESS privilege is a prerequisite for all other privileges at the database level. Without the ACCESS privilege, a user or role cannot perform tasks on any other database objects.

This clause requires superuser privilege. The specified <roleNames> and <userNames> must exist.

Synopsis

GRANT <roleNames> TO <userNames>, <roleNames>;

Parameters

<roleNames>

Names of roles to grant to users and other roles. Use commas to separate multiple role names.

<userNames>

Names of users. Use commas to separate multiple user names.

Examples

Assign payrollDept role privileges to user dennis.

GRANT payrollDept TO dennis;

Grant payrollDept and accountsPayableDept role privileges to users dennis and mike and role hrDept.

GRANT payrollDept, accountsPayableDept TO dennis, mike, hrDept;


REVOKE

Remove role privilege from users or from other roles. This removes database object access privileges granted with the role.

This clause requires superuser privilege. The specified <roleNames> and <userNames> must exist.

Synopsis

REVOKE <roleNames> FROM <userNames>, <roleNames>;

Parameters

<roleNames>

Names of roles to remove from users and other roles. Use commas to separate multiple role names.

<userNames>

Names of the users. Use commas to separate multiple user names.

Examples

Remove payrollDept role privileges from user dennis.

REVOKE payrollDept FROM dennis;

Revoke payrollDept and accountsPayableDept role privileges from users dennis and fred and role hrDept.

REVOKE payrollDept, accountsPayableDept FROM dennis, fred, hrDept;


GRANT ON TABLE

Define the privilege(s) a role or user has on the specified table. You can specify any combination of the INSERT, SELECT, DELETE, UPDATE, DROP, or TRUNCATE privilege or specify all privileges.

The ACCESS privilege is a prerequisite for all other privileges at the database level. Without the ACCESS privilege, a user or role cannot perform tasks on any other database objects.

This clause requires superuser privilege, or <tableName> must have been created by the user invoking this command. The specified <tableName> and users or roles defined in <entityList> must exist.

Synopsis

GRANT <privilegeList> ON TABLE <tableName> TO <entityList>;

Parameters

<privilegeList>

Parameter Value

Descriptions

ALL

Grant all possible access privileges on <tableName> to <entityList>.

ALTER TABLE

Grant ALTER TABLE privilege on <tableName> to <entityList>.

DELETE

Grant DELETE privilege on <tableName> to <entityList>.

DROP

Grant DROP privilege on <tableName> to <entityList>.

INSERT

Grant INSERT privilege on <tableName> to <entityList>.

SELECT

Grant SELECT privilege on <tableName> to <entityList>.

TRUNCATE

Grant TRUNCATE privilege on <tableName> to <entityList>.

UPDATE

Grant UPDATE privilege on <tableName> to <entityList>.

<tableName>

Name of the database table.

<entityList>

Name of entity or entities to be granted the privilege(s).

Parameter Value

Descriptions

role

Name of role.

user

Name of user.

Examples

Permit all privileges on the employees table for the payrollDept role.

GRANT ALL ON TABLE employees TO payrollDept;

Permit SELECT-only privilege on the employees table for user chris.

GRANT SELECT ON TABLE employees TO chris;

Permit INSERT-only privilege on the employees table for the hrdept and accountsPayableDept roles.

GRANT INSERT ON TABLE employees TO hrDept, accountsPayableDept;

Permit INSERT, SELECT, and TRUNCATE privileges on the employees table for the role hrDept and for users dennis and mike.

GRANT INSERT, SELECT, TRUNCATE ON TABLE employees TO hrDept, dennis, mike;


REVOKE ON TABLE

Remove the privilege(s) a role or user has on the specified table. You can remove any combination of the INSERT, SELECT, DELETE, UPDATE, or TRUNCATE privileges, or remove all privileges.

This clause requires superuser privilege or <tableName> must have been created by the user invoking this command. The specified <tableName> and users or roles in <entityList> must exist.

Synopsis

REVOKE <privilegeList> ON TABLE <tableName> FROM <entityList>;

Parameters

<privilegeList>

Parameter Value

Descriptions

ALL

Remove all access privilege for <entityList> on <tableName>.

ALTER TABLE

Remove ALTER TABLE privilege for <entityList> on <tableName>.

DELETE

Remove DELETE privilege for <entityList> on <tableName>.

DROP

Remove DROP privilege for <entityList> on <tableName>.

INSERT

Remove INSERT privilege for <entityList> on <tableName>.

SELECT

Remove SELECT privilege for <entityList> on <tableName>.

TRUNCATE

Remove TRUNCATE privilege for <entityList> on <tableName>.

UPDATE

Remove UPDATE privilege for <entityList> on <tableName>.

<tableName>

Name of the database table.

<entityList>

Name of entities to be denied the privilege(s).

Parameter Value

Descriptions

role

Name of role.

user

Name of user.

Examples

Prohibit all operations on the employees table for the nonemployee role.

REVOKE ALL ON TABLE employees FROM nonemployee;

Prohibit SELECT operations on the directors table for the employee role.

REVOKE SELECT ON TABLE directors FROM employee;

Prohibit INSERT operations on the directors table for role employee and user laura.

REVOKE INSERT ON TABLE directors FROM employee, laura;

Prohibit INSERT, SELECT, and TRUNCATE privileges on the employees table for the role nonemployee and for users dennis and mike.

REVOKE INSERT, SELECT, TRUNCATE ON TABLE employees FROM nonemployee, dennis, mike;


GRANT ON VIEW

Define the privileges a role or user has on the specified view. You can specify any combination of the SELECT, INSERT, or DROP privileges, or specify all privileges.

This clause requires superuser privileges, or <viewName> must have been created by the user invoking this command. The specified <viewName> and users or roles in <entityList> must exist.

Synopsis

GRANT <privilegeList> ON VIEW <viewName> TO <entityList>;

Parameters

<privilegeList>

Parameter Value

Descriptions

ALL

Grant all possible access privileges on <viewName> to <entityList>.

DROP

Grant DROP privilege on <viewName> to <entityList>.

INSERT

Grant INSERT privilege on <viewName> to <entityList>.

SELECT

Grant SELECT privilege on <viewName> to <entityList>.

<viewName>

Name of the database view.

<entityList>

Name of entities to be granted the privileges.

Parameter Value

Descriptions

role

Name of role.

user

Name of user.

Examples

Permit SELECT, INSERT, and DROP privileges on the employees view for the payrollDept role.

GRANT ALL ON VIEW employees TO payrollDept;

Permit SELECT-only privilege on the employees view for the employee role and user venkat.

GRANT SELECT ON VIEW employees TO employee, venkat;

Permit INSERT and DROP privileges on the employees view for the hrDept and acctPayableDept roles and users simon and dmitri.

GRANT INSERT, DROP ON VIEW employees TO hrDept, acctPayableDept, simon, dmitri;


REVOKE ON VIEW

Remove the privileges a role or user has on the specified view. You can remove any combination of the INSERT, DROP, or SELECT privileges, or remove all privileges.

This clause requires superuser privilege, or <viewName> must have been created by the user invoking this command. The specified <viewName> and users or roles in <entityList> must exist.

Synopsis

REVOKE <privilegeList> ON VIEW <viewName> FROM <entityList>;

Parameters

<privilegeList>

Parameter Value

Descriptions

ALL

Remove all access privilege for <entityList> on <viewName>.

DROP

Remove DROP privilege for <entityList> on <viewName>.

INSERT

Remove INSERT privilege for <entityList> on <viewName>.

SELECT

Remove SELECT privilege for <entityList> on <viewName>.

<viewName>

Name of the database view.

<entityList>

Name of entity to be denied the privilege(s).

Parameter Value

Descriptions

role

Name of role.

user

Name of user.

Examples

Prohibit SELECT, DROP, and INSERT operations on the employees view for the nonemployee role.

REVOKE ALL ON VIEW employees FROM nonemployee;

Prohibit SELECT operations on the directors view for the employee role.

REVOKE SELECT ON VIEW directors FROM employee;

Prohibit INSERT and DROP operations on the directors view for the employee and manager role and for users ashish and lindsey.

REVOKE INSERT, DROP ON VIEW directors FROM employee, manager, ashish, lindsey;


GRANT ON DATABASE

Define the valid privileges a role or user has on the specified database. You can specify any combination of privileges, or specify all privileges.

The ACCESS privilege is a prerequisite for all other privileges at the database level. Without the ACCESS privilege, a user or role cannot perform tasks on any other database objects.

This clause requires superuser privileges.

Synopsis

GRANT <privilegeList> ON DATABASE <dbName> TO <entityList>;

Parameters

<privilegeList>

Parameter Value

Descriptions

ACCESS

Grant ACCESS (connection) privilege on <dbName> to <entityList>.

ALL

Grant all possible access privileges on <dbName> to <entityList>.

ALTER TABLE

Grant ALTER TABLE privilege on <dbName> to <entityList>.

ALTER SERVER

Grant ALTER SERVER privilege on <dbName> to <entityList>.

CREATE SERVER

Grant CREATE SERVER privilege on <dbName> to <entityList>;

CREATE TABLE

Grant CREATE TABLE privilege on <dbName> to <entityList>. Previously CREATE.

CREATE VIEW

Grant CREATE VIEW privilege on <dbName> to <entityList>.

CREATE DASHBOARD

Grant CREATE DASHBOARD privilege on <dbName> to <entityList>.

CREATE

Grant CREATE privilege on <dbName> to <entityList>.

DELETE

Grant DELETE privilege on <dbName> to <entityList>.

DELETE DASHBOARD

Grant DELETE DASHBOARD privilege on <dbName> to <entityList>.

DROP

Grant DROP privilege on <dbName> to <entityList>.

DROP SERVER

Grant DROP SERVER privilege on <dbName> to <entityList>.

DROP VIEW

Grant DROP VIEW privilege on <dbName> to <entityList>.

EDIT DASHBOARD

Grant EDIT DASHBOARD privilege on <dbName> to <entityList>.

INSERT

Grant INSERT privilege on <dbName> to <entityList>.

SELECT

Grant SELECT privilege on <dbName> to <entityList>.

SELECT VIEW

Grant SELECT VIEW privilege on <dbName> to <entityList>.

SERVER USAGE

Grant SERVER USAGE privilege on <dbName> to <entityList>.

TRUNCATE

Grant TRUNCATE privilege on <dbName> to <entityList>.

UPDATE

Grant UPDATE privilege on <dbName> to <entityList>.

VIEW DASHBOARD

Grant VIEW DASHBOARD privilege on <dbName> to <entityList>.

VIEW SQL EDITOR

Grant VIEW SQL EDITOR privilege in Immerse on <dbName> to <entityList>.

<dbName>

Name of the database, which must exist, created by CREATE DATABASE.

<entityList>

Name of the entity to be granted the privilege.

Parameter Value

Descriptions

role

Name of role, which must exist.

user

Name of user, which must exist.

Examples

Permit all operations on the companydb database for the payrollDept role and user david.

GRANT ALL ON DATABASE companydb TO payrollDept, david;

Permit SELECT-only operations on the companydb database for the employee role.

GRANT ACCESS, SELECT ON DATABASE companydb TO employee;

Permit INSERT, UPDATE, and DROP operations on the companydb database for the hrdept and manager role and for users irene and stephen.

GRANT ACCESS, INSERT, UPDATE, DROP ON DATABASE companydb TO hrdept, manager, irene, stephen;


REVOKE ON DATABASE

Remove the operations a role or user can perform on the specified database. You can specify privileges individually or specify all privileges.

This clause requires superuser privilege or the user must own the database object. The specified <dbName> and roles or users in <entityList> must exist.

Synopsis

REVOKE <privilegeList> ON DATABASE <dbName> FROM <entityList>;

Parameters

<privilegeList>

Parameter Value

Descriptions

ACCESS

Remove ACCESS (connection) privilege on <dbName> from <entityList>.

ALL

Remove all possible privileges on <dbName> from <entityList>.

ALTER SERVER

Remove ALTER SERVER privilege on <dbName> from <entityList>

ALTER TABLE

Remove ALTER TABLE privilege on <dbName> from <entityList>.

CREATE TABLE

Remove CREATE TABLE privilege on <dbName> from <entityList>. Previously CREATE.

CREATE VIEW

Remove CREATE VIEW privilege on <dbName> from <entityList>.

CREATE DASHBOARD

Remove CREATE DASHBOARD privilege on <dbName> from <entityList>.

CREATE

Remove CREATE privilege on <dbName> from <entityList>.

CREATE SERVER

Remove CREATE SERVER privilege on <dbName> from <entityList>.

DELETE

Remove DELETE privilege on <dbName> from <entityList>.

DELETE DASHBOARD

Remove DELETE DASHBOARD privilege on <dbName> from <entityList>.

DROP

Remove DROP privilege on <dbName> from <entityList>.

DROP SERVER

Remove DROP SERVER privilege on <dbName> from <entityList>.

DROP VIEW

Remove DROP VIEW privilege on <dbName> from <entityList>.

EDIT DASHBOARD

Remove EDIT DASHBOARD privilege on <dbName> from <entityList>.

INSERT

Remove INSERT privilege on <dbName> from <entityList>.

SELECT

Remove SELECT privilege on <dbName> from <entityList>.

SELECT VIEW

Remove SELECT VIEW privilege on <dbName> from <entityList>.

SERVER USAGE

Remove SERVER USAGE privilege on <dbName> from <entityList>.

TRUNCATE

Remove TRUNCATE privilege on <dbName> from <entityList>.

UPDATE

Remove UPDATE privilege on <dbName> from <entityList>.

VIEW DASHBOARD

Remove VIEW DASHBOARD privilege on <dbName> from <entityList>.

VIEW SQL EDITOR

Remove VIEW SQL EDITOR privilege in Immerse on <dbName> from <entityList>.

<dbName>

Name of the database.

<entityList>

Parameter Value

Descriptions

role

Name of role.

user

Name of user.

Examples

Prohibit all operations on the employees database for the nonemployee role.

REVOKE ALL ON DATABASE employees FROM nonemployee;

Prohibit SELECT operations on the directors database for the employee role and for user monica.

REVOKE SELECT ON DATABASE directors FROM employee, monica;

Prohibit INSERT, DROP, CREATE, and DELETE operations on the directors database for employee role and for users max and alex.

REVOKE INSERT, DROP, CREATE, DELETE ON DATABASE directors FROM employee, max, alex;


GRANT ON SERVER

Define the valid privileges a role or user has for working with servers. You can specify any combination of privileges or specify all privileges.

This clause requires superuser privileges, or <serverName> must have been created by the user invoking the command.

Synopsis

GRANT <privilegeList> ON SERVER <serverName> TO <entityList>;

Parameters

<privilegeList>

Parameter Value

Descriptions

DROP

Grant DROP privileges on <serverName> on current database to <entityList>.

ALTER

Grant ALTER privilege on <serverName> on current database to <entityList>.

USAGE

Grant USAGE privilege (through foreign tables) on <serverName> on current database to <entityList>.

<serverName>

Name of the server, which must exist on the current database, created by CREATE SERVER ON DATABASE.

<entityList>

Parameter Value

Descriptions

role

Name of role, which must exist.

user

Name of user, which must exist.

Examples

Grant DROP privilege on server parquet_s3_server to user fred:

GRANT DROP ON SERVER parquet_s3_server TO fred;

Grant ALTER privilege on server parquet_s3_server to role payrollDept:

GRANT ALTER ON SERVER parquet_s3_server TO payrollDept;

Grant USAGE and ALTER privileges on server parquet_s3_server to role payrollDept and user jamie:

GRANT USAGE, ALTER ON SERVER parquet_s3_server TO payrollDept, jamie;


REVOKE ON SERVER

Remove privileges a role or user has for working with servers. You can specify any combination of privileges or specify all privileges.

This clause requires superuser privileges, or <serverName> must have been created by the user invoking the command.

Synopsis

REVOKE <privilegeList> ON SERVER <serverName> FROM <entityList>;

Parameters

<privilegeList>

Parameter Value

Descriptions

DROP

Remove DROP privileges on <serverName> on current database for <entityList>.

ALTER

Remove ALTER privilege on <serverName> on current database for <entityList>.

USAGE

Remove USAGE privilege (through foreign tables) on <serverName> on current database for <entityList>.

<serverName>

Name of the server, which must exist on the current database, created by CREATE SERVER ON DATABASE.

<entityList>

Parameter Value

Descriptions

role

Name of role, which must exist.

user

Name of user, which must exist.

Examples

Revoke DROP privilege on server parquet_s3_server for user inga:

REVOKE DROP ON SERVER parquet_s3_server FROM inga;

Revoke ALTER privilege on server parquet_s3_server for role payrollDept:

REVOKE ALTER ON SERVER parquet_s3_server FROM payrollDept;

Revoke USAGE and ALTER privileges on server parquet_s3_server for role payrollDept and user marvin:

REVOKE USAGE, ALTER ON SERVER parquet_s3_server FROM payrollDept, marvin;


GRANT ON DASHBOARD

Define the valid privileges a role or user has for working with dashboards. You can specify any combination of privileges or specify all privileges.

This clause requires superuser privileges.

Synopsis

GRANT <privilegeList> [ON DASHBOARD <dashboardId>] TO <entityList>;

Parameters

<privilegeList>

Parameter Value

Descriptions

ALL

Grant all possible access privileges on <dashboardId> to <entityList>.

CREATE

Grant CREATE privilege to <entityList>.

DELETE

Grant DELETE privilege on <dashboardId> to <entityList>.

EDIT

Grant EDIT privilege on <dashboardId> to <entityList>.

VIEW

Grant VIEW privilege on <dashboardId> to <entityList>.

<dashboardId>

ID of the dashboard, which must exist, created by CREATE DASHBOARD. To show a list of all dashboards and IDs in heavysql, run the \dash command when logged in as superuser.

<entityList>

Parameter Value

Descriptions

role

Name of role, which must exist.

user

Name of user, which must exist.

Examples

Permit all privileges on the dashboard ID 740 for the payrollDept role.

GRANT ALL ON DASHBOARD 740 TO payrollDept;

Permit VIEW-only privilege on dashboard 730 for the hrDept role and user dennis.

GRANT VIEW ON DASHBOARD 730 TO hrDept, dennis;

Permit EDIT and DELETE privileges on dashboard 740 for the hrDept and accountsPayableDept roles and for user pavan.

GRANT EDIT, DELETE ON DASHBOARD 740 TO hrdept, accountsPayableDept, pavan;


REVOKE ON DASHBOARD

Remove privileges a role or user has for working with dashboards. You can specify any combination of privileges, or all privileges.

This clause requires superuser privileges.

Synopsis

REVOKE <privilegeList> [ON DASHBOARD <dashboardId>] FROM <entityList>;

Parameters

<privilegeList>

Parameter Value

Descriptions

ALL

Revoke all possible access privileges on <dashboardId> for <entityList>.

CREATE

Revoke CREATE privilege for <entityList>.

DELETE

Revoke DELETE privilege on <dashboardId> for <entityList>.

EDIT

Revoke EDIT privilege on <dashboardId> for <entityList>.

VIEW

Revoke VIEW privilege on <dashboardId> for <entityList>.

<dashboardId>

ID of the dashboard, which must exist, created by CREATE DASHBOARD.

<entityList>

Parameter Value

Descriptions

role

Name of role, which must exist.

user

Name of user, which must exist.

Examples

Revoke DELETE privileges on dashboard 740 for the payrollDept role.

REVOKE DELETE ON DASHBOARD 740 FROM payrollDept;

Revoke all privileges on dashboard 730 for hrDept role and users dennis and mike.

REVOKE ALL ON DASHBOARD 730 FROM hrDept, dennis, mike;

Revoke EDIT and DELETE of dashboard 740 for the hrDept and accountsPayableDept roles and for users dante and jonathan.

REVOKE EDIT, DELETE ON DASHBOARD 740 FROM hrdept, accountsPayableDept, dante, jonathan;


Common Privilege Levels for Non-Superusers

The following privilege levels are typically recommended for non-superusers in Immerse. Privileges assigned for users in your organization may vary depending on access requirements.

Privilege

Command Syntax to Grant Privilege

Access a database

GRANT ACCESS ON DATABASE <dbName> TO <entityList>;

Create a table

GRANT CREATE TABLE ON DATABASE <dbName> TO <entityList>;

Select a table

GRANT SELECT ON TABLE <tableName> TO <entityList>;

View a dashboard

GRANT VIEW ON DASHBOARD <dashboardId> TO <entityList>;

Create a dashboard

GRANT CREATE DASHBOARD ON DATABASE <dbName> TO <entityList>;

Edit a dashboard

GRANT EDIT ON DASHBOARD <dashboardId> TO <entityList>;

Delete a dashboard

GRANT DELETE DASHBOARD ON DATABASE <dbName> TO <entityList>;
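Taken together, a typical read-only Immerse user can be set up as in the following sketch. The role immerse_viewer and user viewer1 are placeholder names, and the companydb database, employees table, dashboard ID 740, and user viewer1 are assumed to already exist:

CREATE ROLE immerse_viewer;
GRANT ACCESS, VIEW SQL EDITOR ON DATABASE companydb TO immerse_viewer;
GRANT SELECT ON TABLE employees TO immerse_viewer;
GRANT VIEW ON DASHBOARD 740 TO immerse_viewer;
GRANT immerse_viewer TO viewer1;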

Example: Roles and Privileges

These examples assume that tables table1 through table4 are created as needed:

create table table1 (id smallint);
create table table2 (id smallint);
create table table3 (id smallint);
create table table4 (id smallint);

The following examples show how to work with users, roles, tables, and dashboards.

Create User Accounts

create user marketingDeptEmployee1 (password = 'md1');
create user marketingDeptEmployee2 (password = 'md2');
create user marketingDeptManagerEmployee3 (password = 'md3');

create user salesDeptEmployee1 (password = 'sd1');
create user salesDeptEmployee2 (password = 'sd2');
create user salesDeptEmployee3 (password = 'sd3');
create user salesDeptEmployee4 (password = 'sd4');
create user salesDeptManagerEmployee5 (password = 'sd5');

Grant Access to Users on Database

grant access on database heavyai to marketingDeptEmployee1, marketingDeptEmployee2, marketingDeptManagerEmployee3;
grant access on database heavyai to salesDeptEmployee1, salesDeptEmployee2, salesDeptEmployee3, salesDeptEmployee4, salesDeptManagerEmployee5;

Create Marketing Department Roles

create role marketingDeptRole1;
create role marketingDeptRole2;

Grant Marketing Department Roles to Marketing Department Employees

grant marketingDeptRole1 to marketingDeptEmployee1, marketingDeptManagerEmployee3;
grant marketingDeptRole2 to marketingDeptEmployee2, marketingDeptManagerEmployee3;

Grant Privilege to Marketing Department Roles

grant select on table table1 to marketingDeptRole1;
grant select on table table2 to marketingDeptRole1;
grant select on table table2 to marketingDeptRole2;

Create Sales Department Roles

create role salesDeptRole1;
create role salesDeptRole2;
create role salesDeptRole3;

Grant Sales Department Roles to Sales Department Employees

grant salesDeptRole1 to salesDeptEmployee1;
grant salesDeptRole2 to salesDeptEmployee2, salesDeptEmployee3;
grant salesDeptRole3 to salesDeptEmployee4;

Grant Privilege to Sales Department Roles

grant select on table table1 to salesDeptRole1;
grant select on table table3 to salesDeptRole1, salesDeptRole2;
grant select on table table4 to salesDeptRole3;

Grant All Sales Roles to Sales Department Manager and Marketing Department Manager

grant salesDeptRole1, salesDeptRole2, salesDeptRole3 to salesDeptManagerEmployee5, marketingDeptManagerEmployee3;

Grant View on Dashboards

Use the \dash command to list all dashboards and their unique IDs in HEAVY.AI:

heavysql> \dash 
Dashboard ID | Dashboard Name    | Owner 
1            | Marketing_Summary | heavyai

Here, the Marketing_Summary dashboard uses table2 as a data source. The role marketingDeptRole2 has select privileges on that table. Grant view access on the Marketing_Summary dashboard to marketingDeptRole2:

grant view on dashboard 1 to marketingDeptRole2;

Relationships Between Users, Roles, and Tables

The following table shows the roles and privileges for each user created in the previous example.

User

Roles Granted

Table Privileges

salesDeptEmployee1

salesDeptRole1

SELECT on Tables 1, 3

salesDeptEmployee2

salesDeptRole2

SELECT on Table 3

salesDeptEmployee3

salesDeptRole2

SELECT on Table 3

salesDeptEmployee4

salesDeptRole3

SELECT on Table 4

salesDeptManagerEmployee5

salesDeptRole1, salesDeptRole2, salesDeptRole3

SELECT on Tables 1, 3, 4

marketingDeptEmployee1

marketingDeptRole1

SELECT on Tables 1, 2

marketingDeptEmployee2

marketingDeptRole2

SELECT on Table 2

marketingDeptManagerEmployee3

marketingDeptRole1, marketingDeptRole2, salesDeptRole1, salesDeptRole2, salesDeptRole3

SELECT on Tables 1, 2, 3, 4

Commands to Report Roles and Privileges

Use the following commands to list current roles and assigned privileges. If you have superuser access, you can see privileges for all users. Otherwise, you can see only those roles and privileges for which you have access.

Results for users, roles, privileges, and object privileges are returned in creation order.

\dash

Lists all dashboards and dashboard IDs in HEAVY.AI. Requires superuser privileges. Dashboard privileges are assigned by dashboard ID because dashboard names may not be unique.

Example

heavysql> \dash
Dashboard ID | Dashboard Name    | Owner 
1            | Marketing_Summary | heavyai


\object_privileges objectType objectName

Reports all privileges granted to the specified object for all roles and users. If the specified objectName does not exist, no results are reported. Used for databases and tables only.

Example

heavysql> \object_privileges database heavyai 
marketingDeptEmployee1 privileges: login-access 
marketingDeptEmployee2 privileges: login-access 
marketingDeptManagerEmployee3 privileges: login-access
salesDeptEmployee1 privileges: login-access 
salesDeptEmployee2 privileges: login-access 
salesDeptEmployee3 privileges: login-access 
salesDeptEmployee4 privileges: login-access 
salesDeptManagerEmployee5 privileges: login-access

\privileges roleName | userName

Reports all object privileges granted to the specified role or user. The roleName or userName specified must exist.

Example

heavysql> \privileges salesDeptRole1 
table1 (table): select 
table3 (table): select
heavysql> \privileges salesDeptManagerEmployee5 
heavyai (database): login-access

heavysql> \privileges marketingdeptrole2 
table2 (table): select
Marketing_Summary (dashboard): view

\role_list userName

Reports all roles granted to the given user. The userName specified must exist.

Example

heavysql> \role_list salesDeptManagerEmployee5
salesDeptRole3 
salesDeptRole2 
salesDeptRole1

\roles

Reports all roles.

Example

heavysql> \roles
marketingDeptRole1 
marketingDeptRole2 
salesDeptRole1 
salesDeptRole2 
salesDeptRole3

\u

Lists all users.

Example

heavysql> \u 
heavyai 
marketingDeptEmployee1 
marketingDeptEmployee2 
salesDeptEmployee1 
salesDeptEmployee2 
salesDeptEmployee3 
salesDeptEmployee4 
salesDeptManagerEmployee5 
marketingDeptManagerEmployee3

Example: Data Security

The following example demonstrates field-level security using two views:

  • view_users_limited, in which users only see three of seven fields: userid, First_Name, and Department.

  • view_users_full, in which users see all seven fields.

Source Data

Create Views

create view view_users_limited as select userid, First_Name, Department from users;
create view view_users_full as select userid, First_Name, Department, Address, City, State, Zip from users;

Create Users

create user readonly1 (password = 'rr1');
create user readonly2 (password = 'rr2');

Grant Access to Users on Database

grant access on database heavyai to readonly1, readonly2;

Create Roles

create role limited_readonly;
create role full_readonly;

Grant Roles to Users

grant limited_readonly to readonly1;
grant full_readonly to readonly2;

Grant Privilege to View Roles

grant select on view view_users_limited to limited_readonly;
grant select on view view_users_full TO full_readonly;

Verify Views

User readonly1 sees no tables, only the specific view granted, and only the three specific columns returned in the view:

heavysql> \t
heavysql> \v
view_users_limited
heavysql> select * from view_users_limited;
userid|First_Name|Department
1|Todd|C Suite
2|Don|Sales
3|Mike|Customer Success

User readonly2 sees no tables, only the specific view granted, and all seven columns returned in the view:

heavysql> \t
heavysql> \v
view_users_full
heavysql> select * from view_users_full;
userid|First_Name|Department|Address|City|State|Zip
1|Todd|C Suite|1 Front Street|San Francisco|CA|94111
2|Don|Sales|1 5th Avenue|New York|NY|10001
3|Mike|Customer Success|100 Main Street|Reston|VA|20191

Datatypes

Datatypes and Fixed Encoding

This topic describes standard datatypes and space-saving variations for values stored in HEAVY.AI.

Datatypes

Datatypes, variations, and sizes are described in the following table.

Datatype

Size (bytes)

Notes

BIGINT

8

Minimum value: -9,223,372,036,854,775,807; maximum value: 9,223,372,036,854,775,807.

BIGINT ENCODING FIXED(8)

1

Minimum value: -127; maximum value: 127

BIGINT ENCODING FIXED(16)

2

Same as SMALLINT.

BIGINT ENCODING FIXED(32)

4

Same as INTEGER.

BOOLEAN

1

TRUE: 'true', '1', 't'. FALSE: 'false', '0', 'f'. Text values are not case-sensitive.

DATE

4

Same as DATE ENCODING DAYS(32).

DATE ENCODING DAYS(16)

2

Range in days: -32,768 to 32,767. Range in years: +/-90 around epoch (April 14, 1880 - September 9, 2059). Minimum value: -2,831,155,200; maximum value: 2,831,068,800. Supported formats when using COPY FROM: mm/dd/yyyy, dd-mmm-yy, yyyy-mm-dd, dd/mmm/yyyy.

DATE ENCODING DAYS(32)

4

Range in years: +/-5,883,517 around epoch. Maximum date January 1, 5885487 (approximately). Minimum value: -2,147,483,648; maximum value: 2,147,483,647. Supported formats when using COPY FROM: mm/dd/yyyy, dd-mmm-yy, yyyy-mm-dd, dd/mmm/yyyy.

DATE ENCODING FIXED(16)

2

In DDL statements defaults to DATE ENCODING DAYS(16). Deprecated.

DATE ENCODING FIXED(32)

4

In DDL statements defaults to DATE ENCODING DAYS(32). Deprecated.

DECIMAL

2, 4, or 8

Takes precision and scale parameters: DECIMAL(precision,scale)

Size depends on precision:

  • Up to 4: 2 bytes

  • 5 to 9: 4 bytes

  • 10 to 18 (maximum): 8 bytes

Scale must be less than precision.

DOUBLE

8

Variable precision. Minimum value: -1.79e308; maximum value: 1.79e308

EPOCH

8

Seconds ranging from -30610224000 (1/1/1000 00:00:00) through 185542587100800 (1/1/5885487 23:59:59).

FLOAT

4

Variable precision. Minimum value: -3.4e38; maximum value: 3.4e38.

INTEGER

4

Minimum value: -2,147,483,647; maximum value: 2,147,483,647.

INTEGER ENCODING FIXED(8)

1

Minimum value: -127; maximum value: 127.

INTEGER ENCODING FIXED(16)

2

Same as SMALLINT.

LINESTRING

Variable[2]

Geospatial datatype. A sequence of 2 or more points and the lines that connect them. For example: LINESTRING(0 0,1 1,1 2)

MULTILINESTRING

Variable[2]

Geospatial datatype. A set of associated lines. For example: MULTILINESTRING((0 0, 1 0, 2 0), (0 1, 1 1, 2 1))

MULTIPOINT

Variable[2]

Geospatial datatype. A set of points. For example: MULTIPOINT((0 0), (1 0), (2 0))

MULTIPOLYGON

Variable[2]

Geospatial datatype. A set of one or more polygons. For example:MULTIPOLYGON(((0 0,4 0,4 4,0 4,0 0),(1 1,2 1,2 2,1 2,1 1)), ((-1 -1,-1 -2,-2 -2,-2 -1,-1 -1)))

POINT

Variable[2]

Geospatial datatype. A point described by two coordinates. When the coordinates are longitude and latitude, HEAVY.AI stores longitude first, and then latitude. For example: POINT(0 0)

POLYGON

Variable[2]

Geospatial datatype. A set of one or more rings (closed line strings), with the first representing the shape (external ring) and the rest representing holes in that shape (internal rings). For example: POLYGON((0 0,4 0,4 4,0 4,0 0),(1 1, 2 1, 2 2, 1 2,1 1))

SMALLINT

2

Minimum value: -32,767; maximum value: 32,767.

SMALLINT ENCODING FIXED(8)

1

Minimum value: -127; maximum value: 127.

TEXT ENCODING DICT

4

Max cardinality 2 billion distinct string values. Maximum string length is 32,767.

TEXT ENCODING DICT(8)

1

Max cardinality 255 distinct string values.

TEXT ENCODING DICT(16)

2

Max cardinality 64 K distinct string values.

TEXT ENCODING NONE

Variable

Size of the string + 6 bytes. Maximum string length is 32,767.

TIME

8

Minimum value: 00:00:00; maximum value: 23:59:59.

TIME ENCODING FIXED(32)

4

Minimum value: 00:00:00; maximum value: 23:59:59.

TIMESTAMP(0)

8

Linux timestamp from -30610224000 (1/1/1000 00:00:00) through 29379542399 (12/31/2900 23:59:59). Can also be inserted and stored in human-readable format: YYYY-MM-DD HH:MM:SS or YYYY-MM-DDTHH:MM:SS (the T is dropped when the field is populated).

TIMESTAMP(3) (milliseconds)

8

Linux timestamp from -30610224000000 (1/1/1000 00:00:00.000) through 29379542399999 (12/31/2900 23:59:59.999). Can also be inserted and stored in human-readable format: YYYY-MM-DD HH:MM:SS.fff or YYYY-MM-DDTHH:MM:SS.fff (the T is dropped when the field is populated).

TIMESTAMP(6) (microseconds)

8

Linux timestamp from -30610224000000000 (1/1/1000 00:00:00.000000) through 29379542399999999 (12/31/2900 23:59:59.999999). Can also be inserted and stored in human-readable format: YYYY-MM-DD HH:MM:SS.ffffff or YYYY-MM-DDTHH:MM:SS.ffffff (the T is dropped when the field is populated).

TIMESTAMP(9) (nanoseconds)

8

Linux timestamp from -9223372036854775807 (09/21/1677 00:12:43.145224193) through 9223372036854775807 (11/04/2262 23:47:16.854775807). Can also be inserted and stored in human-readable format: YYYY-MM-DD HH:MM:SS.fffffffff or YYYY-MM-DDTHH:MM:SS.fffffffff (the T is dropped when the field is populated).

TIMESTAMP ENCODING FIXED(32)

4

Range: 1901-12-13 20:45:53 - 2038-01-19 03:14:07. Can also be inserted and stored in human-readable format: YYYY-MM-DD HH:MM:SS or YYYY-MM-DDTHH:MM:SS (the T is dropped when the field is populated).

TINYINT

1

Minimum value: -127; maximum value: 127.

[1] - In HEAVY.AI release 4.4.0 and higher, you can use existing 8-byte DATE columns, but you can create only 4-byte DATE columns (default) and 2-byte DATE columns (see DATE ENCODING DAYS(16)).

  • HEAVY.AI does not support geometry arrays.

  • Timestamp values are always stored in 8 bytes. The greater the precision, the narrower the supported date range.

Geospatial Datatypes

HEAVY.AI supports the LINESTRING, MULTILINESTRING, POLYGON, MULTIPOLYGON, POINT, and MULTIPOINT geospatial datatypes.

In the following example:

  • p0, p1, ls0, and poly0 are simple (planar) geometries.

  • p4 is a point geometry with Web Mercator (SRID 900913) coordinates.

  • p2, p3, mp, ls1, ls2, mls1, mls2, poly1, and mpoly0 are geometries using WGS84 SRID=4326 longitude/latitude coordinates.

CREATE TABLE geo ( name TEXT ENCODING DICT(32),
                   p0 POINT,
                   p1 GEOMETRY(POINT),
                   p2 GEOMETRY(POINT, 4326),
                   p3 GEOMETRY(POINT, 4326) ENCODING NONE,
                   p4 GEOMETRY(POINT, 900913),
                   mp GEOMETRY(MULTIPOINT, 4326),
                   ls0  LINESTRING,
                   ls1 GEOMETRY(LINESTRING, 4326) ENCODING COMPRESSED(32),
                   ls2 GEOMETRY(LINESTRING, 4326) ENCODING NONE,
                   mls1 GEOMETRY(MULTILINESTRING, 4326) ENCODING COMPRESSED(32),
                   mls2 GEOMETRY(MULTILINESTRING, 4326) ENCODING NONE,
                   poly0 POLYGON,
                   poly1 GEOMETRY(POLYGON, 4326) ENCODING COMPRESSED(32),
                   mpoly0 GEOMETRY(MULTIPOLYGON, 4326)
                  );

Storage

Geometry storage requirements are largely dependent on coordinate data. Coordinates are normally stored as 8-byte doubles, two coordinates per point, for all points that form a geometry. Each POINT geometry in the p1 column, for example, requires 16 bytes.

Compression

WGS84 (SRID 4326) coordinates are compressed to 32 bits by default. This sacrifices some precision but reduces storage requirements by half.

For example, columns p2, mp, ls1, mls1, poly1, and mpoly0 in the table defined above are compressed. Each geometry in the p2 column requires 8 bytes, compared to 16 bytes for p0.

You can explicitly disable compression. WGS84 columns p3, ls2, and mls2 are not compressed and continue using doubles. Simple (planar) columns p0, p1, ls0, and poly0 and the non-4326 column p4 are not compressed.

Defining Arrays

Define datatype arrays by appending square brackets, as shown in the arrayexamples DDL sample.

CREATE TABLE arrayexamples (
  tiny_int_array TINYINT[],
  int_array INTEGER[],
  big_int_array BIGINT[],
  text_array TEXT[] ENCODING DICT(32), --HeavyDB supports only DICT(32) TEXT arrays.
  float_array FLOAT[],
  double_array DOUBLE[],
  decimal_array DECIMAL(18,6)[],
  boolean_array BOOLEAN[],
  date_array DATE[],
  time_array TIME[],
  timestamp_array TIMESTAMP[]);

You can also define fixed-length arrays. For example:

CREATE TABLE arrayexamples (
  float_array3 FLOAT[3],
  date_array4 DATE[4]);

Fixed-length arrays require less storage space than variable-length arrays.

Fixed Encoding

To use fixed-length fields, the range of the data must fit into the constraints as described. Understanding your schema and the scope of potential values in each field helps you to apply fixed encoding types and save significant storage space.

These encodings are most effective on low-cardinality TEXT fields, where you can achieve large savings of storage space and improved processing speed, and on TIMESTAMP fields where the timestamps range between 1901-12-13 20:45:53 and 2038-01-19 03:14:07. If a TEXT ENCODING field does not match the defined cardinality, HEAVY.AI substitutes a NULL value and logs the change.

For DATE types, you can use the terms FIXED and DAYS interchangeably. Both are synonymous for the DATE type in HEAVY.AI.

Some of the INTEGER options overlap. For example, INTEGER ENCODING FIXED(8) and SMALLINT ENCODING FIXED(8) are essentially identical.
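For example, the following sketch of a hypothetical visits table applies fixed encodings to columns whose value ranges are known in advance:

CREATE TABLE visits (
  visit_id BIGINT ENCODING FIXED(32),
  visit_date DATE ENCODING DAYS(16),
  visit_ts TIMESTAMP ENCODING FIXED(32),
  region TEXT ENCODING DICT(8));

Here visit_id is assumed to fit in the INTEGER range, visit_date to fall between 1880 and 2059, visit_ts to fall between 1901-12-13 and 2038-01-19, and region to have at most 255 distinct values, so each column uses half or less of its default storage.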

Shared Dictionaries

You can improve performance of string operations and optimize storage using shared dictionaries. You can share dictionaries within a table or between different tables in the same database. The table with which you want to share dictionaries must exist when you create the table that references the TEXT ENCODING DICT field, and the column that you are referencing in that table must also exist. The following small DDL shows the basic structure:

CREATE TABLE text_shard (
i TEXT ENCODING DICT(32),
s TEXT ENCODING DICT(32),
SHARD KEY (i))
WITH (SHARD_COUNT = 2);

CREATE TABLE text_shard1 (
i TEXT,
s TEXT ENCODING DICT(32),
SHARD KEY (i),
SHARED DICTIONARY (i) REFERENCES text_shard(i))
WITH (SHARD_COUNT = 2);

In the table definition, make sure that referenced columns appear before the referencing columns.

For example, this DDL is a portion of the schema for the flights database. Because airports are both origin and destination locations, it makes sense to reuse the same dictionaries for name, city, state, and country values.

create table flights (
*
*
*
dest_name TEXT ENCODING DICT,
dest_city TEXT ENCODING DICT,
dest_state TEXT ENCODING DICT,
dest_country TEXT ENCODING DICT,

*
*
*
origin_name TEXT,
origin_city TEXT,
origin_state TEXT,
origin_country TEXT,
*
*
*

SHARED DICTIONARY (origin_name) REFERENCES flights(dest_name),
SHARED DICTIONARY (origin_city) REFERENCES flights(dest_city),
SHARED DICTIONARY (origin_state) REFERENCES flights(dest_state),
SHARED DICTIONARY (origin_country) REFERENCES flights(dest_country),
*
*
*
)
WITH(
*
*
*
)

To share a dictionary in a different existing table, replace the table name in the REFERENCES instruction. For example, if you have an existing table called us_geography, you can share the dictionary by following the pattern in the DDL fragment below.

create table flights (

*
*
*

SHARED DICTIONARY (origin_city) REFERENCES us_geography(city),
SHARED DICTIONARY (origin_state) REFERENCES us_geography(state),
SHARED DICTIONARY (origin_country) REFERENCES us_geography(country),
SHARED DICTIONARY (dest_city) REFERENCES us_geography(city),
SHARED DICTIONARY (dest_state) REFERENCES us_geography(state),
SHARED DICTIONARY (dest_country) REFERENCES us_geography(country),

*
*
*
)
WITH(
*
*
*
);

The referencing column cannot specify the encoding of the dictionary, because it uses the encoding from the referenced column.

Views

DDL - Views

A view is a virtual table based on the result set of a SQL statement. It derives its fields from a SELECT statement. You can do anything with a HEAVY.AI view query that you can do in a non-view HEAVY.AI query.

Nomenclature Constraints

[A-Za-z_][A-Za-z0-9\$_]*

CREATE VIEW

Creates a view based on a SQL statement.

Example

CREATE VIEW view_movies
AS SELECT movies.movieId, movies.title, movies.genres, avg(ratings.rating)
FROM ratings
JOIN movies on ratings.movieId=movies.movieId
GROUP BY movies.title, movies.movieId, movies.genres;

You can describe the view as you would a table.

\d view_movies
VIEW defined AS: SELECT  movies.movieId, movies.title, movies.genres,
avg(ratings.rating) FROM ratings JOIN movies ON ratings.movieId=movies.movieId
GROUP BY movies.title, movies.movieId, movies.genres
Column types:
    movieId INTEGER,
    title TEXT ENCODING DICT(32),
    genres TEXT ENCODING DICT(32),
    EXPR$3 DOUBLE

You can query the view as you would a table.

SELECT title, EXPR$3 from view_movies where movieId=260;
Star Wars: Episode IV - A New Hope (1977)|4.048937

DROP VIEW

Removes a view created by the CREATE VIEW statement. The view definition is removed from the database schema, but no actual data in the underlying base tables is modified.

Example

DROP VIEW IF EXISTS v_reviews;

Exporting Data

COPY TO

COPY ( <SELECT statement> ) TO '<file path>' [WITH (<property> = value, ...)];

<file path> must be a path on the server. This command exports the results of any SELECT statement to the file. There is a special mode when <file path> is empty. In that case, the server automatically generates a file in <HEAVY.AI Directory>/export that is the client session id with the suffix .txt.

Available properties in the optional WITH clause are described in the following table.

Parameter

Description

Default Value

array_null_handling

Define how to export with arrays that have null elements:

  • 'abort' - Abort the export. Default.

  • 'raw' - Export null elements as raw values.

  • 'zero' - Export null elements as zero (or an empty string).

  • 'nullfield' - Set the entire array column field to null for that row.

Applies only to GeoJSON and GeoJSONL files.

'abort'

delimiter

A single-character string for the delimiter between column values; most commonly:

  • , for CSV files

  • \t for tab-delimited files

Other delimiters include |, ~, ^, and ;.

Applies to only CSV and tab-delimited files.

Note: HEAVY.AI does not use file extensions to determine the delimiter.

',' (CSV file)

escape

A single-character string for escaping quotes. Applies to only CSV and tab-delimited files.

' (quote)

file_compression

File compression; can be one of the following:

  • 'none'

  • 'gzip'

  • 'zip'

For GeoJSON and GeoJSONL files, using GZip results in a compressed single file with a .gz extension. No other compression options are currently available.

'none'

file_type

Type of file to export; can be one of the following:

  • 'csv' - Comma-separated values file.

  • 'geojson' - FeatureCollection GeoJSON file.

  • 'geojsonl' - Multiline GeoJSONL file.

  • 'shapefile' - Geospatial shapefile.

For all file types except CSV, exactly one geo column (POINT, LINESTRING, POLYGON or MULTIPOLYGON) must be projected in the query. CSV exports can contain zero or any number of geo columns, exported as WKT strings.

Export of array columns to shapefiles is not supported.

'csv'

header

Either 'true' or 'false', indicating whether to output a header line for all the column names. Applies to only CSV and tab-delimited files.

'true'

layer_name

A layer name for the geo layer in the file. If unspecified, the stem of the given filename is used, without path or extension.

Applies to all file types except CSV.

Stem of the filename, if unspecified

line_delimiter

A single-character string for terminating each line. Applies to only CSV and tab-delimited files.

'\n'

nulls

A string pattern indicating that a field is NULL. Applies to only CSV and tab-delimited files.

An empty string, 'NA', or \N

quote

A single-character string for quoting a column value. Applies to only CSV and tab-delimited files.

" (double quote)

quoted

Either 'true' or 'false', indicating whether all the column values should be output in quotes. Applies to only CSV and tab-delimited files.

'true'

When using the COPY TO command, you might encounter the following error:

Query couldn’t keep the entire working set of columns in GPU Memory.

Example

COPY (SELECT * FROM tweets) TO '/tmp/tweets.csv';
COPY (SELECT * FROM tweets ORDER BY tweet_time LIMIT 10000) TO
  '/tmp/tweets.tsv' WITH (delimiter = '\t', quoted = 'true', header = 'false');
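Geo exports follow the same pattern. For example, this sketch (the output path is arbitrary) exports a single geo column from the geo table defined earlier in this topic as a GZip-compressed GeoJSON file:

COPY (SELECT name, poly1 FROM geo) TO '/tmp/geo_polys.geojson'
  WITH (file_type = 'geojson', file_compression = 'gzip', layer_name = 'polygons');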

Loading Data with SQL

This topic describes several ways to load data to HEAVY.AI using SQL commands.

  • If a source file uses a reserved word, HEAVY.AI automatically adds an underscore at the end of the reserved word. For example, year is converted to year_.

COPY FROM

CSV/TSV Import

Use the following syntax for CSV and TSV files:

COPY <table> FROM '<file pattern>' [WITH (<property> = value, ...)];

<file pattern> must be local on the server. The file pattern can contain wildcards if you want to load multiple files. In addition to CSV, TSV, and TXT files, you can import compressed files in TAR, ZIP, 7-ZIP, RAR, GZIP, BZIP2, or TGZ format.

COPY FROM appends data from the source into the target table. It does not truncate the table or overwrite existing data.

You can import client-side files (\copy command in heavysql) but it is significantly slower. For large files, HEAVY.AI recommends that you first scp the file to the server, and then issue the COPY command.

HEAVY.AI supports Latin-1 ASCII format and UTF-8. If you want to load data with another encoding (for example, UTF-16), convert the data to UTF-8 before loading it to HEAVY.AI.

Available properties in the optional WITH clause are described in the following table.

Parameter

Description

Default Value

array_delimiter

A single-character string for the delimiter between input values contained within an array.

, (comma)

array_marker

A two-character string consisting of the start and end characters surrounding an array.

{ }(curly brackets). For example, data to be inserted into a table with a string array in the second column (for example, BOOLEAN, STRING[], INTEGER) can be written as true,{value1,value2,value3},3

buffer_size

Size of the input file buffer, in bytes.

8388608

delimiter

A single-character string for the delimiter between input fields; most commonly:

  • , for CSV files

  • \t for tab-delimited files

Other delimiters include |, ~, ^, and ;.

Note: HEAVY.AI does not use file extensions to determine the delimiter.

',' (CSV file)

escape

A single-character string for escaping quotes.

'"' (double quote)

geo

Import geo data. Deprecated and scheduled for removal in a future release.

'false'

header

Either 'true' or 'false', indicating whether the input file has a header line in Line 1 that should be skipped.

'true'

line_delimiter

A single-character string for terminating each line.

'\n'

lonlat

In HEAVY.AI, POINT fields require longitude before latitude. Use this parameter based on the order of longitude and latitude in your source data.

'true'

max_reject

Number of records that the COPY statement allows to be rejected before terminating the COPY command. Records can be rejected for a number of reasons, including invalid content in a field, or an incorrect number of columns. The details of the rejected records are reported in the ERROR log. COPY returns a message identifying how many records are rejected. The records that are not rejected are inserted into the table, even if the COPY stops because the max_reject count is reached.

Note: If you run the COPY command from Heavy Immerse, the COPY command does not return messages to Immerse once the SQL is verified. Immerse does not show messages about data loading, or about data-quality issues that result in max_reject triggers.

100,000

nulls

A string pattern indicating that a field is NULL.

An empty string, 'NA', or \N

parquet

Import data in Parquet format. Parquet files can be compressed using Snappy. Other archives such as .gz or .zip must be unarchived before you import the data. Deprecated and scheduled for removal in a future release.

'false'

plain_text

Indicates that the input file is plain text so that it bypasses the libarchive decompression utility.

CSV, TSV, and TXT are handled as plain text.

quote

A single-character string for quoting a field.

" (double quote). All characters inside quotes are imported “as is,” except for line delimiters.

quoted

Either 'true' or 'false', indicating whether the input file contains quoted fields.

'true'

source_srid

When importing into GEOMETRY(*, 4326) columns, specifies the SRID of the incoming geometries, all of which are transformed on the fly. For example, to import from a file that contains EPSG:2263 (NAD83 / New York Long Island) geometries, run the COPY command and include WITH (source_srid=2263). Data targeted at non-4326 geometry columns is not affected.

0

source_type='<type>'

Type can be one of the following:

delimited_file - Import as CSV.

geo_file - Import as Geo file. Use for shapefiles, GeoJSON, and other geo files. Equivalent to deprecated geo='true'.

raster_file - Import as a raster file.

parquet_file - Import as a Parquet file. Equivalent to deprecated parquet='true'.

delimited_file

threads

Number of threads for performing the data import.

Number of CPU cores on the system

trim_spaces

Indicate whether to trim side spaces ('true') or not ('false').

'false'

By default, the CSV parser assumes one row per line. To import a file with multiple lines in a single field, specify threads = 1 in the WITH clause.

Examples

COPY tweets FROM '/tmp/tweets.csv' WITH (nulls = 'NA'); 
COPY tweets FROM '/tmp/tweets.tsv' WITH (delimiter = '\t', quoted = 'false'); 
COPY tweets FROM '/tmp/*' WITH (header='false'); 
COPY trips FROM '/mnt/trip/trip.parquet/part-00000-0284f745-1595-4743-b5c4-3aa0262e4de3-c000.snappy.parquet' with (parquet='true');
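If the source geometries use a spatial reference system other than 4326, add source_srid as described above. For example, this sketch (the table and file names are hypothetical) imports EPSG:2263 geometries into a table with GEOMETRY(*, 4326) columns, transforming them on the fly:

COPY nyc_buildings FROM '/tmp/nyc_buildings.csv' WITH (source_srid = 2263);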

Geo Import

You can use COPY FROM to import geo files. You can create the table based on the source file and then load the data:

COPY FROM 'source' WITH (source_type='geo_file', ...);

You can also append data to an existing, predefined table:

COPY tableName FROM 'source' WITH (source_type='geo_file', ...);

Use the following syntax, depending on the file source.

Local server

COPY [tableName] FROM '/filepath' WITH (source_type='geo_file', ...);

Web site

COPY [tableName] FROM 'http[s]://website/filepath' WITH (source_type='geo_file', ...);

Amazon S3

COPY [tableName] FROM 's3://bucket/filepath' WITH (source_type='geo_file', s3_region='region', s3_access_key='accesskey', s3_secret_key='secretkey', ... );

  • If you are using COPY FROM to load to an existing table, the field type must match the metadata of the source file. If it does not, COPY FROM throws an error and does not load the data.

  • COPY FROM appends data from the source into the target table. It does not truncate the table or overwrite existing data.

  • Supported DATE formats when using COPY FROM include mm/dd/yyyy, dd-mmm-yy, yyyy-mm-dd, and dd/mmm/yyyy.

  • COPY FROM fails for records with latitude or longitude values that have more than 4 decimal places.

The following WITH options are available for geo file imports from all sources.

geo_coords_type

Coordinate type used; must be geography.

N/A

geo_coords_encoding

Coordinates encoding; can be geoint(32) or none.

geoint(32)

geo_coords_srid

Coordinates spatial reference; must be 4326 (WGS84 longitude/latitude).

N/A

geo_explode_collections

Explodes MULTIPOLYGON, MULTILINESTRING, or MULTIPOINT geo data into multiple rows in a POLYGON, LINESTRING, or POINT column, with all other columns duplicated.

When importing from a WKT CSV with a MULTIPOLYGON column, the table must have been manually created with a POLYGON column.

When importing from a geo file, the table is automatically created with the correct type of column.

When the input column contains a mixture of MULTI and single geo, the MULTI geo are exploded, but the singles are imported normally. For example, a column containing five two-polygon MULTIPOLYGON rows and five POLYGON rows imports as a POLYGON column of fifteen rows.

false

geo_validate_geometry

Boolean. If enabled, the importer passes any incoming POLYGON or MULTIPOLYGON data through a validation process. If the geo is considered invalid by OGC (PostGIS) standards (for example, self-intersecting polygons), then the row or feature that contains it is rejected.

Currently, a manually created geo table can have only one geo column. If it has more than one, import is not performed.

Any GDAL-supported file type can be imported. If it is not supported, GDAL throws an error.

The first compatible file in the bundle is loaded; subfolders are traversed until a compatible file is found. The rest of the contents in the bundle are ignored. If the bundle contains multiple filesets, unpack the file manually and specify it for import.

CSV files containing WKT strings are not considered geo files and should not be parsed with the source_type='geo_file' option. When importing WKT strings from CSV files, you must create the table first. The geo column type and encoding are specified as part of the DDL. For example, a compressed polygon column can be declared as follows:

ggpoly GEOMETRY(POLYGON, 4326) ENCODING COMPRESSED(32)
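A fuller sketch of that workflow, using hypothetical table and file names, creates the table first and then loads the WKT strings with a plain delimited import:

CREATE TABLE parcels (
  parcel_id INTEGER,
  ggpoly GEOMETRY(POLYGON, 4326) ENCODING COMPRESSED(32));
COPY parcels FROM '/tmp/parcels_wkt.csv';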

Raster Import

You can use COPY FROM to import raster files supported by GDAL as one row per pixel, where a pixel may consist of one or more data bands, with optional corresponding pixel or world-space coordinate columns. This allows the data to be rendered as a point/symbol cloud that approximates a 2D image.

COPY FROM 'source' WITH (source_type='raster_file', ...);

The following WITH options are available for raster file imports from all sources.

Parameter
Description
Default Value

raster_import_bands='<bandname>[,<bandname>,...]'

Comma-separated list of band names to import. If empty (the default), all bands from all datasets found in the file are imported.

raster_point_transform='<transform>'

Specifies the processing for floating-point coordinate values: auto - Transform based on raster file type (world for geo, none for non-geo).

none - No affine or world-space conversion. Values will be equivalent to the integer pixel coordinates.

file - File-space affine transform only. Values will be in the file's coordinate system, if any (e.g. geospatial).

world - World-space geospatial transform. Values will be projected to WGS84 lon/lat (if the file has a geospatial SRID).

auto

raster_point_type='<type>'

Specifies the required type for the additional pixel coordinate columns: auto - Create columns based on raster file type (double for geo, int or smallint for non-geo, dependent on size).

none - Do not create pixel coordinate columns.

smallint or int - Create integer columns of names raster_x and raster_y and fill with the raw pixel coordinates from the file.

float or double - Create floating-point columns of names raster_x and raster_y (or raster_lon and raster_lat) and fill with file-space or world-space projected coordinates.

point - Create a POINT column of name raster_point and fill with file-space or world-space projected coordinates.

auto

Illegal combinations of raster_point_type and raster_point_transform are rejected. For example, world transform can only be performed on raster files that have a geospatial coordinate system in their metadata, and cannot be performed if <type> is an integer format (which cannot represent world-space coordinate values).

Any GDAL-supported file type can be imported. If it is not supported, GDAL throws an error.

HDF5 and possibly other GDAL drivers may not be thread-safe, so use WITH (threads=1) when importing.

Archive file import (.zip, .tar, .tar.gz) is not currently supported for raster files.

Band and Column Names

The following raster file formats contain the metadata required to derive sensible names for the bands, which are then used for their corresponding columns:

  • GRIB2 - geospatial/meteorological format

  • OME TIFF - an OpenMicroscopy format

The band names from the file are sanitized (illegal characters and spaces removed) and de-duplicated (addition of a suffix in cases where the same band name is repeated within the file or across datasets).

For other formats, the columns are named band_1_1, band_1_2, and so on.

The sanitized and de-duplicated names must be used for the raster_import_bands option.

Band and Column Data Types

Raster files can have bands in the following data types:

  • Signed or unsigned 8-, 16-, or 32-bit integer

  • 32- or 64-bit floating point

  • Complex number formats (not supported)

Signed data is stored in the directly corresponding column type, as follows:

int8    -> TINYINT
int16   -> SMALLINT
int32   -> INT
float32 -> FLOAT
float64 -> DOUBLE

Unsigned integer column types are not currently supported, so any data of those types is converted to the next size up signed column type:

uint8  -> SMALLINT
uint16 -> INT
uint32 -> BIGINT

Column types cannot currently be overridden.

ODBC Import

ODBC import is currently a beta feature.

You can use COPY FROM to import data from a relational database management system (RDBMS) or data warehouse using the Open Database Connectivity (ODBC) interface.

COPY <table_name> FROM '<select_query>' WITH (source_type = 'odbc', ...);

The following WITH options are available for ODBC import.

data_source_name

Data source name (DSN) configured in the odbc.ini file. Only one of data_source_name or connection_string can be specified.

connection_string

A set of semicolon-separated key=value pairs that define the connection parameters for an RDBMS. For example: Driver=DriverName;Database=DatabaseName;Servername=HostName;Port=1234

Only one of data_source_name or connection_string can be specified.

sql_order_by

Comma-separated list of column names that provide a unique ordering for the result set returned by the specified SQL SELECT statement.

username

Username on the RDBMS. Applies only when data_source_name is used.

password

Password credential for the RDBMS. This option applies only when data_source_name is used.

credential_string

A set of semicolon-separated key=value pairs that define the access credential parameters for an RDBMS. For example:

Username=username;Password=password

Applies only when connection_string is used.

Examples

Using a data source name:

COPY example_table
  FROM 'SELECT * FROM remote_postgres_table WHERE event_timestamp > ''2020-01-01'';'
  WITH 
    (source_type = 'odbc', 
     sql_order_by = 'event_timestamp',
     data_source_name = 'postgres_db_1',
     username = 'my_username',
     password = 'my_password');

Using a connection string:

COPY example_table
  FROM 'SELECT * FROM remote_postgres_table WHERE event_timestamp > ''2020-01-01'';'
  WITH 
    (source_type = 'odbc',
     sql_order_by = 'event_timestamp',
     connection_string = 'Driver=PostgreSQL;Database=my_postgres_db;Servername=my_postgres.example.com;Port=1234',
     credential_string = 'Username=my_username;Password=my_password');

Globbing, Filtering, and Sorting Parquet and CSV Files

These examples assume the following folder and file structure:

Globbing

Local Parquet/CSV files can now be globbed by specifying either a path name with a wildcard or a folder name.

Globbing a folder recursively returns all files under the specified folder. For example,

COPY table_1 FROM ".../subdir";

returns file_3, file_4, file_5.

Globbing with a wildcard returns any file paths matching the expanded file path. So

COPY table_1 FROM ".../subdir/file*"; returns file_3, file_4.

Globbing does not apply to S3 use cases, because file paths specified for S3 always use prefix matching.

Filtering

Use file filtering to filter out unwanted files that have been globbed. To use filtering, specify the REGEX_PATH_FILTER option. Files not matching this pattern are not included on import. This behavior is consistent across local and S3 use cases.

For example, the following command:

COPY table_1 from ".../" WITH (REGEX_PATH_FILTER=".*file_[4-5]");

returns file_4, file_5.

Sorting

Use the FILE_SORT_ORDER_BY option to specify the order in which files are imported.

FILE_SORT_ORDER_BY Options

  • pathname (default)

  • date_modified

  • regex *

  • regex_date *

  • regex_number *

*FILE_SORT_REGEX option required

Using FILE_SORT_ORDER_BY

COPY table_1 from ".../" WITH (FILE_SORT_ORDER_BY="date_modified");

Using FILE_SORT_ORDER_BY with FILE_SORT_REGEX

Regex sort keys are formed by the concatenation of all capture groups from the FILE_SORT_REGEX expression. Regex sort keys are strings but can be converted to dates or FLOAT64 with the appropriate FILE_SORT_ORDER_BY option. File paths that do not match the provided capture groups or that cannot be converted to the appropriate date or FLOAT64 are treated as NULLs and sorted to the front in a deterministic order.

Multiple Capture Groups:

FILE_SORT_REGEX=".*/data_(.*)_(.*)_" /root/dir/unmatchedFile → <NULL> /root/dir/data_andrew_54321_ → andrew54321 /root/dir2/data_brian_Josef_ → brianJosef

Dates:

FILE_SORT_REGEX=".*data_(.*) /root/data_222 → <NULL> (invalid date conversion) /root/data_2020-12-31 → 2020-12-31 /root/dir/data_2021-01-01 → 2021-01-01

Import:

COPY table_1 from ".../" WITH (FILE_SORT_ORDER_BY="regex", FILE_SORT_REGEX=".*file_(.)");

Geo and Raster File Globbing

Limited filename globbing is supported for both geo and raster import. For example, to import a sequence of same-format GeoTIFF files into a single table, you can run the following:

COPY table FROM '/path/path/something_*.tiff' WITH (source_type='raster_file')

The files are imported in alphanumeric sort order, per regular glob rules, and all appended to the same table. This may fail if the files are not all of the same format (band count, names, and types).

For non-geo/raster files (CSV and Parquet), you can provide just the path to the directory OR a wildcard; for example:

/path/to/directory/
/path/to/directory
/path/to/directory/*

For geo/raster files, a wildcard is required, as shown in the last example.

SQLImporter

SQLImporter is a Java utility run at the command line. It runs a SELECT statement on another database through JDBC and loads the result set into HeavyDB.

Usage

java -cp [HEAVY.AI utility jar file]:[3rd party JDBC driver]
com.mapd.utility.SQLImporter
-u <userid> -p <password> [(--binary|--http|--https [--insecure])]
-s <heavyai server host> -db <heavyai db> --port <heavyai server port>
[-d <other database JDBC driver class>] -c <other database JDBC connection string>
-su <other database user> -sp <other database user password> -ss <other database sql statement>
-t <HEAVY.AI target table> -b <transfer buffer size> -f <table fragment size>
[-tr] [-nprg] [-adtf] [-nlj] -i <init commands file>

Flags

-r <arg>                               Row load limit 
-h,--help                              Help message 
-u,--user <arg>                        HEAVY.AI user 
-p,--passwd <arg>                      HEAVY.AI password 
--binary                               Use binary transport to connect to HEAVY.AI 
--http                                 Use http transport to connect to HEAVY.AI 
--https                                Use https transport to connect to HEAVY.AI 
-s,--server <arg>                      HEAVY.AI Server 
-db,--database <arg>                   HEAVY.AI Database 
--port <arg>                           HEAVY.AI Port 
--ca-trust-store <arg>                 CA certificate trust store 
--ca-trust-store-passwd <arg>          CA certificate trust store password 
--insecure                             Insecure TLS - Do not validate HEAVY.AI 
                                       server certificates 
-d,--driver <arg>                      JDBC driver class 
-c,--jdbcConnect <arg>                 JDBC connection string 
-su,--sourceUser <arg>                 Source user 
-sp,--sourcePasswd <arg>               Source password 
-ss,--sqlStmt <arg>                    SQL Select statement 
-t,--targetTable <arg>                 HEAVY.AI Target Table 
-b,--bufferSize <arg>                  Transfer buffer size 
-f,--fragmentSize <arg>                Table fragment size 
-tr,--truncate                         Truncate table if it exists 
-nprg,--noPolyRenderGroups             Disable render group assignment  
-adtf,--allowDoubleToFloat             Allow narrow casting
-nlj,--no-log-jdbc-connection-string   Omit JDBC connection string from logs   
-i,--initializeFile <arg>              File containing init command for DB

HEAVY.AI recommends that you use a service account with read-only permissions when accessing data from a remote database.

In release 4.6 and higher, the user ID (-u) and password (-p) flags are required. If your password includes a special character, you must escape the character using a backslash (\).

If the table does not exist in HeavyDB, SQLImporter creates it. If the target table in HeavyDB does not match the SELECT statement metadata, SQLImporter fails.

If the truncate flag is used, SQLImporter truncates the table in HeavyDB before transferring the data. If the truncate flag is not used, SQLImporter appends the results of the SQL statement to the target table in HeavyDB.

The -i argument provides a path to an initialization file. Each line of the file is sent as a SQL statement to the remote database. You can use -i to set additional custom parameters before the data is loaded.

The SQLImporter class name is case-sensitive. Using the wrong case returns the following error:

Error: Could not find or load main class com.mapd.utility.SQLimporter

PostgreSQL/PostGIS Support

You can migrate geo data types from a PostgreSQL database. The following table shows the correlation between PostgreSQL/PostGIS geo types and HEAVY.AI geo types.

point        -> POINT
lseg         -> LINESTRING
linestring   -> LINESTRING
polygon      -> POLYGON
multipolygon -> MULTIPOLYGON

Other PostgreSQL types, including circle, box, and path, are not supported.

HeavyDB Example

java -cp /opt/heavyai/bin/heavyai-utility-<db-version>.jar 
com.mapd.utility.SQLImporter -u admin -p HyperInteractive -db heavyai --port 6274 
-t mytable -su admin -sp HyperInteractive -c "jdbc:heavyai:myhost:6274:heavyai" 
-ss "select * from mytable limit 1000000000"

By default, 100,000 records are selected from HeavyDB. To select a larger number of records, use the LIMIT statement.

Hive Example

java -cp /opt/heavyai/bin/heavyai-utility-<db-version>.jar:/hive-jdbc-1.2.1000.2.6.1.0-129-standalone.jar
com.mapd.utility.SQLImporter
-u user -p password
-db Heavyai_database_name --port 6274 -t Heavyai_table_name
-su source_user -sp source_password
-c "jdbc:hive2://server_address:port_number/database_name"
-ss "select * from source_table_name"

Google BigQuery Example

java -cp /opt/heavyai/bin/heavyai-utility-<db-version>.jar:./GoogleBigQueryJDBC42.jar:
./google-oauth-client-1.22.0.jar:./google-http-client-jackson2-1.22.0.jar:./google-http-client-1.22.0.jar:./google-api-client-1.22.0.jar:
./google-api-services-bigquery-v2-rev355-1.22.0.jar 
com.mapd.utility.SQLImporter
-d com.simba.googlebigquery.jdbc42.Driver 
-u user -p password
-db Heavyai_database_name --port 6274 -t Heavyai_table_name
-su source_user -sp source_password 
-c "jdbc:bigquery://https://www.googleapis.com/bigquery/v2:443;ProjectId=project-id;OAuthType=0;
OAuthServiceAcctEmail==email@domain.iam.gserviceaccount.com;OAuthPvtKeyPath=/home/simba/myproject.json;"
-ss "select * from schema.source_table_name"

PostgreSQL Example

java -cp /opt/heavyai/bin/heavyai-utility-<db-version>.jar:/tmp/postgresql-42.2.5.jar 
com.mapd.utility.SQLImporter 
-u user -p password
-db Heavyai_database_name --port 6274 -t Heavyai_table_name
-su source_user -sp source_password 
-c "jdbc:postgresql://127.0.0.1/postgres"
-ss "select * from schema_name.source_table_name"

SQLServer Example

java -cp /opt/heavyai/bin/heavyai-utility-<db-version>.jar:/path/sqljdbc4.jar
com.mapd.utility.SQLImporter
-d com.microsoft.sqlserver.jdbc.SQLServerDriver 
-u user -p password
-db Heavyai_database_name --port 6274 -t Heavyai_table_name
-su source_user -sp source_password 
-c "jdbc:sqlserver://server:port;DatabaseName=database_name"
-ss "select top 10 * from dbo.source_table_name"

MySQL Example

java -cp /opt/heavyai/bin/heavyai-utility-<db-version>.jar:mysql/mysql-connector-java-5.1.38-bin.jar
com.mapd.utility.SQLImporter 
-u user -p password
-db Heavyai_database_name --port 6274 -t Heavyai_table_name
-su source_user -sp source_password 
-c "jdbc:mysql://server:port/database_name"
-ss "select * from schema_name.source_table_name"

StreamInsert

Stream data into HeavyDB by attaching the StreamInsert program to the end of a data stream. The data stream can be another program printing to standard out, a Kafka endpoint, or any other real-time stream output. You can specify the appropriate batch size, according to the expected stream rates and your insert frequency. The target table must exist before you attempt to stream data into the table.

<data stream> | StreamInsert <table name> <database name> \
{-u|--user} <user> {-p|--passwd} <password> [{--host} <hostname>] \
[--port <port number>][--delim <delimiter>][--null <null string>] \
[--line <line delimiter>][--batch <batch size>][{-t|--transform} \
transformation ...][--retry_count <num_of_retries>] \
[--retry_wait <wait in secs>][--print_error][--print_transform]

Setting

Default

Description

<table_name>

n/a

Name of the target table in HeavyDB

<database_name>

n/a

Name of the target database in HeavyDB

-u

n/a

User name

-p

n/a

User password

--host

n/a

Name of the HeavyDB host

--delim

comma (,)

Field delimiter, in single quotes

--line

newline (\n)

Line delimiter, in single quotes

--batch

10000

Number of records in a batch

--retry_count

10

Number of attempts before job fails

--retry_wait

5

Wait time in seconds after server connection failure

--null

n/a

String that represents null values

--port

6274

Port number for HeavyDB on localhost

-t, --transform

n/a

Regex transformation

--print_error

False

Print error messages

--print_transform

False

Print description of transform.

--help

n/a

List options

Example

cat file.tsv | /path/to/heavyai/SampleCode/StreamInsert stream_example \
heavyai --host localhost --port 6274 -u imauser -p imapassword \
--delim '\t' --batch 1000

Importing AWS S3 Files

You can use the SQL COPY FROM statement to import files stored on Amazon Web Services Simple Storage Service (AWS S3) into an HEAVY.AI table, in much the same way you would with local files. In the WITH clause, specify the S3 credentials and region information of the bucket accessed.

COPY <table> FROM '<S3_file_URL>' WITH ([[s3_access_key = '<key_name>', s3_secret_key = '<key_secret>',] | [s3_session_token = '<AWS_session_token>',]] s3_region = '<region>');

HEAVY.AI does not support the use of asterisks (*) in URL strings to import items. To import multiple files, pass in an S3 path instead of a file name, and COPY FROM imports all items in that path and any subpath.
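For example, the following sketch imports every object under a hypothetical S3 prefix using temporary AWS STS credentials; the bucket path, keys, and token are placeholders:

COPY trips FROM 's3://mybucket/trip-data/'
  WITH (s3_access_key='xxxxxxxx',
        s3_secret_key='yyyyyyyy',
        s3_session_token='zzzzzzzz',
        s3_region='us-west-1');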

Custom S3 Endpoints

HEAVY.AI supports custom S3 endpoints, which allows you to import data from S3-compatible services, such as Google Cloud Storage.

To use custom S3 endpoints, add s3_endpoint to the WITH clause of a COPY FROM statement; for example, to set the S3 endpoint to point to Google Cloud Services:

COPY trips FROM 's3://heavyai-importtest-data/trip-data/trip_data_9.gz' WITH (header='true', s3_endpoint='storage.googleapis.com');

You can also configure custom S3 endpoints by passing the s3_endpoint field to Thrift import_table.

Examples

heavysql> COPY trips FROM 's3://heavyai-s3-no-access/trip_data_9.gz';
Exception: failed to list objects of s3 url 's3://heavyai-s3-no-access/trip_data_9.gz': AccessDenied: Access Denied
heavysql> COPY trips FROM 's3://heavyai-s3-no-access/trip_data_9.gz' with (s3_access_key='xxxxxxxxxx',s3_secret_key='yyyyyyyyy');
Exception: failed to list objects of s3 url 's3://heavyai-s3-no-access/trip_data_9.gz': AuthorizationHeaderMalformed: Unable to parse ExceptionName: AuthorizationHeaderMalformed Message: The authorization header is malformed; the region 'us-east-1' is wrong; expecting 'us-west-1'
heavysql> COPY trips FROM 's3://heavyai-testdata/trip.compressed/trip_data_9.csv' with (s3_access_key='xxxxxxxx',s3_secret_key='yyyyyyyy',s3_region='us-west-1');
Result
Loaded: 100 recs, Rejected: 0 recs in 0.361000 secs

The following example imports all the files in the trip.compressed directory.

heavysql> copy trips from 's3://heavyai-testdata/trip.compressed/' with (s3_access_key='xxxxxxxx',s3_secret_key='yyyyyyyy',s3_region='us-west-1');
Result
Loaded: 105200 recs, Rejected: 0 recs in 1.890000 secs

trips Table

The table trips is created with the following statement:

heavysql> \d trips
        CREATE TABLE trips (
        medallion TEXT ENCODING DICT(32),
        hack_license TEXT ENCODING DICT(32),
        vendor_id TEXT ENCODING DICT(32),
        rate_code_id SMALLINT,
        store_and_fwd_flag TEXT ENCODING DICT(32),
        pickup_datetime TIMESTAMP,
        dropoff_datetime TIMESTAMP,
        passenger_count SMALLINT,
        trip_time_in_secs INTEGER,
        trip_distance DECIMAL(14,2),
        pickup_longitude DECIMAL(14,2),
        pickup_latitude DECIMAL(14,2),
        dropoff_longitude DECIMAL(14,2),
        dropoff_latitude DECIMAL(14,2))
WITH (FRAGMENT_SIZE = 75000000);

Using Server Privileges to Access AWS S3

You can configure the HEAVY.AI server to provide AWS credentials, which allows S3 queries to run without specifying AWS credentials in the query. S3 regions are not configured by the server and must be passed in either as a client-side environment variable or as an option with the request.

Example Commands

  • \detect:

    $ export AWS_REGION=us-west-1
    heavysql> \detect <s3-bucket-uri>

  • import_table:

    $ ./Heavyai-remote -h localhost:6274 import_table "'<session-id>'" "<table-name>" '<s3-bucket-uri>' 'TCopyParams(s3_region="'us-west-1'")'

  • COPY FROM:

    heavysql> COPY <table-name> FROM '<s3-bucket-uri>' WITH (s3_region='us-west-1');

Configuring AWS Credentials

To configure credentials using environment variables:

  1. Enable server privileges in the server configuration file heavy.conf: allow-s3-server-privileges = true

  2. For bare-metal installations, set the following environment variables and restart the HeavyDB service: AWS_ACCESS_KEY_ID=xxx AWS_SECRET_ACCESS_KEY=xxx AWS_SESSION_TOKEN=xxx (required only for AWS STS credentials)

  3. For HeavyDB Docker images, start a new container mounted with the configuration file using the option -v <dirname-containing-heavy.conf>:/var/lib/heavyai and set the following environment options: -e AWS_ACCESS_KEY_ID=xxx -e AWS_SECRET_ACCESS_KEY=xxx -e AWS_SESSION_TOKEN=xxx (required only for AWS STS credentials)

To configure credentials using a shared AWS credentials file:

  1. Enable server privileges in the server configuration file heavy.conf: allow-s3-server-privileges = true

  2. For bare-metal installations, specify a shared AWS credentials file and profile with the following environment variables and restart the HeavyDB service: AWS_SHARED_CREDENTIALS_FILE=~/.aws/credentials AWS_PROFILE=default

  3. For HeavyDB Docker images, start a new container mounted with the configuration file and AWS shared credentials file using the following options: -v <dirname-containing-heavy.conf>:/var/lib/heavyai -v <dirname-containing-credentials>:/<container-credential-path> and set the following environment options: -e AWS_SHARED_CREDENTIALS_FILE=<container-credential-path> -e AWS_PROFILE=<active-profile>

Prerequisites

  1. An IAM Policy that has sufficient access to the S3 bucket.

  2. An IAM AWS Service Role of type Amazon EC2 , which is assigned the IAM Policy from (1).

Setting Up an EC2 Instance with Roles

For a new EC2 Instance:

  1. AWS Management Console > Services > Compute > EC2 > Launch Instance.

  2. Select desired Amazon Machine Image (AMI) > Select.

  3. Select desired Instance Type > Next: Configure Instance Details.

  4. IAM Role > Select desired IAM Role > Review and Launch.

  5. Review other options > Launch.

For an existing EC2 Instance:

  1. AWS Management Console > Services > Compute > EC2 > Instances.

  2. Mark desired instance(s) > Actions > Security > Modify IAM Role.

  3. Select desired IAM Role > Save.

  4. Restart the EC2 Instance.

KafkaImporter

You can ingest data from an existing Kafka producer to an existing table in HEAVY.AI using KafkaImporter on the command line:

KafkaImporter <table_name> <database_name> {-u|--user} <user_name> \
{-p|--passwd} <user_password> [{--host} <hostname>] \
[--port <HeavyDB_port>] [--http] [--https] [--skip-verify] \
[--ca-cert <path>] [--delim <delimiter>] [--batch <batch_size>] \
[{-t|--transform} transformation ...] [--retry_count <retry_number>] \
[--retry_wait <delay_in_seconds>] [--null <null_value_string>] [--quoted true|false] \
[--line <line_delimiter>] --brokers=<broker_name:broker_port> \
--group-id=<kafka_group_id> --topic=<topic_type> [--print_error] [--print_transform]

KafkaImporter Options

Setting

Default

Description

<table_name>

n/a

Name of the target table in HeavyDB

<database_name>

n/a

Name of the target database in HeavyDB

-u <username>

n/a

User name

-p <password>

n/a

User password

--host <hostname>

localhost

Name of the HeavyDB host

--port <port_number>

6274

Port number for HeavyDB on localhost

--http

n/a

Use HTTP transport

--https

n/a

Use HTTPS transport

--skip-verify

n/a

Do not verify validity of SSL certificate

--ca-cert <path>

n/a

Path to the trusted server certificate; initiates an encrypted connection

--delim <delimiter>

comma (,)

Field delimiter, in single quotes

--line <delimiter>

newline (\n)

Line delimiter, in single quotes

--batch <batch_size>

10000

Number of records in a batch

--retry_count <retry_number>

10

Number of attempts before job fails

--retry_wait <seconds>

5

Wait time in seconds after server connection failure

--null <string>

n/a

String that represents null values

--quoted <boolean>

false

Whether the source contains quoted fields

-t, --transform

n/a

Regex transformation

--print_error

false

Print error messages

--print_transform

false

Print description of transform

--help

n/a

List options

--group-id <id>

n/a

Kafka group ID

--topic <topic>

n/a

The Kafka topic to be ingested

--brokers <broker_name:broker_port>

localhost:9092

One or more brokers

KafkaImporter Logging Options

KafkaImporter Logging Options

Setting

Default

Description

--log-directory <directory>

mapd_log

Logging directory; can be relative to data directory or absolute

--log-file-name <filename>

n/a

Log filename relative to logging directory; has format KafkaImporter.{SEVERITY}.%Y%m%d-%H%M%S.log

--log-symlink <symlink>

n/a

Symlink to active log; has format KafkaImporter.{SEVERITY}

--log-severity <level>

INFO

Log-to-file severity level: INFO, WARNING, ERROR, or FATAL

--log-severity-clog <level>

ERROR

Log-to-console severity level: INFO, WARNING, ERROR, or FATAL

--log-channels

n/a

Log channel debug info

--log-auto-flush

n/a

Flush logging buffer to file after each message

--log-max-files <files_number>

100

Maximum number of log files to keep

--log-min-free-space <bytes>

20,971,520

Minimum number of bytes available on the device before oldest log files are deleted

--log-rotate-daily

1

Start new log files at midnight

--log-rotation-size <bytes>

10485760

Maximum file size, in bytes, before new log files are created

Configure KafkaImporter to use your target table. KafkaImporter listens to a pre-defined Kafka topic associated with your table. You must create the table before using the KafkaImporter utility. For example, you might have a table named customer_site_visit_events that listens to a topic named customer_site_visit_events_topic.
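As a sketch, you might create such a target table before starting KafkaImporter as follows; the column names and types are hypothetical and must match the records published to the topic:

CREATE TABLE customer_site_visit_events (
  event_time TIMESTAMP,
  customer_id BIGINT,
  page_url TEXT ENCODING DICT(32),
  visit_duration_secs INTEGER);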

The data format must be a record-level format supported by HEAVY.AI.

KafkaImporter listens to the topic, validates records against the target schema, and ingests topic batches of your designated size to the target table. Rejected records use the existing reject reporting mechanism. You can start, shut down, and configure KafkaImporter independent of the HeavyDB engine. If KafkaImporter is running and the database shuts down, KafkaImporter shuts down as well. Reads from the topic are nondestructive.

KafkaImporter is not responsible for event ordering; a streaming platform outside HEAVY.AI (for example, Spark streaming, flink) should handle the stream processing. HEAVY.AI ingests the end-state stream of post-processed events.

KafkaImporter does not handle dynamic schema creation on first ingest, but must be configured with a specific target table (and its schema) as the basis. There is a 1:1 correspondence between target table and topic.

cat tweets.tsv | ./KafkaImporter tweets_small heavyai \
-u imauser \
-p imapassword \
--delim '\t' \
--batch 100000 \
--retry_count 360 \
--retry_wait 10 \
--null null \
--port 9999 \
--brokers=localhost:9092 \
--group-id=testImport1 \
--topic=tweet

StreamImporter

StreamImporter is an updated version of the StreamInsert utility used for streaming reads from delimited files into HeavyDB. StreamImporter uses a binary columnar load path, providing improved performance compared to StreamInsert.

You can ingest data from a data stream to an existing table in HEAVY.AI using StreamImporter on the command line.

StreamImporter <table_name> <database_name> {-u|--user} <user_name> \
{-p|--passwd} <user_password> [{--host} <hostname>] [--port <HeavyDB_port>] \
[--http] [--https] [--skip-verify] [--ca-cert <path>] [--delim <delimiter>] \
[--null <null string>] [--line <line delimiter>] [--quoted <boolean>] \
[--batch <batch_size>] [{-t|--transform} transformation ...] \
[--retry_count <number_of_retries>] [--retry_wait <delay_in_seconds>] \
[--print_error] [--print_transform]

StreamImporter Options

Setting

Default

Description

<table_name>

n/a

Name of the target table in HeavyDB

<database_name>

n/a

Name of the target database in HeavyDB

-u <username>

n/a

User name

-p <password>

n/a

User password

--host <hostname>

n/a

Name of the HeavyDB host

--port <port>

6274

Port number for HeavyDB on localhost

--http

n/a

Use HTTP transport

--https

n/a

Use HTTPS transport

--skip-verify

n/a

Do not verify validity of SSL certificate

--ca-cert <path>

n/a

Path to the trusted server certificate; initiates an encrypted connection

--delim <delimiter>

comma (,)

Field delimiter, in single quotes

--null <string>

n/a

String that represents null values

--line <delimiter>

newline (\n)

Line delimiter, in single quotes

--quoted <boolean>

true

Either true or false, indicating whether the input file contains quoted fields.

--batch <number>

10000

Number of records in a batch

--retry_count <retry_number>

10

Number of attempts before job fails

--retry_wait <seconds>

5

Wait time in seconds after server connection failure

-t, --transform

n/a

Regex transformation

--print_error

false

Print error messages

--print_transform

false

Print description of transform

--help

n/a

List options

StreamImporter Logging Options

Setting

Default

Description

--log-directory <directory>

mapd_log

Logging directory; can be relative to data directory or absolute

--log-file-name <filename>

n/a

Log filename relative to logging directory; has format StreamImporter.{SEVERITY}.%Y%m%d-%H%M%S.log

--log-symlink <symlink>

n/a

Symlink to active log; has format StreamImporter.{SEVERITY}

--log-severity <level>

INFO

Log-to-file severity level: INFO, WARNING, ERROR, or FATAL

--log-severity-clog <level>

ERROR

Log-to-console severity level: INFO, WARNING, ERROR, or FATAL

--log-channels

n/a

Log channel debug info

--log-auto-flush

n/a

Flush logging buffer to file after each message

--log-max-files <files_number>

100

Maximum number of log files to keep

--log-min-free-space <bytes>

20,971,520

Minimum number of bytes available on the device before oldest log files are deleted

--log-rotate-daily

1

Start new log files at midnight

--log-rotation-size <bytes>

10485760

Maximum file size, in bytes, before new log files are created

Configure StreamImporter to use your target table. StreamImporter listens to a pre-defined data stream associated with your table. You must create the table before using the StreamImporter utility.

The data format must be a record-level format supported by HEAVY.AI.

StreamImporter listens to the stream, validates records against the target schema, and ingests batches of your designated size to the target table. Rejected records use the existing reject reporting mechanism. You can start, shut down, and configure StreamImporter independent of the HeavyDB engine. If StreamImporter is running but the database shuts down, StreamImporter shuts down as well. Reads from the stream are non-destructive.

StreamImporter is not responsible for event ordering - a first class streaming platform outside HEAVY.AI (for example, Spark streaming, flink) should handle the stream processing. HEAVY.AI ingests the end-state stream of post-processed events.

StreamImporter does not handle dynamic schema creation on first ingest, but must be configured with a specific target table (and its schema) as the basis.

There is a 1:1 correspondence between the target table and the input stream.

cat tweets.tsv | ./StreamImporter tweets_small heavyai \
-u imauser \
-p imapassword \
--delim '\t' \
--batch 100000 \
--retry_count 360 \
--retry_wait 10 \
--null null \
--port 9999

Importing Data from HDFS with Sqoop

You can consume a CSV or Parquet file residing in HDFS (Hadoop Distributed File System) into HeavyDB.

Copy the HEAVY.AI JDBC driver into the Apache Sqoop library, normally found at /usr/lib/sqoop/lib/.

Example

sqoop-export --table iAmATable \
--export-dir /user/cloudera/ \
--connect "jdbc:heavyai:000.000.000.0:6274:heavyai" \
--driver com.heavyai.jdbc.HeavyaiDriver \
--username imauser \
--password imapassword \
--direct \
--batch

The --connect parameter is the address of a valid JDBC port on your HEAVY.AI instance.

Troubleshooting: Avoiding Duplicate Rows

To detect duplication prior to loading data into HeavyDB, you can perform the following steps. For this example, the files are labeled A,B,C...Z.

  1. Load file A into table MYTABLE.

  2. Run the following query.

    select count(t1.uniqueCol) as dups from MYTABLE t1 join MYTABLE t2 on t1.uniqueCol = t2.uniqueCol;

    There should be no rows returned; if rows are returned, your first A file is not unique.

  3. Load file B into table TEMPTABLE.

  4. Run the following query.

    select count(t1.uniqueCol) as dups from MYTABLE t1 join TEMPTABLE t2 on t1.uniqueCol = t2.uniqueCol;

    There should be no rows returned if file B is unique. Fix B if the information is not unique using details from the selection.

  5. Load the fixed B file into MYTABLE.

  6. Drop table TEMPTABLE.

  7. Repeat steps 3-6 for the rest of the set for each file prior to loading the data to the real MYTABLE instance.

Users and Databases

DDL - Users and Databases

HEAVY.AI has a default superuser named admin with default password HyperInteractive.

When you create or alter a user, you can grant superuser privileges by setting the is_super property.

You can also specify a default database when you create or alter a user by using the default_db property. During login, if a database is not specified, the server uses the default database assigned to that user. If no default database is assigned to the user and no database is specified during login, the heavyai database is used.

When an administrator, superuser, or owner drops or renames a database, all current active sessions for users logged in to that database are invalidated. The users must log in again.

Similarly, when an administrator or superuser drops or renames a user, all active sessions for that user are immediately invalidated.

If a password includes characters that are nonalphanumeric, it must be enclosed in single quotes when logging in to heavysql. For example: $HEAVYAI_PATH/bin/heavysql heavyai -u admin -p '77Heavy!9Ai'

Nomenclature Constraints

  • A NAME is [A-Za-z_][A-Za-z0-9\$_]*

  • A DASHEDNAME is [A-Za-z_][A-Za-z0-9\$_\-]*

  • An EMAIL is ([^[:space:]\"]+|\".+\")@[A-Za-z0-9][A-Za-z0-9\-\.]*\.[A-Za-z]+

User objects can use NAME, DASHEDNAME, or EMAIL format.

Role objects must use either NAME or DASHEDNAME format.

Database and column objects must use NAME format.

CREATE USER

CREATE USER ["]<name>["] (<property> = value,...);

HEAVY.AI accepts (almost) any string enclosed in optional double quotation marks as the user name.

Property

Value

password

User's password.

is_super

Set to true if user is a superuser. Default is false.

default_db

User's default database on login.

can_login

Set to true (default/implicit) to activate a user.

When false, the user still retains all defined privileges and configuration settings, but cannot log in to HEAVY.AI. Deactivated users who try to log in receive the error message "Unauthorized Access: User is deactivated."

Examples:

CREATE USER jason (password = 'HeavyaiRocks!', is_super = 'true', default_db='tweets');
CREATE USER "pembroke.q.aloysius" (password= 'HeavyaiRolls!', default_db='heavyai');

DROP USER

DROP USER [IF EXISTS] ["]<name>["];

Example:

DROP USER IF EXISTS jason;
DROP USER "pembroke.q.aloysius";

ALTER USER

ALTER USER ["]<name>["] (<property> = value, ...);
ALTER USER ["]<oldUserName>["] RENAME TO ["]<newUserName>["];

HEAVY.AI accepts (almost) any string enclosed in optional double quotation marks as the old or new user name.

Property

Value

password

User's password.

is_super

Set to true if user is a superuser. Default is false.

default_db

User's default database on login.

can_login

Set to true (default/implicit) to activate a user.

When false, the user still retains all defined privileges and configuration settings, but cannot log in to HEAVY.AI. Deactivated users who try to log in receive the error message "Unauthorized Access: User is deactivated."

Example:

ALTER USER admin (password = 'HeavyaiIsFast!');
ALTER USER jason (is_super = 'false', password = 'SilkySmooth', default_db='traffic');
ALTER USER methuselah RENAME TO aurora;
ALTER USER "pembroke.q.aloysius" RENAME TO "pembroke.q.murgatroyd";
ALTER USER chumley (can_login='false');

CREATE DATABASE

CREATE DATABASE [IF NOT EXISTS] <name> (<property> = value, ...);

Database names cannot include quotes, spaces, or special characters.

In Release 6.3.0 and later, database names are case insensitive. Duplicate database names will cause a failure when attempting to start HeavyDB 6.3.0 or higher. Check database names and revise as necessary to avoid duplicate names.

Property

Value

owner

User name of the database owner.

Example:

CREATE DATABASE test (owner = 'jason');

DROP DATABASE

DROP DATABASE [IF EXISTS] <name>;

Example:

DROP DATABASE IF EXISTS test;

ALTER DATABASE

ALTER DATABASE <current_name> RENAME TO <new_name>;

To alter a database, you must be the owner of the database or a HeavyDB superuser.

Example:

ALTER DATABASE curmudgeonlyOldDatabase RENAME TO ingenuousNewDatabase;

ALTER DATABASE OWNER TO

Enable super users to change the owner of a database.

ALTER DATABASE <database name> OWNER TO <new_owner>;

Example

Change the owner of my_database to user Joe:

ALTER DATABASE my_database OWNER TO Joe;

Only superusers can run the ALTER DATABASE OWNER TO command.

REASSIGN OWNED

REASSIGN [ALL] OWNED BY <old_owner>, <old_owner>, ... TO <new_owner>

Changes ownership of database objects (tables, views, dashboards, and so on) from a user or set of users to a different user. When the ALL keyword is specified, the ownership change applies to database objects across all databases. Otherwise, the ownership change applies only to database objects in the current database.

Example: Reassign database objects owned by jason and mike in the current database to joe.

REASSIGN OWNED BY jason, mike TO joe;

Example: Reassign database objects owned by jason and mike across all databases to joe.

REASSIGN ALL OWNED BY jason, mike TO joe;

Database object ownership changes only for objects within the database; ownership of the database itself is not affected. You must be a superuser to run this command.

Database Security Example

Importing Geospatial Data

If there is a potential for duplicate entries and you want to avoid loading duplicate rows, see How can I avoid creating duplicate rows? on the Troubleshooting page.

Importing Geospatial Data Using Heavy Immerse

You can use Heavy Immerse to import geospatial data into HeavyDB.

Supported formats include:

  • Keyhole Markup Language (.kml)

  • GeoJSON (.geojson)

  • Shapefiles (.shp)

  • FlatGeobuf (.fgb)

Shapefiles include four mandatory files: .shp, .shx, .dbf, and .prj. If you do not import the .prj file, the coordinate system will be incorrect and you cannot render the shapes on a map.

To import geospatial definition data:

  1. Open Heavy Immerse.

  2. Click Data Manager.

  3. Click Import Data.

  4. Click the large + icon to select files for upload, or drag and drop the files to the Data Importer screen.

    When importing shapefiles, upload all required file types at the same time. If you upload them separately, Heavy Immerse issues an error message.

  5. Wait for the uploads to complete (indicated by green checkmarks on the file icons), then click Preview.

  6. On the Data Preview screen:

    • Edit the column headers (if needed).

    • Enter a name for the table in the field at the bottom of the screen.

    • If you are loading the data files into a distributed system, verify under Import Settings that the Replicate Table checkbox is selected.

    • Click Import Data.

  7. On the Successfully Imported Table screen, verify the rows and columns that compose your data table.

Importing Well-Known Text

When representing longitude and latitude in HEAVY.AI geospatial primitives, the first coordinate is assumed to be longitude by default.

WKT Data Supported in Geospatial Columns

You can use heavysql to define tables with columns that store WKT geospatial objects.

heavysql> \d geo
CREATE TABLE geo (
p POINT,
l LINESTRING,
poly POLYGON)

Insert

You can use heavysql to insert data as WKT string values.

heavysql> INSERT INTO geo values('POINT(20 20)', 'LINESTRING(40 0, 40 40)', 
'POLYGON(( 0 0, 40 0, 40 40, 0 40, 0 0 ))');

Importing Delimited Files

You can insert data from CSV/TSV files containing WKT strings. HEAVY.AI supports Latin-1 ASCII format and UTF-8. If you want to load data with another encoding (for example, UTF-16), convert the data to UTF-8 before loading it to HEAVY.AI.

> cat geo.csv
"p", "l", "poly"
"POINT(1 1)", "LINESTRING( 2 0,  2  2)", "POLYGON(( 1 0,  0 1, 1 1 ))"
"POINT(2 2)", "LINESTRING( 4 0,  4  4)", "POLYGON(( 2 0,  0 2, 2 2 ))"
"POINT(3 3)", "LINESTRING( 6 0,  6  6)", "POLYGON(( 3 0,  0 3, 3 3 ))"
"POINT(4 4)", "LINESTRING( 8 0,  8  8)", "POLYGON(( 4 0,  0 4, 4 4 ))"
heavysql> COPY geo FROM 'geo.csv';
Result
Loaded: 4 recs, Rejected: 0 recs in 0.356000 secs

You can use your own custom delimiter in your data files.

> cat geo1.csv
"p", "l", "poly"
POINT(5 5); LINESTRING(10 0, 10 10); POLYGON(( 5 0, 0 5, 5 5 ))
heavysql> COPY geo FROM 'geo1.csv' WITH (delimiter=';', quoted='false');
Result
Loaded: 1 recs, Rejected: 0 recs in 0.148000 secs

Importing Legacy CSV/TSV Files

Storing Geo Data

You can import CSV and TSV files for tables that store longitude and latitude as either:

  • Separate consecutive scalar columns

  • A POINT field.

If the data is stored as a POINT, you can use spatial functions like ST_Distance and ST_Contains. When location data are stored as a POINT column, they are displayed as such when querying the table:

select * from destination_points;
name|pt
Just Fishing Around|POINT (-85.499999999727588 44.6929999755849)
Moonlight Cove Waterfront|POINT (-85.5046011346879 44.6758447935227)

If two geometries are used in one operation (for example, in ST_Distance), the SRID values need to match.

Importing the Data

If you are using heavysql, create the table in HEAVY.AI with the POINT field defined as below:

CREATE TABLE new_geo (p GEOMETRY(POINT,4326))

Then, import the file using COPY FROM in heavysql. By default, the two columns are consumed as longitude x and then latitude y. If the order of the coordinates in the CSV file is reversed, load the data using the WITH option lonlat='false':

heavysql> COPY new_geo FROM 'legacy_geo.csv' WITH (lonlat='false');

Other columns can exist on either side of the lon/lat pair or POINT field; the coordinates in the source file do not have to be at the beginning or end of the target table.

If the imported coordinates are not in SRID 4326 (for example, 2263), you can transform them to 4326 on the fly:

heavysql> COPY new_geo FROM 'legacy_geo_2263.csv' WITH (source_srid=2263, lonlat='false');

Importing CSV, TSV, and TXT Files in Immerse

In Immerse, you define the table when loading the data instead of predefining it before import. Immerse supports appending data to a table by loading one or more files.

Longitude and latitude can be imported as separate columns.

Importing Geospatial Files

You can create geo tables by importing specific geo file formats. HEAVY.AI supports the following types:

  • ESRI shapefile (.shp and associated files)

  • GeoJSON (.geojson or .json)

  • KML (.kml or .kmz)

  • ESRI file geodatabase (.gdb)

You import geo files using the COPY FROM command with the geo option:

heavysql> COPY states FROM 'states.shp' WITH (geo='true');
heavysql> COPY zipcodes FROM 'zipcodes.geojson' WITH (geo='true');
heavysql> COPY cell_towers FROM 'cell_towers.kml' WITH (geo='true');

The geo file import process automatically creates the table by detecting the column names and types explicitly described in the geo file header. It then creates a single geo column (always called heavyai_geo) that is of one of the supported types (POINT, MULTIPOINT, LINESTRING, MULTILINESTRING, POLYGON, or MULTIPOLYGON).

In Release 6.2 and higher, polygon render metadata assignment is disabled by default. This data is no longer required by the new polygon rendering algorithm introduced in Release 6.0. The new default results in significantly faster import for polygon table imports, particularly high-cardinality tables.

If you need to revert to the legacy polygon rendering algorithm, polygons from tables imported in Release 6.2 may not render correctly. Those tables must be re-imported after setting the server configuration flag enable-assign-render-groups to true.

The legacy polygon rendering algorithm and polygon render metadata server config will be removed completely in an upcoming release.

Due to the prevalence of mixed POLYGON/MULTIPOLYGON geo files (and CSVs), if HEAVY.AI detects a POLYGON type geo file, HEAVY.AI creates a MULTIPOLYGON column and imports the data as single polygons.

If the table does not already exist, it is created automatically.

If the table already exists, and the data in the geo file has exactly the same column structure, the new file is appended to the existing table. This enables import of large geo data sets split across multiple files. The new file is rejected if it does not have the same column structure.

By default, geo data is stored as GEOMETRY.

You can also create tables with coordinates in SRID 3857 or SRID 900913 (Google Web Mercator). Importing data from shapefiles using SRID 3857 or 900913 is supported; importing data from delimited files into tables with these SRIDs is not supported at this time. To explicitly store in other formats, use the following WITH options in addition to geo='true':

Compression used:

  • COMPRESSED(32) - 50% compression (default)

  • None - No compression

Spatial reference identifier (SRID) type:

  • 4326 - EPSG:4326 (default)

  • 900913 - Google Web Mercator

  • 3857 - EPSG:3857

For example, the following explicitly sets the default values for encoding and SRID:

geo_coords_encoding='COMPRESSED(32)'
geo_coords_srid=4326
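Wrapped in a full COPY FROM statement, these options look like the following sketch; the table and file names are placeholders:

COPY states FROM 'states.shp'
  WITH (geo='true',
        geo_coords_encoding='COMPRESSED(32)',
        geo_coords_srid=4326);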

Note that rendering of geo MULTIPOINT is not yet supported.

Importing an ESRI File Geodatabase

An ESRI file geodatabase (.gdb) provides a method of storing GIS information in one large file that can have one or more "layers", with each layer containing disparate but related data. The data in each layer can be of different types. Importing a .gdb file results in the creation of one table for each layer in the file. You import an ESRI file geodatabase the same way that you import other geo file formats, using the COPY FROM command with the geo option:

heavysql> COPY counties FROM 'counties.gdb' WITH (geo='true');

The layers in the file are scanned and defined by name and contents. Contents are classified as EMPTY, GEO, NON_GEO or UNSUPPORTED_GEO:

  • EMPTY layers are skipped because they contain no useful data.

  • GEO layers contain one or more geo columns of a supported type (POINT, MULTIPOINT, LINESTRING, MULTILINESTRING, POLYGON, MULTIPOLYGON) and one or more regular columns, and can be imported to a single table in the same way as the other geo file formats.

  • NON_GEO layers contain no geo columns and one or more regular columns, and can be imported to a regular table. Although the data comes from a geo file, data in this layer does not result in a geo table.

  • UNSUPPORTED_GEO layers contain geo columns of a type not currently supported (for example, GEOMETRYCOLLECTION). These layers are skipped because they cannot be imported completely.

A single COPY FROM command can result in multiple tables, one for each layer in the file. The table names are automatically generated by appending the layer name to the provided table name.

For example, consider the geodatabase file mydata.gdb which contains two importable layers with names A and B. Running COPY FROM creates two tables, mydata_A and mydata_B, with the data from layers A and B, respectively. The layer names are appended to the provided table name. If the geodatabase file only contains one layer, the layer name is not appended.

You can load one specific layer from the geodatabase file by using the geo_layer_name option:

COPY mydata FROM 'mydata.gdb' WITH (geo='true', geo_layer_name='A');

This loads only layer A, if it is importable. The resulting table is called mydata, and the layer name is not appended. Use this import method if you want to set a different name for each table. If the layer name from the geodatabase file would result in an illegal table name when appended, the name is sanitized by removing any illegal characters.

Importing Geo Files from Archives or Non-Local Storage

You can import geo files directly from archive files (for example, .zip .tar .tgz .tar.gz) without unpacking the archive. You can directly import individual geo files compressed with Zip or GZip (GeoJSON and KML only). The server opens the archive header and loads the first candidate file it finds (.shp .geojson .json .kml), along with any associated files (in the case of an ESRI Shapefile, the associated files must be siblings of the first).

$ unzip -l states.zip
Archive:  states.zip
  Length      Date    Time    Name
---------  ---------- -----   ----
        0  2018-02-13 11:09   states/
   446116  2017-11-06 12:15   states/cb_2014_us_state_20m.shp
     8434  2017-11-06 12:15   states/cb_2014_us_state_20m.dbf
        9  2017-11-06 12:15   states/cb_2014_us_state_20m.cpg
      165  2017-11-06 12:15   states/cb_2014_us_state_20m.prj
      516  2017-11-06 12:15   states/cb_2014_us_state_20m.shx
---------                     -------
   491525                     6 files

heavysql> COPY states FROM 'states.zip' with (geo='true');
heavysql> COPY zipcodes FROM 'zipcodes.geojson.gz' with (geo='true');
heavysql> COPY zipcodes FROM 'zipcodes.geojson.zip' with (geo='true');
heavysql> COPY cell_towers FROM 'cell_towers.kml.gz' with (geo='true');

You can import geo files or archives directly from an Amazon S3 bucket.

heavysql> COPY states FROM 's3://mybucket/myfolder/states.shp' with (geo='true');
heavysql> COPY states FROM 's3://mybucket/myfolder/states.zip' with (geo='true');
heavysql> COPY zipcodes FROM 's3://mybucket/myfolder/zipcodes.geojson.gz' with (geo='true');
heavysql> COPY zipcodes FROM 's3://mybucket/myfolder/zipcodes.geojson.zip' with (geo='true');

You can provide Amazon S3 credentials, if required, by setting variables in the environment of the heavysql process:

AWS_REGION=us-west-1
AWS_ACCESS_KEY_ID=********************
AWS_SECRET_ACCESS_KEY=****************************************

You can also provide your credentials explicitly in the COPY FROM command.

heavysql> COPY states FROM 's3://mybucket/myfolder/states.zip' WITH (geo='true', s3_region='us-west-1', s3_access_key='********************', s3_secret_key='****************************************');  

You can import geo files or archives directly from an HTTP/HTTPS website.

heavysql> COPY states FROM 'http://www.mysite.com/myfolder/states.zip' with (geo='true');


WGS84 Coordinate Compression

You can extend a column type specification to include spatial reference (SRID) and compression mode information.

Geospatial objects declared with SRID 4326 are compressed 50% by default with ENCODING COMPRESSED(32). In the following definition of table geo2, the columns poly2 and mpoly2 are compressed.

CREATE TABLE geo2 (
p2 GEOMETRY(POINT, 4326) ENCODING NONE,
l2 GEOMETRY(LINESTRING, 900913),
poly2 GEOMETRY(POLYGON, 4326),
mpoly2 GEOMETRY(MULTIPOLYGON, 4326) ENCODING COMPRESSED(32));

COMPRESSED(32) compression maps lon/lat degree ranges to 32-bit integers, providing a smaller memory footprint and faster query execution. The effect on precision is small, approximately 4 inches at the equator.

You can disable compression by explicitly choosing ENCODING NONE.

System Tables

HeavyDB system tables provide a way to access information about database objects, database object permissions, and system resource (storage, CPU, and GPU memory) utilization. These system tables can be found in the information_schema database that is available by default on server startup. You can query system tables in the same way as regular tables, and you can use the SHOW CREATE TABLE command to view the table schemas.
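For example, after connecting to the information_schema database you can inspect a system table's schema and query it directly; the columns used here are described in the Users table below:

SHOW CREATE TABLE users;

SELECT user_name, is_super_user, default_db_name, can_login
FROM users
ORDER BY user_name;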

Users

The users system table provides information about all database users and contains the following columns:

Column Name

Column Type

Description

user_id

INTEGER

ID of database user.

user_name

TEXT

Username of database user.

is_super_user

BOOLEAN

Indicates whether or not the database user is a super user.

default_db_id

INTEGER

ID of user’s default database on login.

default_db_name

TEXT

Name of user’s default database on login.

can_login

BOOLEAN

Indicates whether or not the database user account is activated and can log in.

Databases

The databases system table provides information about all created databases on the server and contains the following columns:

Column Name

Column Type

Description

database_id

INTEGER

ID of database.

database_name

TEXT

Name of database.

owner_id

INTEGER

User ID of database owner.

owner_user_name

TEXT

Username of database owner.

Permissions

The permissions system table provides information about all user/role permissions for all database objects and contains the following columns:

Column Name

Column Type

Description

role_name

TEXT

Username or role name associated with permission.

is_user_role

BOOLEAN

Boolean indicating whether or not the role_name column identifies a user or a role.

database_id

INTEGER

ID of database that contains the database object for which permission was granted.

database_name

TEXT

Name of database that contains the database object on which permission was granted.

object_name

TEXT

Name of database object on which permission was granted.

object_id

INTEGER

ID of database object on which permission was granted.

object_owner_id

INTEGER

User id of the owner of the database object on which permission was granted.

object_owner_user_name

TEXT

Username of the owner of the database object on which permission was granted.

object_permission_type

TEXT

Type of database object on which permission was granted.

object_permissions

TEXT[]

List of permissions that were granted on database object.

Roles

The roles system table lists all created database roles and contains the following columns:

Column Name

Column Type

Description

role_name

TEXT

Role name.

Tables

The tables system table provides information about all database tables and contains the following columns:

Column Name

Column Type

Description

database_id

INTEGER

ID of database that contains the table.

database_name

TEXT

Name of database that contains the table.

table_id

INTEGER

Table ID.

table_name

TEXT

Table name.

owner_id

INTEGER

User ID of table owner.

owner_user_name

TEXT

Username of table owner.

column_count

INTEGER

Number of table columns. Note that internal system columns are included in this count.

table_type

TEXT

Type of table. Possible values are DEFAULT, VIEW, TEMPORARY, and FOREIGN.

view_sql

TEXT

For views, SQL statement used in the view.

max_fragment_size

INTEGER

Number of rows per fragment used by the table.

max_chunk_size

BIGINT

Maximum size (in bytes) of table chunks.

fragment_page_size

INTEGER

Size (in bytes) of table data pages.

max_rows

BIGINT

Maximum number of rows allowed by table.

max_rollback_epochs

INTEGER

Maximum number of epochs a table can be rolled back to.

shard_count

INTEGER

Number of shards that exists for table.

ddl_statement

TEXT

CREATE TABLE DDL statement for table.

Dashboards

The dashboards system table provides information about created dashboards (enterprise edition only) and contains the following columns:

Column Name

Column Type

Description

database_id

INTEGER

ID of database that contains the dashboard.

database_name

TEXT

Name of database that contains the dashboard.

dashboard_id

INTEGER

Dashboard ID.

dashboard_name

TEXT

Dashboard name.

owner_id

INTEGER

User ID of dashboard owner.

owner_user_name

TEXT

Username of dashboard owner.

last_updated_at

TIMESTAMP

Timestamp of last dashboard update.

data_sources

TEXT[]

List of data sources/tables used by dashboard.

Role Assignments

The role_assignments system table provides information about database roles that have been assigned to users and contains the following columns:

Column Name

Column Type

Description

role_name

TEXT

Name of assigned role.

user_name

TEXT

Username of user that was assigned the role.

Memory Summary

The memory_summary system table provides high level information about utilized memory across CPU and GPU devices and contains the following columns:

Column Name

Column Type

Description

node

TEXT

Node from which memory information is fetched.

device_id

INTEGER

Device ID.

device_type

TEXT

Type of device. Possible values are CPU and GPU.

max_page_count

BIGINT

Maximum number of memory pages that can be allocated on the device.

page_size

BIGINT

Size (in bytes) of a memory page on the device.

allocated_page_count

BIGINT

Number of allocated memory pages on the device.

used_page_count

BIGINT

Number of used allocated memory pages on the device.

free_page_count

BIGINT

Number of free allocated memory pages on the device.
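As a sketch, the following query estimates used and free GPU memory per device, in bytes, by combining the page-count and page-size columns described above:

SELECT node,
       device_id,
       used_page_count * page_size AS used_bytes,
       free_page_count * page_size AS free_bytes
FROM memory_summary
WHERE device_type = 'GPU';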

Memory Details

The memory_details system table provides detailed information about allocated memory segments across CPU and GPU devices and contains the following columns:

Column Name

Column Type

Description

node

TEXT

Node from which memory information is fetched.

database_id

INTEGER

ID of database that contains the table that memory was allocated for.

database_name

TEXT

Name of database that contains the table that memory was allocated for.

table_id

INTEGER

ID of table that memory was allocated for.

table_name

TEXT

Name of table that memory was allocated for.

column_id

INTEGER

ID of column that memory was allocated for.

column_name

TEXT

Name of column that memory was allocated for.

chunk_key

INTEGER[]

ID of cached table chunk.

device_id

INTEGER

Device ID.

device_type

TEXT

Type of device. Possible values are CPU and GPU.

memory_status

TEXT

Memory segment use status. Possible values are FREE and USED.

page_count

BIGINT

Number of pages in the segment.

page_size

BIGINT

Size (in bytes) of a memory page on the device.

slab_id

INTEGER

ID of slab containing memory segment.

start_page

BIGINT

Page number of the first memory page in the segment.

last_touched_epoch

BIGINT

Epoch at which the segment was last accessed.

Storage Details

The storage_details system table provides detailed information about utilized storage per table and contains the following columns:

Column Name

Column Type

Description

node

TEXT

Node from which storage information is fetched.

database_id

INTEGER

ID of database that contains the table.

database_name

TEXT

Name of database that contains the table.

table_id

INTEGER

Table ID.

table_name

TEXT

Table name.

epoch

INTEGER

Current table epoch.

epoch_floor

INTEGER

Minimum epoch table can be rolled back to.

fragment_count

INTEGER

Number of table fragments.

shard_id

INTEGER

Table shard ID. This value is only set for sharded tables.

data_file_count

INTEGER

Number of data files created for table.

metadata_file_count

INTEGER

Number of metadata files created for table.

total_data_file_size

BIGINT

Total size (in bytes) of data files.

total_data_page_count

BIGINT

Total number of pages across all data files.

total_free_data_page_count

BIGINT

Total number of free pages across all data files.

total_metadata_file_size

BIGINT

Total size (in bytes) of metadata files.

total_metadata_page_count

BIGINT

Total number of pages across all metadata files.

total_free_metadata_page_count

BIGINT

Total number of free pages across all metadata files.

total_dictionary_data_file_size

BIGINT

Total size (in bytes) of string dictionary files.
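
For example, the following sketch ranks tables by on-disk data size; the unit conversions are illustrative only:

SELECT table_name,
       SUM(total_data_file_size) / 1000000000.0 AS data_gb,
       SUM(total_metadata_file_size) / 1000000.0 AS metadata_mb
FROM storage_details
GROUP BY table_name
ORDER BY data_gb DESC;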

Log-Based System Tables

Log-based system tables are considered beta functionality in Release 6.1.0 and are disabled by default.

Request Logs

The request_logs system table provides information about HeavyDB Thrift API requests and contains the following columns:

Column Name

Column Type

Description

log_timestamp

TIMESTAMP

Timestamp of log entry.

severity

TEXT

Severity level of log entry. Possible values are F (fatal), E (error), W (warning), and I (info).

process_id

INTEGER

Process ID of the HeavyDB instance that generated the log entry.

query_id

INTEGER

ID associated with a SQL query. A value of 0 indicates that either the log entry is unrelated to a SQL query or no query ID has been set for the log entry.

thread_id

INTEGER

ID of thread that generated the log entry.

file_location

TEXT

Source file name and line number where the log entry was generated.

api_name

TEXT

Name of Thrift API that the request was sent to.

request_duration_ms

BIGINT

Thrift API request duration in milliseconds.

database_name

TEXT

Request session database name.

user_name

TEXT

Request session username.

public_session_id

TEXT

Request session ID.

query_string

TEXT

Query string for SQL query requests.

client

TEXT

Protocol and IP address of client making the request.

dashboard_id

INTEGER

Dashboard ID for SQL query requests coming from Immerse dashboards.

dashboard_name

TEXT

Dashboard name for SQL query requests coming from Immerse dashboards.

chart_id

INTEGER

Chart ID for SQL query requests coming from Immerse dashboards.

execution_time_ms

BIGINT

Execution time in milliseconds for SQL query requests.

total_time_ms

BIGINT

Total execution time (execution_time_ms + serialization time) in milliseconds for SQL query requests.
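
For example, assuming the log-based system tables are enabled and have been refreshed, a query like the following sketch summarizes request volume and average execution time per Thrift API:

SELECT api_name,
       COUNT(*) AS request_count,
       AVG(execution_time_ms) AS avg_execution_ms
FROM request_logs
WHERE execution_time_ms IS NOT NULL
GROUP BY api_name
ORDER BY avg_execution_ms DESC;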

Server Logs

The server_logs system table provides HeavyDB server logs in tabular form and contains the following columns:

Column Name

Column Type

Description

node

TEXT

Node containing logs.

log_timestamp

TIMESTAMP

Timestamp of log entry.

severity

TEXT

Severity level of log entry. Possible values are F (fatal), E (error), W (warning), and I (info).

process_id

INTEGER

Process ID of the HeavyDB instance that generated the log entry.

query_id

INTEGER

ID associated with a SQL query. A value of 0 indicates that either the log entry is unrelated to a SQL query or no query ID has been set for the log entry.

thread_id

INTEGER

ID of thread that generated the log entry.

file_location

TEXT

Source file name and line number where the log entry was generated.

message

TEXT

Log message.

Web Server Logs

The web_server_logs system table provides HEAVY.AI Web Server logs in tabular form and contains the following columns (Enterprise Edition only):

Column Name

Column Type

Description

log_timestamp

TIMESTAMP

Timestamp of log entry.

severity

TEXT

Severity level of log entry. Possible values are fatal, error, warning, and info.

message

TEXT

Log message.

Web Server Access Logs

Column Name

Column Type

Description

ip_address

TEXT

IP address of client making the web server request.

log_timestamp

TIMESTAMP

Timestamp of log entry.

http_method

TEXT

HTTP request method.

endpoint

TEXT

Web server request endpoint.

http_status

SMALLINT

HTTP response status code.

response_size

BIGINT

Response payload size in bytes.

Refreshing Logs System Tables

The logs system tables must be refreshed manually to view new log entries. You can run the REFRESH FOREIGN TABLES SQL command (for example, REFRESH FOREIGN TABLES server_logs, request_logs; ), or click the Refresh Data Now button on the table’s Data Manager page in Heavy Immerse.
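
For example, the following sketch refreshes both log tables and then counts error-level entries; the severity filter is illustrative only:

REFRESH FOREIGN TABLES server_logs, request_logs;
SELECT COUNT(*) AS error_count FROM server_logs WHERE severity = 'E';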

Request Logs and Monitoring System Dashboard

The Request Logs and Monitoring system dashboard is built on the log-based system tables and provides visualization of request counts, performance, and errors over time, along with the server logs.

System Dashboards

Access to system dashboards is controlled using Heavy Immerse privileges; only users with Admin privileges or users/roles with access to the information_schema database can access the system dashboards.

Cross-linking must be enabled to allow cross-filtering across charts that use different system tables. Enable cross-linking by adding "ui/enable_crosslink_panel": true to the feature_flags section of the servers.json file.

Tables

DDL - Tables

These functions are used to create and modify data tables in HEAVY.AI.

Nomenclature Constraints

[A-Za-z_][A-Za-z0-9\$_]*

Table and column names can include quotes, spaces, and the underscore character. Other special characters are permitted if the name of the table or column is enclosed in double quotes (" ").

  • Spaces and special characters other than underscore (_) cannot be used in Heavy Immerse.

  • Column and table names enclosed in double quotes cannot be used in Heavy Immerse
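
For example, the following sketch (hypothetical table and column names) shows that double-quoted names can contain spaces, with the caveat that such names cannot be used in Heavy Immerse:

CREATE TABLE "sales 2023" ("region id" INTEGER, amount DOUBLE);
SELECT "region id", amount FROM "sales 2023";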

CREATE TABLE

CREATE [TEMPORARY] TABLE [IF NOT EXISTS] <table>
  (<column> <type> [NOT NULL] [DEFAULT <value>] [ENCODING <encodingSpec>],
  [SHARD KEY (<column>)],
  [SHARED DICTIONARY (<column>) REFERENCES <table>(<column>)], ...)
  [WITH (<property> = value, ...)];

Create a table named <table> specifying <columns> and table properties.

Supported Datatypes

Datatype

Size (bytes)

Notes

BIGINT

8

Minimum value: -9,223,372,036,854,775,807; maximum value: 9,223,372,036,854,775,807.

BOOLEAN

1

TRUE: 'true', '1', 't'. FALSE: 'false', '0', 'f'. Text values are not case-sensitive.

DATE*

4

Same as DATE ENCODING DAYS(32).

DATE ENCODING DAYS(32)

4

Range in years: +/-5,883,517 around epoch. Maximum date January 1, 5885487 (approximately). Minimum value: -2,147,483,648; maximum value: 2,147,483,647. Supported formats when using COPY FROM: mm/dd/yyyy, dd-mmm-yy, yyyy-mm-dd, dd/mmm/yyyy.

DATE ENCODING DAYS(16)

2

Range in days: -32,768 to 32,767. Range in years: +/-90 around epoch (April 14, 1880 - September 9, 2059). Minimum value: -2,831,155,200; maximum value: 2,831,068,800. Supported formats when using COPY FROM: mm/dd/yyyy, dd-mmm-yy, yyyy-mm-dd, dd/mmm/yyyy.

DATE ENCODING FIXED(32)

4

In DDL statements defaults to DATE ENCODING DAYS(16). Deprecated.

DATE ENCODING FIXED(16)

2

In DDL statements defaults to DATE ENCODING DAYS(16). Deprecated.

DECIMAL

2, 4, or 8

Takes precision and scale parameters: DECIMAL(precision,scale).

Size depends on precision:

  • Up to 4: 2 bytes

  • 5 to 9: 4 bytes

  • 10 to 18 (maximum): 8 bytes

Scale must be less than precision.

DOUBLE

8

Variable precision. Minimum value: -1.79 x 10^308; maximum value: 1.79 x 10^308.

FLOAT

4

Variable precision. Minimum value: -3.4 x 10^38; maximum value: 3.4 x 10^38.

INTEGER

4

Minimum value: -2,147,483,647; maximum value: 2,147,483,647.

SMALLINT

2

Minimum value: -32,767; maximum value: 32,767.

TEXT ENCODING DICT

4

Maximum cardinality of 2 billion distinct string values.

TEXT ENCODING NONE

Variable

Size of the string + 6 bytes

TIME

8

Minimum value: 00:00:00; maximum value: 23:59:59.

TIMESTAMP

8

Linux timestamp from -30610224000 (1/1/1000 00:00:00.000) through 29379542399 (12/31/2900 23:59:59.999).

Can also be inserted and stored in human-readable format:

  • YYYY-MM-DD HH:MM:SS

  • YYYY-MM-DDTHH:MM:SS (The T is dropped when the field is populated.)

TINYINT

1

Minimum value: -127; maximum value: 127.

Examples

Create a table named tweets and specify the columns, including type, in the table.

CREATE TABLE IF NOT EXISTS tweets (
   tweet_id BIGINT NOT NULL,
   tweet_time TIMESTAMP NOT NULL ENCODING FIXED(32),
   lat FLOAT,
   lon FLOAT,
   sender_id BIGINT NOT NULL,
   sender_name TEXT NOT NULL ENCODING DICT,
   location TEXT ENCODING  DICT,
   source TEXT ENCODING DICT,
   reply_to_user_id BIGINT,
   reply_to_tweet_id BIGINT,
   lang TEXT ENCODING  DICT,
   followers INT,
   followees INT,
   tweet_count INT,
   join_time TIMESTAMP ENCODING  FIXED(32),
   tweet_text TEXT,
   state TEXT ENCODING  DICT,
   county TEXT ENCODING DICT,
   place_name TEXT,
   state_abbr TEXT ENCODING DICT,
   county_state TEXT ENCODING DICT,
   origin TEXT ENCODING DICT,
   phone_numbers bigint);

Create a table named delta and assign a default value San Francisco to column city.

CREATE TABLE delta (
   id INTEGER NOT NULL, 
   name TEXT NOT NULL, 
   city TEXT NOT NULL DEFAULT 'San Francisco' ENCODING DICT(16));

Default values currently have the following limitations:

  • Only literals can be used for column DEFAULT values; expressions are not supported.

  • You cannot define a DEFAULT value for a shard key. For example, the following does not parse: CREATE TABLE tbl (id INTEGER NOT NULL DEFAULT 0, name TEXT, shard key (id)) with (shard_count = 2);

  • For arrays, use the following syntax: ARRAY[A, B, C, ..., N]

    The syntax {A, B, C, ... N} is not supported.

  • Some literals, like NUMERIC and GEO types, are not checked at parse time. As a result, you can define and create a table with malformed literal as a default value, but when you try to insert a row with a default value, it will throw an error.
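
For example, the following sketch (hypothetical table and column names) defines array defaults using the supported ARRAY[...] literal syntax:

CREATE TABLE sensor_readings (
   id INTEGER NOT NULL,
   tags TEXT[] DEFAULT ARRAY['none'],
   calibration DOUBLE[2] DEFAULT ARRAY[0.0, 1.0]);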

Supported Encoding

Encoding

Descriptions

DICT

Dictionary encoding on string columns (default for TEXT columns). Limit of 2 billion unique string values.

FIXED (bits)

Fixed-length encoding of integer or timestamp columns.

NONE

No encoding. Valid only on TEXT columns. No Dictionary is created. Aggregate operations are not possible on this column type.

WITH Clause Properties

Property

Description

fragment_size

Number of rows per fragment that is a unit of the table for query processing. Default: 32 million rows, which is not expected to be changed.

max_rollback_epochs

Limit the number of epochs a table can be rolled back to. Limiting the number of epochs helps to limit the amount of on-disk data and prevent unmanaged data growth.

Limiting the number of rollback epochs also can increase system startup speed, especially for systems on which data is added in small batches or singleton inserts. Default: 3.

The following example creates the table test_table and sets the maximum epoch rollback number to 50:

CREATE TABLE test_table(a int) WITH (MAX_ROLLBACK_EPOCHS = 50);

max_rows

Used primarily for streaming datasets to limit the number of rows in a table, to avoid running out of memory or impeding performance. When the max_rows limit is reached, the oldest fragment is removed. When populating a table from a file, make sure that your row count is below the max_rows setting. If you attempt to load more rows at one time than the max_rows setting defines, the records up to the max_rows limit are removed, leaving only the additional rows. Default: 2^62. In a distributed system, the maximum number of rows is calculated as max_rows * leaf_count. In a sharded distributed system, the maximum number of rows is calculated as max_rows * shard_count.

page_size

Number of I/O page bytes. Default: 1MB, which does not need to be changed.

partitions

Partition strategy option:

  • SHARDED: Partition table using sharding.

  • REPLICATED: Partition table using replication.

shard_count

Number of shards to create, typically equal to the number of GPUs across which the data table is distributed.

sort_column

Name of the column on which to sort during bulk import.
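
For example, the following sketch (hypothetical table; property values are illustrative only) combines several WITH properties in a single CREATE TABLE statement:

CREATE TABLE events (
   device_id INTEGER,
   event_time TIMESTAMP,
   payload TEXT ENCODING DICT)
  WITH (fragment_size = 16000000, max_rows = 100000000, sort_column = 'event_time');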

Sharding

Sharding partitions a database table across multiple servers so each server has a part of the table with the same columns but with different rows. Partitioning is based on a sharding key defined when you create the table.

Without sharding, the dimension tables involved in a join are replicated and sent to each GPU, which is not feasible for dimension tables with many rows. Specifying a shard key makes it possible for the query to execute efficiently on large dimension tables.

Currently, specifying a shard key is useful only for joins:

  • If two tables specify a shard key with the same type and the same number of shards, a join on that key only sends a part of the dimension table column data to each GPU.

  • For multi-node installs, the dimension table does not need to be replicated and the join executes locally on each leaf.

Constraints

  • A shard key must specify a single column to shard on. There is no support for sharding by a combination of keys.

  • One shard key can be specified for a table.

  • Data are partitioned according to the shard key and the number of shards (shard_count).

  • A value in the column specified as a shard key is always sent to the same partition.

  • The number of shards should be equal to the number of GPUs in the cluster.

  • Sharding is allowed on the following column types:

    • DATE

    • INT

    • TEXT ENCODING DICT

    • TIME

    • TIMESTAMP

  • Tables must share the dictionary for the column to be involved in sharded joins. If the dictionary is not specified as shared, the join does not take advantage of sharding. Dictionaries are reference-counted and only dropped when the last reference drops.

Recommendations

  • Set shard_count to the number of GPUs you eventually want to distribute the data table across.

  • Referenced tables must also be shard_count-aligned.

  • Sharding should be minimized because it can introduce load skew across resources, compared to when sharding is not used.

Examples

Basic sharding:

CREATE TABLE  customers(
   accountId text,
   name text,
   SHARD KEY (accountId))
  WITH (shard_count = 4);

Sharding with shared dictionary:

CREATE TABLE transactions(
   accountId text,
   action text,
   SHARD KEY (accountId),
   SHARED DICTIONARY (accountId) REFERENCES customers(accountId))
  WITH (shard_count = 4);

Temporary Tables

Using the TEMPORARY argument creates a table that persists only while the server is live. Temporary tables are useful for storing intermediate result sets that you access more than once.

Adding or dropping a column from a temporary table is not supported.

Example

CREATE TEMPORARY TABLE customers(
   accountId TEXT,
   name TEXT,
   timeCreated TIMESTAMP);

CREATE TABLE AS SELECT

CREATE TABLE [IF NOT EXISTS] <newTableName> AS (<SELECT statement>) [WITH (<property> = value, ...)];

Create a table with the specified columns, copying any data that meet SELECT statement criteria.

WITH Clause Properties

Property

Description

fragment_size

Number of rows per fragment that is a unit of the table for query processing. Default = 32 million rows, which is not expected to be changed.

max_chunk_size

Size of chunk that is a unit of the table for query processing. Default: 1073741824 bytes (1 GB), which is not expected to be changed.

max_rows

Used primarily for streaming datasets to limit the number of rows in a table. When the max_rows limit is reached, the oldest fragment is removed. When populating a table from a file, make sure that your row count is below the max_rows setting. If you attempt to load more rows at one time than the max_rows setting defines, the records up to the max_rows limit are removed, leaving only the additional rows. Default = 2^62.

page_size

Number of I/O page bytes. Default = 1MB, which does not need to be changed.

partitions

Partition strategy option:

  • SHARDED: Partition table using sharding.

  • REPLICATED: Partition table using replication.

use_shared_dictionaries

Controls whether the created table creates its own dictionaries for text columns, or instead shares the dictionaries of its source table. Uses shared dictionaries by default (true), which increases the speed of table creation.

Setting to false shrinks the dictionaries if SELECT for the created table has a narrow filter; for example: CREATE TABLE new_table AS SELECT * FROM old_table WITH (USE_SHARED_DICTIONARIES='false');

vacuum

Formats the table to more efficiently handle DELETE requests. The only parameter available is delayed. Rather than immediately remove deleted rows, vacuum marks items to be deleted, and they are removed at an optimal time.
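
The following sketch (hypothetical source table and filter) combines the vacuum and use_shared_dictionaries properties in a single CREATE TABLE AS SELECT statement:

CREATE TABLE active_customers AS (SELECT * FROM customers WHERE status = 'active')
  WITH (vacuum = 'delayed', use_shared_dictionaries = 'false');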

Examples

Create the table newTable. Populate the table with all information from the table oldTable, effectively creating a duplicate of the original table.

CREATE TABLE newTable AS (SELECT * FROM oldTable);

Create a table named trousers. Populate it with data from the columns name, waist, and inseam from the table wardrobe.

CREATE TABLE trousers AS (SELECT name, waist, inseam FROM wardrobe);

Create a table named cosmos. Populate it with data from the columns star and planet from the table universe where planet has the class M.

CREATE TABLE IF NOT EXISTS cosmos AS (SELECT star, planet FROM universe WHERE class='M');

ALTER TABLE

ALTER TABLE <table> RENAME TO <table>;
ALTER TABLE <table> RENAME COLUMN <column> TO <column>;
ALTER TABLE <table> ADD [COLUMN] <column> <type> [NOT NULL] [ENCODING <encodingSpec>];
ALTER TABLE <table> ADD (<column> <type> [NOT NULL] [ENCODING <encodingSpec>], ...);
ALTER TABLE <table> ADD (<column> <type> DEFAULT <value>);
ALTER TABLE <table> DROP COLUMN <column_1>[, <column_2>, ...];
ALTER TABLE <table> SET MAX_ROLLBACK_EPOCHS=<value>;
ALTER TABLE <table> ALTER COLUMN <column> TYPE <type>, ALTER COLUMN <column> TYPE <type>, ...;

Examples

Rename the table tweets to retweets.

ALTER TABLE tweets RENAME TO retweets;

Rename the column source to device in the table retweets.

ALTER TABLE retweets RENAME COLUMN source TO device;

Add the column pt_dropoff to table tweets with a default value point(0,0).

ALTER TABLE tweets ADD COLUMN pt_dropoff POINT DEFAULT 'point(0 0)';

Add multiple columns a, b, and c to table table_one with a default value of 15 for column b.

ALTER TABLE table_one ADD a INTEGER, b INTEGER NOT NULL DEFAULT 15, c TEXT;

Default values currently have the following limitations:

  • Only literals can be used for column DEFAULT values; expressions are not supported.

  • For arrays, use the following syntax: ARRAY[A, B, C, ..., N]. The syntax {A, B, C, ... N} is not supported.

  • Some literals, like NUMERIC and GEO types, are not checked at parse time. As a result, you can define and create a table with a malformed literal as a default value, but when you try to insert a row with a default value, it throws an error.

Add the column lang to the table tweets using a TEXT ENCODING DICTIONARY.

ALTER TABLE tweets ADD COLUMN lang TEXT ENCODING DICT;

Add the columns lang and encode to the table tweets using a TEXT ENCODING DICTIONARY for each.

ALTER TABLE tweets ADD (lang TEXT ENCODING DICT, encode TEXT ENCODING DICT);

Drop the column pt_dropoff from table tweets.

ALTER TABLE tweets DROP COLUMN pt_dropoff;

Limit on-disk data growth by setting the number of allowed epoch rollbacks to 50:

ALTER TABLE test_table SET MAX_ROLLBACK_EPOCHS=50;
  • You cannot add a dictionary-encoded string column with a shared dictionary when using ALTER TABLE ADD COLUMN.

  • Currently, HEAVY.AI does not support adding a geo column type (POINT, LINESTRING, POLYGON, or MULTIPOLYGON) to a table.

  • HEAVY.AI supports ALTER TABLE RENAME TABLE and ALTER TABLE RENAME COLUMN for temporary tables. HEAVY.AI does not support ALTER TABLE ADD COLUMN to modify a temporary table.

Change a text column “id” to an integer column:

ALTER TABLE my_table ALTER COLUMN id TYPE INTEGER;

Change text columns “id” and “location” to big integer and point columns respectively:

ALTER TABLE my_table ALTER COLUMN id TYPE BIGINT, ALTER COLUMN location TYPE GEOMETRY(POINT, 4326);

Currently, only text column types (dictionary encoded and none encoded text columns) can be altered.

DROP TABLE

DROP TABLE [IF EXISTS] <table>;

Example

DROP TABLE IF EXISTS tweets;

DUMP TABLE

DUMP TABLE <table> TO '<filepath>' [WITH (COMPRESSION='<compression_program>')];

Archives data and dictionary files of the table <table> to file <filepath>.

Valid values for <compression_program> include:

  • gzip (default)

  • pigz

  • lz4

  • none

If you do not choose a compression option, the system uses gzip if it is available. If gzip is not installed, the file is not compressed.

The file path must be enclosed in single quotes.

  • Dumping a table locks writes to that table. Concurrent reads are supported, but you cannot import to a table that is being dumped.

  • The DUMP command is not supported on distributed configurations.

  • You must have at least GRANT CREATE ON DATABASE privilege level to use the DUMP command.

Example

DUMP TABLE tweets TO '/opt/archive/tweetsBackup.gz' WITH (COMPRESSION='gzip');

RENAME TABLE

RENAME TABLE <table> TO <table>[, <table> TO <table>, <table> TO <table>...];

Rename a table or multiple tables at once.

Examples

Rename a single table:

RENAME TABLE table_A TO table_B;

Swap table names:

RENAME TABLE table_A TO table_B, table_B TO table_A;

RENAME TABLE table_A TO table_B, table_B TO table_C, table_C TO table_A;

Swap table names multiple times:

RENAME TABLE table_A TO table_A_stale, table_B TO table_B_stale, table_A_new TO table_A, table_B_new TO table_B;

RESTORE TABLE

RESTORE TABLE <table> FROM '<filepath>' [WITH (COMPRESSION='<compression_program>')];

Restores data and dictionary files of table <table> from the file at <filepath>. If you specified a compression program when you used the DUMP TABLE command, you must specify the same compression method during RESTORE.

Restoring a table decompresses and then reimports the table. You must have enough disk space for both the new table and the archived table, as well as enough scratch space to decompress the archive and reimport it.

The file path must be enclosed in single quotes.

You can also restore a table from archives stored in S3-compatible endpoints:

RESTORE TABLE <table> FROM '<S3_file_URL>' 
  WITH (compression = '<compression_program>', 
        s3_region = '<region>', 
        s3_access_key = '<access_key>', 
        s3_secret_key = '<secret_key>', 
        s3_session_token = '<session_token>');
  • Restoring a table locks writes to that table. Concurrent reads are supported, but you cannot import to a table that is being restored.

  • The RESTORE command is not supported on distributed configurations.

  • You must have at least GRANT CREATE ON DATABASE privilege level to use the RESTORE command.

Do not attempt to use RESTORE TABLE with a table dump created using a release of HEAVY.AI that is higher than the release running on the server where you will restore the table.

Examples

Restore table tweets from /opt/archive/tweetsBackup.gz:

RESTORE TABLE tweets FROM '/opt/archive/tweetsBackup.gz' 
   WITH (COMPRESSION='gzip');

Restore table tweets from a public S3 file or using server privileges (with the allow-s3-server-privileges server flag enabled):

RESTORE TABLE tweets FROM 's3://my-s3-bucket/archive/tweetsBackup.gz'
   WITH (compression = 'gzip', 
      s3_region = 'us-east-1');

Restore table tweets from a private S3 file using AWS access keys:

RESTORE TABLE tweets FROM 's3://my-s3-bucket/archive/tweetsBackup.gz'
   WITH (compression = 'gzip', 
      s3_region = 'us-east-1', 
      s3_access_key = 'xxxxxxxxxx', s3_secret_key = 'yyyyyyyyy');

Restore table tweets from a private S3 file using temporary AWS access keys/session token:

RESTORE TABLE tweets FROM 's3://my-s3-bucket/archive/tweetsBackup.gz' 
   WITH (compression = 'gzip', 
      s3_region = 'us-east-1', 
      s3_access_key = 'xxxxxxxxxx', s3_secret_key = 'yyyyyyyyy',
      s3_session_token = 'zzzzzzzz');

Restore table tweets from an S3-compatible endpoint:

RESTORE TABLE tweets FROM 's3://my-gcp-bucket/archive/tweetsBackup.gz'
   WITH (compression = 'gzip', 
      s3_region = 'us-east-1', 
      s3_endpoint = 'storage.googleapis.com');

TRUNCATE TABLE

TRUNCATE TABLE <table>;

Use the TRUNCATE TABLE statement to remove all rows from a table without deleting the table structure.

Example

TRUNCATE TABLE tweets;

When you DROP or TRUNCATE, the command returns almost immediately. The directories to be purged are marked with the suffix _DELETE_ME_. The files are automatically removed asynchronously.

In practical terms, this means that you will not see a reduction in disk usage until the automatic task runs, which might not start for up to five minutes.

You might also see directory names appended with _DELETE_ME_. You can ignore these, with the expectation that they will be deleted automatically over time.

OPTIMIZE TABLE

OPTIMIZE TABLE [<table>] [WITH (VACUUM='true')]

Use this statement to remove rows from storage that have been marked as deleted via DELETE statements.

When run without the vacuum option, the column-level metadata is recomputed for each column in the specified table. HeavyDB makes heavy use of metadata to optimize query plans, so optimizing table metadata can increase query performance after metadata widening operations such as updates or deletes. If the configuration parameter enable-auto-metadata-update is not set, HeavyDB does not narrow metadata during an update or delete — metadata is only widened to cover a new range.

When run with the vacuum option, it removes any rows marked "deleted" from the data stored on disk. Vacuum is a checkpointing operation, so new copies of any vacuum records are deleted. Using OPTIMIZE with the VACUUM option compacts pages and deletes unused data files that have not been repopulated.

Beginning with Release 5.6.0, OPTIMIZE should be used infrequently, because UPDATE, DELETE, and IMPORT queries manage space more effectively.
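
For example, using the tweets table from earlier examples, the first statement recomputes column-level metadata only, and the second also removes rows marked as deleted:

OPTIMIZE TABLE tweets;
OPTIMIZE TABLE tweets WITH (VACUUM='true');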

VALIDATE

VALIDATE

Performs checks for negative and inconsistent epochs across table shards for single-node configurations.

If VALIDATE detects epoch-related issues, it returns a report similar to the following:

heavysql> validate;
Result

Negative epoch value found for table "my_table". Epoch: -1.
Epoch values for table "my_table_2" are inconsistent:
Table Id  Epoch     
========= ========= 
4         1         
5         2

If no issues are detected, it reports as follows:

Instance OK

VALIDATE CLUSTER

VALIDATE CLUSTER [WITH (REPAIR_TYPE = ['NONE' | 'REMOVE'])];

Perform checks and report discovered issues on a running HEAVY.AI cluster. Compare metadata between the aggregator and leaves to verify that the logical components between the processes are identical.

VALIDATE CLUSTER also detects and reports issues related to table epochs. It reports when epochs are negative or when table epochs across leaf nodes or shards are inconsistent.

Examples

If VALIDATE CLUSTER detects issues, it returns a report similar to the following:

[mapd@thing3 ~]$ /mnt/gluster/dist_mapd/mapd-sw2/bin/mapdql -p HyperInteractive
User admin connected to database heavyai
heavysql> validate cluster;
Result
 Node          Table Count 
 ===========   =========== 
 Aggregator     1116
 Leaf 0         1114
 Leaf 1         1114
No matching table on Leaf 0 for Table cities_dtl_POINTS table id 56
No matching table on Leaf 1 for Table cities_dtl_POINTS table id 56
No matching table on Leaf 0 for Table cities_dtl table id 80
No matching table on Leaf 1 for Table cities_dtl table id 80
Table details don't match on Leaf 0 for Table view_geo table id 95
Table details don't match on Leaf 1 for Table view_geo table id 95

If no issues are detected, it will report as follows:

Cluster OK

You can include the WITH(REPAIR_TYPE) argument. (REPAIR_TYPE='NONE') is the same as running the command with no argument. (REPAIR_TYPE='REMOVE') removes any leaf objects that have issues. For example:

VALIDATE CLUSTER WITH (REPAIR_TYPE = 'REMOVE');

Epoch Issue Example

This example output from the VALIDATE CLUSTER command on a distributed setup shows epoch-related issues:

heavysql> validate cluster;
Result

Negative epoch value found for table "my_table". Epoch: -16777216.
Epoch values for table "my_table_2" are inconsistent:
Node      Table Id  Epoch     
========= ========= ========= 
Leaf 0    4         1         
Leaf 1    4         2

SHOW

Use SHOW commands to get information about databases, tables, and user sessions.

SHOW CREATE SERVER

Shows the CREATE SERVER statement that could have been used to create the server.

Syntax

Example

SHOW CREATE TABLE

Shows the CREATE TABLE statement that could have been used to create the table.

Syntax

Example

SHOW DATABASES

Retrieve the databases accessible for the current user, showing the database name and owner.

Example

SHOW FUNCTIONS

Show registered compile-time UDFs and extension functions in the system and their arguments.

Syntax

Example

SHOW POLICIES

Displays a list of all row-level security (RLS) policies that exist for a user or role; admin rights are required. If EFFECTIVE is used, the list also includes any policies that exist for all roles that apply to the requested user or role.

Syntax

SHOW QUERIES

Returns a list of queued queries in the system; information includes session ID, status, query string, account login name, client address, database name, and device type (CPU or GPU).

Example

Admin users can see and interrupt all queries; non-admin users can see and interrupt only their own queries.

NOTE: SHOW QUERIES is only available if the runtime query interrupt parameter (enable-runtime-query-interrupt) is set.

SHOW ROLES

If included with a name, lists the roles granted directly to a user or role. SHOW EFFECTIVE ROLES with a name lists the roles directly granted to a user or role, and also lists the roles indirectly inherited through the directly granted roles.

Syntax

If the user name or role name is omitted, then a regular user sees their own roles, and a superuser sees a list of all roles existing in the system.

SHOW RUNTIME FUNCTIONS

Show user-defined runtime functions and table functions.

Syntax

SHOW SUPPORTED DATA SOURCES

Show data connectors.

Syntax

SHOW TABLE DETAILS

Displays storage-related information for a table, such as the table ID/name, number of data/metadata files used by the table, total size of data/metadata files, and table epoch values.

You can see table details for all tables that you have access to in the current database, or for only those tables you specify.

Syntax

Examples

Show details for all tables you have access to:

Show details for table omnisci_states:

The number of columns returned includes system columns. As a result, the number of columns in column_count can be up to two greater than the number of columns created by the user.

SHOW TABLE FUNCTIONS

Displays the list of available system (built-in) table functions.

SHOW TABLE FUNCTIONS DETAILS

Show detailed output information for the specified table function. Output details vary depending on the table function specified.

Syntax

Example - generate_series

View SHOW output for the generate_series table function:

SHOW SERVERS

Retrieve the servers accessible for the current user.

Example

SHOW TABLES

Retrieve the tables accessible for the current user.

Example

SHOW USER DETAILS

Lists name, ID, and default database for all or specified users for the current database. If the command is issued by a superuser, login permission status is also shown. Only superusers see users who do not have permission to log in.

Example

SHOW [ALL] USER DETAILS lists name, ID, superuser status, default database, and login permission status for all users across the HeavyDB instance. This variant of the command is available only to superusers. Regular users who run the SHOW ALL USER DETAILS command receive an error message.

Superuser Output

Show all user details for all users:

Show all user details for specified users ue, ud, ua, and uf:

If a specified user is not found, the superuser sees an error message:

Show user details for specified users ue, ud, and uf:

Show user details for all users:

Non-Superuser Output

Running SHOW ALL USER DETAILS results in an error message:

Show user details for all users:

If a specified user is not found, the user sees an error message:

Show user details for user ua:

SHOW USER SESSIONS

Retrieve all persisted user sessions, showing the session ID, user login name, client address, and database name. Admin or superuser privileges required.

KILL QUERY

Interrupt a queued query. Specify the query by using its session ID.

To interrupt the last query in the list (ID 946-ooNP):

Showing the queries again indicates that 946-ooNP has been deleted:

  • KILL QUERY is only available if the runtime query interrupt parameter (enable-runtime-query-interrupt) is set.

  • Interrupting a query in ‘PENDING_QUEUE’ status is supported in both distributed and single-server mode.

  • To enable query interrupt for tables imported from data files in local storage, set enable_non_kernel_time_query_interrupt to TRUE. (It is enabled by default.)

INSERT

Examples

You can also insert into a table as SELECT, as shown in the following examples:

You can insert array literals into array columns. The inserts in the following example each have three array values, and demonstrate how you can:

  • Create a table with variable-length and fixed-length array columns.

  • Insert NULL arrays into these columns.

  • Specify and insert array literals using {...} or ARRAY[...] syntax.

  • Insert empty variable-length arrays using {} and ARRAY[] syntax.

  • Insert array values that contain NULL elements.

Default Values

If you omit the name column from an INSERT or INSERT FROM SELECT statement, the missing value for column name is set to 'John Doe'.

INSERT INTO tbl (id, age) VALUES (1, 36); creates the record 1|'John Doe'|36.

INSERT INTO tbl (id, age) SELECT id, age FROM old_tbl; also sets all the name values to 'John Doe'.

LIKELY/UNLIKELY

Usage Notes

SQL normally assumes that terms in the WHERE clause that cannot be used by indices are usually true. If this assumption is incorrect, it could lead to a suboptimal query plan. Use the LIKELY(X) and UNLIKELY(X) SQL functions to provide hints to the query planner about clause terms that are probably not true, which helps the query planner to select the best possible plan.

Use LIKELY/UNLIKELY to optimize evaluation of OR/AND logical expressions. LIKELY/UNLIKELY causes the left side of an expression to be evaluated first. This allows the right side of the query to be skipped when possible. For example, in the clause UNLIKELY(A) AND B, if A evaluates to FALSE, B does not need to be evaluated.

Consider the following:

If x is one of the values 7, 8, 9, or 10, the filter y > 42 is applied. If x is not one of those values, the filter y > 42 is not applied.

EXPLAIN

Shows generated Intermediate Representation (IR) code, identifying whether it is executed on GPU or CPU. This is primarily used internally by HEAVY.AI to monitor behavior.

For example, when you use the EXPLAIN command on a basic statement, the utility returns 90 lines of IR code that is not meant to be human readable. However, at the top of the listing, a heading indicates whether it is IR for the CPU or IR for the GPU, which can be useful to know in some situations.

EXPLAIN CALCITE

Returns a relational algebra tree describing the high-level plan to execute the statement.

The table below lists the relational algebra classes used to describe the execution plan for a SQL statement.

For example, a SELECT statement is described as a table scan and projection.

If you add a sort order, the table projection is folded under a LogicalSort procedure.

When the SQL statement is simple, the EXPLAIN CALCITE version is actually less “human readable.” EXPLAIN CALCITE is more useful when you work with more complex SQL statements, like the one that follows. This query performs a scan on the BOOK table before scanning the BOOK_ORDER table.

Revising the original SQL command results in a more natural selection order and a more performant query.

EXPLAIN CALCITE DETAILED

Augments the EXPLAIN CALCITE command by adding details about referenced columns in the query plan.

For example, for the following EXPLAIN CALCITE command execution:

EXPLAIN CALCITE DETAILED adds more column details as seen below:

Each HEAVY.AI datatype uses space in memory and on disk. For certain datatypes, you can use fixed encoding for a more compact representation of these values. You can set a default value for a column by using the DEFAULT constraint; for more information, see .

Note: Importing TEXT ENCODING NONE fields using the has limitations for Immerse. When you use string instead of string [dict. encode] for a column when importing, you cannot use that column in Immerse dashboards.

[2] - See and below for information about geospatial datatype sizes.

For more information about geospatial datatypes and functions, see .

View object names must use the NAME format, described in notation as:

To avoid this error, use the heavysql command \cpu to put your HEAVY.AI server in CPU mode before using the COPY TO command. See .

If there is a potential for duplicate entries, and you want to avoid loading duplicate rows, see on the Troubleshooting page.

This option is available only if the optional is installed; otherwise invoking the option throws an error.

An ESRI file geodatabase can have multiple layers, and importing it results in the creation of one table for each layer in the file. This behavior differs from that of importing shapefiles, GeoJSON, or KML files, which results in a single table. For more information, see .

For more information about importing specific geo file formats, see .

Use the same syntax that you would for , depending on the file source.

Allows specification of one or more band names to selectively import; useful in the context of large raster files where not all the bands are relevant. Bands are imported in the order provided, regardless of order in the file. You can rename bands using <bandname>=<newname>[,<bandname>=<newname,...>] Names must be those discovered by the , including any suffixes for de-duplication.

For information about using ODBC HeavyConnect, see .

For more information on creating regex transformation statements, see .

Access key and secret key, or session token if using temporary credentials, and region are required. For information about AWS S3 credentials, see .

For information about interoperability and setup for Google Cloud Services, see .

The following examples show failed and successful attempts to copy the table from AWS S3.

KafkaImporter requires a functioning Kafka cluster. See the and the .

The following is a straightforward import command. For more information on options and parameters for using Apache Sqoop, see the user guide at .

For more information about users, roles, and privileges, see .

The following are naming convention requirements for HEAVY.AI objects, described in notation:

See in for a database security example.

Choose whether to import from a local file or an Amazon S3 instance. For details on importing from Amazon S3, see .

You can import spatial representations in format. WKT is a text markup language for representing vector geometry objects on a map, spatial reference systems of spatial objects, and transformations between spatial reference systems.

HEAVY.AI accepts data with any SRID, or with no SRID. HEAVY.AI supports SRID 4326 (), and allows projections from SRID 4326 to SRID 900913 (Google Web Mercator). Geometries declared with SRID 4326 are compressed by default, and can be rendered and used to calculate geodesic distance. Geometries declared with any other SRID, or no SRID, are treated as planar geometries; the SRIDs are ignored.

An ESRI file geodatabase can have multiple layers, and importing it results in the creation of one table for each layer in the file. This behavior differs from that of importing shapefiles, GeoJSON, or KML files, which results in a single table. See for more information.

Rendering of geo LINESTRING, MULTILINESTRING, POLYGON, and MULTIPOLYGON is possible only with data stored in the default lon/lat WGS84 (SRID 4326) format, although the type and encoding are flexible. Unless compression is explicitly disabled (NONE), all SRID 4326 geometries are compressed. For more information, see.

The web_server_access_logs system table provides information about requests made to the Web Server. The table contains the following columns:

Preconfigured are built on various system tables. Specifically, two dashboards named System Resources and User Roles and Permissions are available by default. The Request Logs and Monitoring system dashboard is considered beta functionality and disabled by default. These dashboards can be found in the information_schema database, along with the system tables that they use.

Table names must use the NAME format, described in notation as:

* In HEAVY.AI release 4.4.0 and higher, you can use existing 8-byte DATE columns, but you can create only 4-byte DATE columns (default) and 2-byte DATE columns (see ).

For more information, see .

For geospatial datatypes, see .

Fixed length encoding of integer or timestamp columns. See .

Deletes the table structure, all data from the table, and any dictionary content unless it is a shared dictionary. (See the Note regarding .)

s3_region is required. All features discussed in , such as custom S3 endpoints and server privileges, are supported.

This releases table on-disk and memory storage and removes dictionary content unless it is a shared dictionary. (See the note regarding .)

Removing rows is more efficient than using DROP TABLE. Dropping followed by recreating the table invalidates dependent objects of the table, requiring you to regrant object privileges. Truncating has none of these effects.

To interrupt a query in the queue, see .

For more information, see .

Output Header
Output Details

To see the queries in the queue, use the command:

Use INSERT for both single- and multi-row ad hoc inserts. (When inserting many rows, use the more efficient command.)

If you create a table with a column that has a default value, or alter a table to add a column with a default value, using the INSERT command creates a record that includes the default value if it is omitted from the INSERT. For example, assume a table created as follows:

SHOW CREATE SERVER <servername>
SHOW CREATE SERVER default_local_delimited;
create_server_sql
CREATE SERVER default_local_delimited FOREIGN DATA WRAPPER DELIMITED_FILE
WITH (STORAGE_TYPE='LOCAL_FILE');
SHOW CREATE TABLE <tablename>
SHOW CREATE TABLE heavyai_states;
CREATE TABLE heavyai_states (
 id TEXT ENCODING DICT(32),
 abbr TEXT ENCODING DICT(32),
 name TEXT ENCODING DICT(32),
 omnisci_geo GEOMETRY(MULTIPOLYGON, 4326
) NOT NULL);
SHOW DATABASES
Database         Owner
omnisci          admin
2004_zipcodes    admin
game_results     jane
signals          jason
...
SHOW FUNCTIONS [DETAILS]
SHOW FUNCTIONS
Scalar UDF
distance_point_line
ST_DWithin_Polygon_Polygon
ST_Distance_Point_ClosedLineString
Truncate
ct_device_selection_udf_any
area_triangle
_h3RotatePent60cw
ST_Intersects_Polygon_Point
ST_DWithin_LineString_Polygon
ST_Intersects_Point_Polygon
box_contains_box
SHOW [EFFECTIVE] POLICIES <name>;
show queries;
query_session_id|current_status|submitted          |query_str                                                   |login_name|client_address     |db_name   |exec_device_type
834-8VAA        |Pending       |2020-05-06 08:21:15|select d_date_sk, count(1) from date_dim group by d_date_sk;|admin     |tcp:localhost:48596|tpcds_sf10|CPU
826-CLKk        |Running       |2020-05-06 08:20:57|select count(1) from store_sales, store_returns;            |admin     |tcp:localhost:48592|tpcds_sf10|CPU
828-V6s7        |Pending       |2020-05-06 08:21:13|select count(1) from store_sales;                           |admin     |tcp:localhost:48594|tpcds_sf10|GPU
946-rtJ7        |Pending       |2020-05-06 08:20:58|select count(1) from item;                                  |admin     |tcp:localhost:48610|tpcds_sf10|GPU
SHOW [EFFECTIVE] ROLES <name>
SHOW RUNTIME [TABLE] FUNCTIONS
SHOW RUNTIME [TABLE] FUNCTION DETAILS
show supported data sources
SHOW TABLE DETAILS [<table-name>, <table-name>, ...]
omnisql> show table details;
table_id|table_name       |column_count|is_sharded_table|shard_count|max_rows           |fragment_size|max_rollback_epochs|min_epoch|max_epoch|min_epoch_floor|max_epoch_floor|metadata_file_count|total_metadata_file_size|total_metadata_page_count|total_free_metadata_page_count|data_file_count|total_data_file_size|total_data_page_count|total_free_data_page_count
1       |heavyai_states   |11          |false           |0          |4611686018427387904|32000000     |-1                 |1        |1        |0              |0              |1                  |16777216                |4096                     |4082                          |1              |536870912           |256                  |242
2       |heavyai_counties |13          |false           |0          |4611686018427387904|32000000     |-1                 |1        |1        |0              |0              |1                  |16777216                |4096                     |NULL                          |1              |536870912           |256                  |NULL
3       |heavyai_countries|71          |false           |0          |4611686018427387904|32000000     |-1                 |1        |1        |0              |0              |1                  |16777216                |4096                     |4022                          |1              |536870912           |256                  |182
omnisql> show table details heavyai_states;
table_id|table_name    |column_count|is_sharded_table|shard_count|max_rows           |fragment_size|max_rollback_epochs|min_epoch|max_epoch|min_epoch_floor|max_epoch_floor|metadata_file_count|total_metadata_file_size|total_metadata_page_count|total_free_metadata_page_count|data_file_count|total_data_file_size|total_data_page_count|total_free_data_page_count
1       |heavyai_states|11          |false           |0          |4611686018427387904|32000000     |-1                 |1        |1        |0              |0              |1                  |16777216                |4096                     |4082                          |1              |536870912           |256                  |242
SHOW TABLE FUNCTIONS;
tf_compute_dwell_times
tf_feature_self_similarity
tf_feature_similarity
tf_rf_prop
tf_rf_prop_max_signal
tf_geo_rasterize_slope
tf_geo_rasterize
generate_random_strings
generate_series
tf_mandelbrot_cuda_float
tf_mandelbrot_cuda
tf_mandelbrot_float
tf_mandelbrot
SHOW TABLE FUNCTIONS DETAILS <function_name>

name

generate_series

signature

(i64 series_start, i64 series_stop, i64 series_step) (i64 series_start, i64 series_stop) -> Column

input_names

series_start, series_stop, series_step series_start, series_stop

input_types

i64

output_names

generate_series

output_types

Column i64

CPU

true

GPU

true

runtime

false

filter_table_transpose

false

SHOW SERVERS;
server_name|data_wrapper|created_at|options
default_local_delimited|DELIMITED_FILE|2022-03-15 10:06:05|{"STORAGE_TYPE":"LOCAL_FILE"}
default_local_parquet|PARQUET_FILE|2022-03-15 10:06:05|{"STORAGE_TYPE":"LOCAL_FILE"}
default_local_regex_parsed|REGEX_PARSED_FILE|2022-03-15 10:06:05|{"STORAGE_TYPE":"LOCAL_FILE"}
...
SHOW TABLES;
table_name
----------
omnisci_states
omnisci_counties
omnisci_countries
streets_nyc
streets_miami
...
SHOW USER DETAILS
NAME            ID         DEFAULT_DB 
mike.nuumann    191        mondale
Dale            184        churchill
Editor_Test     141        mondale
Jerry.wong      181        alluvial
AA_superuser    139        
BB_superuser    2140
PlinyTheElder   183        windsor
aaron.tyre      241        db1
achristie       243        sid
eve.mandela     202        nancy
...
heavysql> show all user details;
NAME|ID|IS_SUPER|DEFAULT_DB|CAN_LOGIN
admin|0|true|(-1)|true
ua|2|false|db1(2)|true
ub|3|false|db1(2)|true
uc|4|false|db1(2)|false
ud|5|false|db2(3)|true
ue|6|false|db2(3)|true
uf|7|false|db2(3)|false
heavysql> \db db2
User admin switched to database db2

heavysql> show all user details ue, ud, uf, ua;
NAME|ID|IS_SUPER|DEFAULT_DB|CAN_LOGIN
ua|2|false|db1(2)|true
ud|5|false|db2(3)|true
ue|6|false|db2(3)|true
uf|7|false|db2(3)|false
heavysql> show user details ue, ud, uf, ua;
User "ua" not found. 
heavysql> show user details ue, ud, uf;
NAME|ID|DEFAULT_DB|CAN_LOGIN
ud|5|db2(3)|true
ue|6|db2(3)|true
uf|7|db2(3)|false
heavysql> show user details;
NAME|ID|DEFAULT_DB|CAN_LOGIN
ud|5|db2(3)|true
ue|6|db2(3)|true
uf|7|db2(3)|false
heavysql> \db
User ua is using database db1
heavysql> show all user details;
SHOW ALL USER DETAILS is only available to superusers. (Try SHOW USER DETAILS instead?)
heavysql> show user details;
NAME|ID|DEFAULT_DB
ua|2|db1
ub|3|db1
heavysql> show user details ua, ub, uc;
User "uc" not found.
heavysql> show user details ua;
NAME|ID|DEFAULT_DB
ua|2|db1
SHOW USER SESSIONS;
session_id   login_name   client_address         db_name
453-X6ds     mike         http:198.51.100.1      game_results
453-0t2r     erin         http:198.51.100.11     game_results
421-B64s     shauna       http:198.51.100.43     game_results
213-06dw     ahmed        http:198.51.100.12     signals
333-R28d     cat          http:198.51.100.233    signals
497-Xyz6     inez         http:198.51.100.5      ships
...
show queries;
query_session_id|current_status      |executor_id|submitted     |query_str       |login_name|client_address            |db_name|exec_device_type
713-t1ax        |PENDING_QUEUE       |0          |2021-08-03 ...|SELECT ...      |John      |http:::1                  |omnisci|GPU
491-xpfb        |PENDING_QUEUE       |0          |2021-08-03 ...|SELECT ...      |Patrick   |http:::1                  |omnisci|GPU
451-gp2c        |PENDING_QUEUE       |0          |2021-08-03 ...|SELECT ...      |John      |http:::1                  |omnisci|GPU
190-5pax        |PENDING_EXECUTOR    |1          |2021-08-03 ...|SELECT ...      |Cavin     |http:::1                  |omnisci|GPU
720-nQtV        |RUNNING_QUERY_KERNEL|2          |2021-08-03 ...|SELECT ...      |Cavin     |tcp:::ffff:127.0.0.1:50142|omnisci|GPU
947-ooNP        |RUNNING_IMPORTER    |0          |2021-08-03 ...|IMPORT_GEO_TABLE|Rio       |tcp:::ffff:127.0.0.1:47314|omnisci|CPU
kill query '946-ooNP'
show queries;
query_session_id|current_status      |executor_id|submitted     |query_str       |login_name|client_address            |db_name|exec_device_type
713-t1ax        |PENDING_QUEUE       |0          |2021-08-03 ...|SELECT ...      |John      |http:::1                  |omnisci|GPU
491-xpfb        |PENDING_QUEUE       |0          |2021-08-03 ...|SELECT ...      |Patrick   |http:::1                  |omnisci|GPU
451-gp2c        |PENDING_QUEUE       |0          |2021-08-03 ...|SELECT ...      |John      |http:::1                  |omnisci|GPU
190-5pax        |PENDING_EXECUTOR    |1          |2021-08-03 ...|SELECT ...      |Cavin     |http:::1                  |omnisci|GPU
720-nQtV        |RUNNING_QUERY_KERNEL|2          |2021-08-03 ...|SELECT ...      |Cavin     |tcp:::ffff:127.0.0.1:50142|omnisci|GPU
INSERT INTO <table> (column1, ...) VALUES (row_1_value_1, ...), ..., (row_n_value_1, ...);
CREATE TABLE ar (ai INT[], af FLOAT[], ad2 DOUBLE[2]); 
INSERT INTO ar VALUES ({1,2,3},{4.0,5.0},{1.2,3.4}); 
INSERT INTO ar VALUES (ARRAY[NULL,2],NULL,NULL); 
INSERT INTO ar VALUES (NULL,{},{2.0,NULL});
-- or a multi-row insert equivalent
INSERT INTO ar VALUES ({1,2,3},{4.0,5.0},{1.2,3.4}), (ARRAY[NULL,2],NULL,NULL), (NULL,{},{2.0,NULL});
INSERT INTO destination_table SELECT * FROM source_table;
INSERT INTO destination_table (id, name, age, gender) SELECT * FROM source_table;
INSERT INTO destination_table (name, gender, age, id) SELECT name, gender, age, id  FROM source_table;
INSERT INTO votes_summary (vote_id, vote_count) SELECT vote_id, COUNT(*) FROM votes GROUP BY vote_id;
CREATE TABLE tbl (
   id INTEGER NOT NULL, 
   name TEXT NOT NULL DEFAULT 'John Doe', 
   age SMALLINT NOT NULL);

Expression

Description

LIKELY(X)

Provides a hint to the query planner that argument X is a Boolean value that is usually true. The planner can prioritize filters on the value X earlier in the execution cycle and return results more efficiently.

UNLIKELY(X)

Provides a hint to the query planner that argument X is a Boolean value that is usually not true. The planner can prioritize filters on the value X later in the execution cycle and return results more efficiently.

SELECT COUNT(*) FROM test WHERE UNLIKELY(x IN (7, 8, 9, 10)) AND y > 42;
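
The LIKELY hint is used the same way. A minimal sketch, assuming a condition on the same test table that is usually true:

SELECT COUNT(*) FROM test WHERE LIKELY(x < 1000) AND y > 42;
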
EXPLAIN <STMT>
EXPLAIN CALCITE <STMT>
heavysql> EXPLAIN CALCITE (SELECT * FROM movies);
Explanation
LogicalProject(movieId=[$0], title=[$1], genres=[$2])
   LogicalTableScan(TABLE=[[CATALOG, heavyai, MOVIES]])
heavysql> EXPLAIN calcite (SELECT * FROM movies ORDER BY title);
Explanation
LogicalSort(sort0=[$1], dir0=[ASC])
   LogicalProject(movieId=[$0], title=[$1], genres=[$2])
      LogicalTableScan(TABLE=[[CATALOG, omnisci, MOVIES]])
heavysql> EXPLAIN calcite SELECT bc.firstname, bc.lastname, b.title, bo.orderdate, s.name
FROM book b, book_customer bc, book_order bo, shipper s
WHERE bo.cust_id = bc.cust_id AND b.book_id = bo.book_id AND bo.shipper_id = s.shipper_id
AND s.name = 'UPS';
Explanation
LogicalProject(firstname=[$5], lastname=[$6], title=[$2], orderdate=[$11], name=[$14])
    LogicalFilter(condition=[AND(=($9, $4), =($0, $8), =($10, $13), =($14, 'UPS'))])
        LogicalJoin(condition=[true], joinType=[INNER])
            LogicalJoin(condition=[true], joinType=[INNER])
                LogicalJoin(condition=[true], joinType=[INNER])
                    LogicalTableScan(TABLE=[[CATALOG, omnisci, BOOK]])
                    LogicalTableScan(TABLE=[[CATALOG, omnisci, BOOK_CUSTOMER]])
                LogicalTableScan(TABLE=[[CATALOG, omnisci, BOOK_ORDER]])
            LogicalTableScan(TABLE=[[CATALOG, omnisci, SHIPPER]])
heavysql> EXPLAIN calcite SELECT bc.firstname, bc.lastname, b.title, bo.orderdate, s.name
FROM book_order bo, book_customer bc, book b, shipper s
WHERE bo.cust_id = bc.cust_id AND bo.book_id = b.book_id AND bo.shipper_id = s.shipper_id
AND s.name = 'UPS';
Explanation
LogicalProject(firstname=[$10], lastname=[$11], title=[$7], orderdate=[$3], name=[$14])
    LogicalFilter(condition=[AND(=($1, $9), =($5, $0), =($2, $13), =($14, 'UPS'))])
        LogicalJoin(condition=[true], joinType=[INNER])
            LogicalJoin(condition=[true], joinType=[INNER])
                LogicalJoin(condition=[true], joinType=[INNER])
                  LogicalTableScan(TABLE=[[CATALOG, omnisci, BOOK_ORDER]])
                  LogicalTableScan(TABLE=[[CATALOG, omnisci, BOOK_CUSTOMER]])
                LogicalTableScan(TABLE=[[CATALOG, omnisci, BOOK]])
            LogicalTableScan(TABLE=[[CATALOG, omnisci, SHIPPER]])
heavysql> EXPLAIN CALCITE SELECT x, SUM(y) FROM test GROUP BY x;
Explanation
LogicalAggregate(group=[{0}], EXPR$1=[SUM($1)])
  LogicalProject(x=[$0], y=[$2])
    LogicalTableScan(table=[[testDB, test]])
heavysql> EXPLAIN CALCITE DETAILED SELECT x, SUM(y) FROM test GROUP BY x;
Explanation
LogicalAggregate(group=[{0}], EXPR$1=[SUM($1)])	{[$1->db:testDB,tableName:test,colName:y]}
  LogicalProject(x=[$0], y=[$2])	{[$2->db:testDB,tableName:test,colName:y], [$0->db:testDB,tableName:test,colName:x]}
    LogicalTableScan(table=[[testDB, test]])

Logical Operators and Conditional and Subquery Expressions

Logical Operator Support

Operator

Description

AND

Logical AND

NOT

Negates value

OR

Logical OR

Conditional Expression Support

Expression

Description

CASE WHEN condition THEN result ELSE default END

Case operator

COALESCE(val1, val2, ..)

Returns the first non-null value in the list

Geospatial and array column projections are not supported in the COALESCE function and CASE expressions.
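
As an illustration of both expressions, a minimal sketch assuming a hypothetical orders table with an amount column and a nullable discount column:

SELECT
  CASE WHEN amount > 100 THEN 'large' ELSE 'small' END AS order_size,
  COALESCE(discount, 0) AS effective_discount
FROM orders;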

Subquery Expression Support

Expression

Description

expr IN (subquery or list of values)

Evaluates whether expr equals any value of the IN list.

expr NOT IN (subquery or list of values)

Evaluates whether expr does not equal any value of the IN list.

Usage Notes

  • You can use a subquery anywhere an expression can be used, subject to any runtime constraints of that expression. For example, a subquery in a CASE statement must return exactly one row, but a subquery can return multiple values to an IN expression.

  • You can use a subquery anywhere a table is allowed (for example, FROM subquery), using aliases to name any reference to the table and columns returned by the subquery.
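
The notes above can be illustrated with a minimal sketch, assuming hypothetical orders and customers tables:

-- Subquery returning multiple values to an IN expression
SELECT COUNT(*) FROM orders
WHERE customer_id IN (SELECT id FROM customers WHERE region = 'EMEA');

-- Subquery used in place of a table, named with an alias
SELECT t.customer_id, t.total
FROM (SELECT customer_id, SUM(amount) AS total FROM orders GROUP BY customer_id) t
WHERE t.total > 1000;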

UPDATE

Changes the values of the specified columns based on the assign argument (identifier=expression) in all rows that satisfy the condition in the WHERE clause.

UPDATE table_name SET assign [, assign ]* [ WHERE booleanExpression ]

Example

UPDATE UFOs SET shape='ovate' where shape='eggish';

Currently, HEAVY.AI does not support updating a geo column type (POINT, MULTIPOINT, LINESTRING, MULTILINESTRING, POLYGON, or MULTIPOLYGON) in a table.

Update Via Subquery

You can update a table via subquery, which allows you to update based on calculations performed on another table.

Examples

UPDATE test_facts SET lookup_id = (SELECT SAMPLE(test_lookup.id) 
FROM test_lookup WHERE test_lookup.val = test_facts.val);
UPDATE test_facts SET val = val+1, lookup_id = (SELECT SAMPLE(test_lookup.id)
FROM test_lookup WHERE test_lookup.val = test_facts.val);
UPDATE test_facts SET lookup_id = (SELECT SAMPLE(test_lookup.id) 
FROM test_lookup WHERE test_lookup.val = test_facts.val) WHERE id < 10;

Cross-Database Queries

In Release 6.4 and higher, you can run UPDATE queries across tables in different databases on the same HEAVY.AI cluster without having to first connect to those databases.

To execute queries against another database, you must have ACCESS privilege on that database, as well as UPDATE privilege.

Example

Update a row in a table in the my_other_db database:

UPDATE my_other_db.customers SET name = 'Joe' WHERE id = 10;

Type Casts

Expression

Example

Description

CAST(expr AS type)

CAST(1.25 AS FLOAT)

Converts an expression to another data type. For conversions from a TEXT type, use TRY_CAST.

TRY_CAST(text_expr AS type)

TRY_CAST('1.25' AS FLOAT)

Converts a text to a non-text type, returning null if the conversion could not be successfully performed.

ENCODE_TEXT(none_encoded_str)

ENCODE_TEXT(long_str)

Converts a none-encoded text type to a dictionary-encoded text type.
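
A minimal sketch of these expressions, assuming a hypothetical table my_table with a none-encoded TEXT column long_str:

SELECT CAST(1.25 AS INTEGER);               -- numeric-to-numeric cast
SELECT TRY_CAST('1.25' AS FLOAT);           -- text-to-number cast; returns null on failure
SELECT ENCODE_TEXT(long_str) FROM my_table; -- none-encoded text to dictionary-encoded text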

The following table shows cast type conversion support.

FROM/TO|TINYINT|SMALLINT|INTEGER|BIGINT|FLOAT|DOUBLE|DECIMAL|TEXT|BOOLEAN|DATE|TIME|TIMESTAMP
TINYINT|-|Yes|Yes|Yes|Yes|Yes|Yes|Yes|No|No|No|n/a
SMALLINT|Yes|-|Yes|Yes|Yes|Yes|Yes|Yes|No|No|No|n/a
INTEGER|Yes|Yes|-|Yes|Yes|Yes|Yes|Yes|Yes|No|No|No
BIGINT|Yes|Yes|Yes|-|Yes|Yes|Yes|Yes|No|No|No|No
FLOAT|Yes|Yes|Yes|Yes|-|Yes|No|Yes|No|No|No|No
DOUBLE|Yes|Yes|Yes|Yes|Yes|-|No|Yes|No|No|No|n/a
DECIMAL|Yes|Yes|Yes|Yes|Yes|Yes|-|Yes|No|No|No|n/a
TEXT|Yes (Use TRY_CAST)|Yes (Use TRY_CAST)|Yes (Use TRY_CAST)|Yes (Use TRY_CAST)|Yes (Use TRY_CAST)|Yes (Use TRY_CAST)|Yes (Use TRY_CAST)|-|Yes (Use TRY_CAST)|Yes (Use TRY_CAST)|Yes (Use TRY_CAST)|Yes (Use TRY_CAST)
BOOLEAN|No|No|Yes|No|No|No|No|Yes|-|n/a|n/a|n/a
DATE|No|No|No|No|No|No|No|Yes|n/a|-|No|Yes
TIME|No|No|No|No|No|No|No|Yes|n/a|No|-|n/a
TIMESTAMP|No|No|No|No|No|No|No|Yes|n/a|Yes|No|-

generate_series

generate_series (Integers)

Generate a series of integer values.

SELECT * FROM TABLE(
    generate_series(
        <series_start>,
        <series_end>
        [, <series_step>]
    )
)

Input Arguments

Parameter
Description
Data Types

<series_start>

Starting integer value, inclusive.

BIGINT

<series_end>

Ending integer value, inclusive.

BIGINT

<series_step> (optional, defaults to 1)

Increment by which each successive value in the series increases or decreases. Integer.

BIGINT

Output Columns

Name
Description
Data Types

generate_series

The integer series specified by the input arguments.

Column<BIGINT>

Example

heavysql> select * from table(generate_series(2, 10, 2)); 
series 
2 
4 
6 
8 
10 
5 rows returned.

heavysql> select * from table(generate_series(8, -4, -3)); 
series 
8 
5 
2 
-1 
-4
5 rows returned.

generate_series (Timestamps)

Generate a series of timestamp values from start_timestamp to end_timestamp .

SELECT * FROM TABLE(
    generate_series(
        <series_start>,
        <series_end>,
        <series_step>
    )
)

Input Arguments

Parameter
Description
Data Types

series_start

Starting timestamp value, inclusive.

TIMESTAMP(9) (Timestamp literals with other precisions will be auto-casted to TIMESTAMP(9) )

series_end

Ending timestamp value, inclusive.

TIMESTAMP(9) (Timestamp literals with other precisions will be auto-casted to TIMESTAMP(9) )

series_step

Time/Date interval signifying step between each element in the returned series.

INTERVAL

Output Columns

Name
Description
Output Types

generate_series

The timestamp series specified by the input arguments.

COLUMN<TIMESTAMP(9)>

Example

SELECT
  generate_series AS ts
FROM
  TABLE(
    generate_series(
      TIMESTAMP(0) '2021-01-01 00:00:00',
      TIMESTAMP(0) '2021-09-04 00:00:00',
      INTERVAL '1' MONTH
    )
  )
  ORDER BY ts;
  
ts
2021-01-01 00:00:00.000000000
2021-02-01 00:00:00.000000000
2021-03-01 00:00:00.000000000
2021-04-01 00:00:00.000000000
2021-05-01 00:00:00.000000000
2021-06-01 00:00:00.000000000
2021-07-01 00:00:00.000000000
2021-08-01 00:00:00.000000000
2021-09-01 00:00:00.000000000

tf_compute_dwell_times

Given a query input with entity keys (for example, user IP addresses) and timestamps (for example, page visit timestamps), and parameters specifying the minimum session time, the minimum number of session records, and the max inactive seconds, outputs all unique sessions found in the data with the duration of the session (dwell time).

Syntax

select * from table(
  tf_compute_dwell_times(
    data => CURSOR(
      select
        entity_id,
        site_id,
        ts
      from
        <table>
      where
        ...
    ),
    min_dwell_seconds => <seconds>,
    min_dwell_points => <points>,
    max_inactive_seconds => <seconds>
  )
);

Input Arguments

Parameter
Description
Data Type

entity_id

Column containing keys/IDs used to identify the entities for which dwell/session times are to be computed. Examples include IP addresses of clients visiting a website, login IDs of database users, MMSIs of ships, and call signs of airplanes.

Column<TEXT ENCODING DICT | BIGINT>

site_id

Column containing keys/IDs of dwell “sites” or locations that entities visit. Examples include website pages, database session IDs, ports, airport names, or binned h3 hex IDs for geographic location.

Column<TEXT ENCODING DICT | BIGINT>

ts

Column denoting the time at which an event occurred.

Column<TIMESTAMP(0|3|6|9)>

min_dwell_seconds

Constant integer value specifying the minimum number of seconds required between the first and last timestamp-ordered record for an entity_id at a site_id to constitute a valid session and compute and return an entity’s dwell time at a site. For example, if this variable is set to 3600 (one hour), but only 1800 seconds elapses between an entity’s first and last ordered timestamp records at a site, these records are not considered a valid session and a dwell time for that session is not calculated.

BIGINT (other integer types are automatically casted to BIGINT)

min_dwell_points

A constant integer value specifying the minimum number of successive observations (in ts timestamp order) required to constitute a valid session and compute and return an entity’s dwell time at a site. For example, if this variable is set to 3, but only two consecutive records exist for a user at a site before they move to a new site, no dwell time is calculated for the user.

BIGINT (other integer types are automatically casted to BIGINT)

max_inactive_seconds

A constant integer value specifying the maximum time in seconds between two successive observations for an entity at a given site before the current session/dwell time is considered finished and a new session/dwell time is started. For example, if this variable is set to 86400 seconds (one day), and the time gap between two successive records for an entity id at a given site id is 86500 seconds, the session is considered ended at the first timestamp-ordered record, and a new session is started at the timestamp of the second record.

BIGINT (other integer types are automatically casted to BIGINT)

Output Columns

Name
Description
Data Type

entity_id

The ID of the entity for the output dwell time, identical to the corresponding entity_id column in the input.

Column<TEXT ENCODING DICT> | Column<BIGINT> (type is the same as the entity_id input column type)

site_id

The site ID for the output dwell time, identical to the corresponding site_id column in the input.

Column<TEXT ENCODING DICT> | Column<BIGINT> (type is the same as the site_id input column type)

prev_site_id

The site ID for the session preceding the current session, which might be a different site_id, the same site_id (if successive records for an entity at the same site were split into multiple sessions because the max_inactive_seconds threshold was exceeded), or null if the last site_id visited was null.

Column<TEXT ENCODING DICT> | Column<BIGINT> (type is the same as the site_id input column type)

next_site_id

The site ID for the session after the current session, which might be a different site_id, the same site_id (if successive records for an entity at the same site were split into multiple sessions because the max_inactive_seconds threshold was exceeded), or null if the next site_id visited was null.

Column<TEXT ENCODING DICT> | Column<BIGINT> (type will be the same as the site_id input column type)

session_id

An auto-incrementing session ID specific/relative to the current entity_id, starting from 1 (first session) up to the total number of valid sessions for an entity_id, such that each valid session dwell time increments the session_id for an entity by 1.

Column<INT>

start_seq_id

The index of the nth timestamp (ts-ordered) record for a given entity denoting the start of the current output row's session.

Column<INT>

dwell_time_sec

The duration in seconds for the session.

Column<INT>

num_dwell_points

The number of records/observations constituting the current output row's session.

Column<INT>

Example

/* Data from https://www.kaggle.com/datasets/vodclickstream/netflix-audience-behaviour-uk-movies */

select
  *
from
  table(
    tf_compute_dwell_times(
      data => cursor(
        select
          user_id,
          movie_id,
          ts
        from
          netflix_audience_behavior
      ),
      min_dwell_points => 3,
      min_dwell_seconds => 600,
      max_inactive_seconds => 10800
    )
  )
order by
  num_dwell_points desc
limit
  10;

entity_id|site_id|prev_site_id|next_site_id|session_id|start_seq_id|ts|dwell_time_sec|num_dwell_points
59416738c3|cbdf9820bc|d058594d1c|863b39bbe8|2|19|2017-02-21 15:12:11.000000000|4391|54
16d994f6dd|1bae944666|4f1cf3c2dc|NULL|5|61|2017-11-11 20:27:02.000000000|9570|36
3675d9ba4a|948f2b5bf6|948f2b5bf6|69cb38018a|2|11|2018-11-26 18:42:52.000000000|3600|34
da01959c0b|fd711679f9|1f579d43c3|NULL|5|90|2019-03-21 05:37:22.000000000|7189|31
23c52f9b50|df00041e47|df00041e47|NULL|2|39|2019-01-21 15:53:33.000000000|1227|29
da01959c0b|8ab46a0cb1|f1fffa6ff4|1f579d43c3|3|29|2019-03-12 04:33:01.000000000|6026|29
23c52f9b50|df00041e47|NULL|df00041e47|1|10|2019-01-21 15:33:39.000000000|1194|28
da01959c0b|1f579d43c3|8ab46a0cb1|fd711679f9|4|63|2019-03-17 02:01:49.000000000|7240|27
3261cb81a5|1cb40406ae|NULL|NULL|1|2|2019-04-28 20:48:24.000000000|11240|27
dbed64ce9e|c5830185ca|NULL|NULL|1|3|2019-03-01 06:43:32.000000000|7261|25

generate_random_strings

Generates random string data.

SELECT * FROM TABLE(generate_random_strings(<num_strings>, <string_length>))

Input Arguments

Parameter
Description
Data Type

<num_strings>

The number of strings to randomly generate.

BIGINT

<string_length>

Length of the generated strings.

BIGINT

Output Columns

Name
Description
Data Type

id

Integer id of output, starting at 0 and increasing monotonically

Column<BIGINT>

rand_str

Random String

Column<TEXT ENCODING DICT>

Example

heavysql> SELECT * FROM TABLE(generate_random_strings(10, 20));
id|rand_str
0 |He9UeknrGYIOxHzh5OZC
1 |Simnx7WQl1xRihLiH56u
2 |m5H1lBTOErpS8is00YJ
3 |eeDiNHfKzVQsSg0qHFS0
4 |JwOhUoQEI6Z0L78mj8jo
5 |kBTbSIMm25dvf64VMi
6 |W3lUUvC5ajm0W24JML
7 |XdtSQfdXQ85nvaIoyYUY
8 |iUTfGN5Jaj25LjGJhiRN
9 |72GUoTK2BzcBJVTgTGW

The following Calcite relational operators (methods) can appear in EXPLAIN CALCITE output.

Method

Description

LogicalAggregate

Operator that eliminates duplicates and computes totals.

LogicalCalc

Expression that computes project expressions and also filters.

LogicalChi

Operator that converts a stream to a relation.

LogicalCorrelate

Operator that performs nested-loop joins.

LogicalDelta

Operator that converts a relation to a stream.

LogicalExchange

Expression that imposes a particular distribution on its input without otherwise changing its content.

LogicalFilter

Expression that iterates over its input and returns elements for which a condition evaluates to true.

LogicalIntersect

Expression that returns the intersection of the rows of its inputs.

LogicalJoin

Expression that combines two relational expressions according to some condition.

LogicalMatch

Expression that represents a MATCH_RECOGNIZE node.

LogicalMinus

Expression that returns the rows of its first input minus any matching rows from its other inputs. Corresponds to the SQL EXCEPT operator.

LogicalProject

Expression that computes a set of ‘select expressions’ from its input relational expression.

LogicalSort

Expression that imposes a particular sort order on its input without otherwise changing its content.

LogicalTableFunctionScan

Expression that calls a table-valued function.

LogicalTableModify

Expression that modifies a table. Similar to TableScan, but represents a request to modify a table instead of read from it.

LogicalTableScan

Reads all the rows from a RelOptTable.

LogicalUnion

Expression that returns the union of the rows of its inputs, optionally eliminating duplicates.

LogicalValues

Expression for which the value is a sequence of zero or more literal row values.

LogicalWindow

Expression representing a set of window aggregates.

Geospatial Capabilities

HEAVY.AI supports a subset of object types and functions for storing and writing queries for geospatial definitions.

Geospatial Datatypes

Type

Size

Example

LINESTRING

Variable

A sequence of 2 or more points and the lines that connect them. For example: LINESTRING(0 0,1 1,1 2)

MULTIPOLYGON

Variable

A set of one or more polygons. For example:MULTIPOLYGON(((0 0,4 0,4 4,0 4,0 0),(1 1,2 1,2 2,1 2,1 1)), ((-1 -1,-1 -2,-2 -2,-2 -1,-1 -1)))

POINT

Variable

A point described by two coordinates. When the coordinates are longitude and latitude, HEAVY.AI stores longitude first, and then latitude. For example: POINT(0 0)

POLYGON

Variable

A set of one or more rings (closed line strings), with the first representing the shape (external ring) and the rest representing holes in that shape (internal rings). For example: POLYGON((0 0,4 0,4 4,0 4,0 0),(1 1, 2 1, 2 2, 1 2,1 1))

MULTIPOINT

Variable

A set of one or more points. For example: MULTIPOINT((0 0), (1 1), (2 2))

MULTILINESTRING

Variable

A set of one or more associated lines, each of two or more points. For example: MULTILINESTRING((0 0, 1 0, 2 0), (0 1, 1 1, 2 1))

CREATE TABLE simple_geo (
                          name TEXT ENCODING DICT(32), 
                          location GEOMETRY(POINT,4326)
                         );

If you do not set the SRID of the geo field in the table, you can set it in a SQL query using ST_SETSRID(column_name, SRID). For example, ST_SETSRID(a.pt,4326).

When representing longitude and latitude, the first coordinate is assumed to be longitude in HEAVY.AI geospatial primitives.

You create geospatial objects as geometries (planar spatial data types), which are supported by the planar geometry engine at run time. When you call ST_DISTANCE on two geometry objects, the engine returns the shortest straight-line planar distance, in degrees, between those points. For example, the following query returns the shortest distance between the point(s) in p1 and the polygon(s) in poly1:

SELECT ST_DISTANCE(p1, poly1) FROM geo1;

Geospatial Literals

Geospatial functions that expect geospatial object arguments accept geospatial columns, geospatial objects returned by other functions, or string literals containing WKT representations of geospatial objects. Supplying a WKT string is equivalent to calling a geometry constructor. For example, these two queries are identical:

SELECT COUNT(*) FROM geo1 WHERE ST_DISTANCE(p1, 'POINT(1 2)') < 1.0;
SELECT COUNT(*) FROM geo1 WHERE ST_DISTANCE(p1, ST_GeomFromText('POINT(1 2)')) < 1.0;

You can create geospatial literals with a specific SRID. For example:

SELECT ST_CONTAINS(
                     mpoly2, 
                     ST_GeomFromText('POINT(-71.064544 42.28787)', 4326)
                   )
                   FROM geo2;

Support for Geography

HEAVY.AI provides support for geography objects and geodesic distance calculations, with some limitations.

Exporting Coordinates from Immerse

HeavyDB supports import from any coordinate system supported by the Geospatial Data Abstraction Library (GDAL). On import, HeavyDB converts coordinates to WGS84 and stores them in that encoding, and rendering is accurate in Immerse.

However, no built-in way to reference the original coordinates currently exists in Immerse, and coordinates exported from Immerse will be WGS84 coordinates. You can work around this limitation by adding to the dataset a column or columns in non-geo format that could be included for display in Immerse (for example, in a popup) or on export.

Distance Calculation

Currently, HEAVY.AI supports spheroidal distance calculation between:

  • Two points using either SRID 4326 or 900913.

  • A point and a polygon/multipolygon using SRID 900913.

Using SRID 900913 results in variance compared to SRID 4326 as polygons approach the North and South Poles.

The following query returns the points and polygons within 1,000 meters of each other:

SELECT a.poly_name, b.pt_name FROM poly a, pt b 
WHERE ST_Distance(
   ST_Transform(a.heavyai_geo, 900913),
   ST_Transform(b.location, 900913))<1000;

Geospatial Functions

HEAVY.AI supports the functions listed.

Geometry Constructors

Function

Description

ST_Centroid

Computes the geometric center of a geometry as a POINT.

ST_GeomFromText(WKT)

Return a specified geometry value from Well-known Text representation.

ST_GeomFromText(WKT, SRID)

Return a specified geometry value from Well-known Text representation and an SRID.

ST_GeogFromText(WKT)

Return a specified geography value from Well-known Text representation.

ST_GeogFromText(WKT, SRID)

Return a specified geography value from Well-known Text representation and an SRID.

ST_Point(double lon, double lat)

Return a point constructed on the fly from the provided coordinate values. Constant coordinates result in construction of a POINT literal.

Example: ST_Contains(poly4326, ST_SetSRID(ST_Point(lon, lat), 4326))

Geometry to String Conversion

Function
Description

ST_AsText(geom) | ST_AsWKT(geom)

Converts a geometry input to a Well-Known-Text (WKT) string

ST_AsBinary(geom) | ST_AsWKB(geom)

Converts a geometry input to a Well-Known-Binary (WKB) string
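
A minimal sketch, assuming the geo1 table with POINT column p1 used in earlier examples:

SELECT ST_AsText(p1) FROM geo1;   -- WKT string, for example 'POINT (1 2)'
SELECT ST_AsBinary(p1) FROM geo1; -- WKB representation of the same point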

Geometry Processing

Function
Description

ST_Buffer

Returns a geometry covering all points within a specified distance from the input geometry. Performed by the GEOS module. The output is currently limited to the MULTIPOLYGON type.

Calculations are in the units of the input geometry’s SRID. Buffer distance is expressed in the same units. Example:

SELECT ST_Buffer('LINESTRING(0 0, 10 0, 10 10)', 1.0);

Special processing is automatically applied to WGS84 input geometries (SRID=4326) to limit buffer distortion:

  • Implementation first determines the best planar SRID to which to project the 4326 input geometry.

  • Preferred SRIDs are UTM and Lambert (LAEA) North/South zones, with Mercator used as a fallback.

  • Buffer distance is interpreted as distance in meters (units of all planar SRIDs being considered).

  • The input geometry is transformed to the best planar SRID and handed to GEOS, along with buffer distance.

  • The buffer geometry built by GEOS is then transformed back to SRID=4326 and returned.

Example: Build 10-meter buffer geometries (SRID=4326) with limited distortion:

SELECT ST_Buffer(poly4326, 10.0) FROM tbl;

ST_Centroid

Computes the geometric center of a geometry as a POINT.


Geometry Editors

Function

Description

ST_TRANSFORM

Returns a geometry with its coordinates transformed to a different spatial reference. Currently, WGS84 to Web Mercator transform is supported. For example:ST_DISTANCE( ST_TRANSFORM(ST_GeomFromText('POINT(-71.064544 42.28787)', 4326), 900913), ST_GeomFromText('POINT(-13189665.9329505 3960189.38265416)', 900913) )

ST_TRANSFORM is not currently supported in projections. It can be used only to transform geo inputs to other functions, such as ST_DISTANCE.

ST_SETSRID

Sets the spatial reference identifier (SRID) of a geometry. For example:

ST_TRANSFORM(
ST_SETSRID(ST_GeomFromText('POINT(-71.064544 42.28787)'), 4326), 900913 )

Geometry Accessors

Function

Description

ST_X

Returns the X value from a POINT column.

ST_Y

Returns the Y value from a POINT column.

ST_XMIN

Returns X minima of a geometry.

ST_XMAX

Returns X maxima of a geometry.

ST_YMIN

Returns Y minima of a geometry.

ST_YMAX

Returns Y maxima of a geometry.

ST_STARTPOINT

Returns the first point of a LINESTRING as a POINT.

ST_ENDPOINT

Returns the last point of a LINESTRING as a POINT.

ST_POINTN

Return the Nth point of a LINESTRING as a POINT.

ST_NPOINTS

Returns the number of points in a geometry.

ST_NRINGS

Returns the number of rings in a POLYGON or a MULTIPOLYGON.

ST_SRID

Returns the spatial reference identifier for the underlying object.

ST_NUMGEOMETRIES

Returns the number of geometries in a MULTIPOINT, MULTILINESTRING, or MULTIPOLYGON. Returns 1 for non-MULTI geometry types.
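
A minimal sketch of several accessors, assuming the geo1 table with POINT column p1 from earlier examples and a hypothetical LINESTRING column l1:

SELECT ST_X(p1), ST_Y(p1), ST_SRID(p1) FROM geo1;
SELECT ST_NPOINTS(l1), ST_AsText(ST_STARTPOINT(l1)) FROM geo1;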

Overlay Functions

Function

Description

ST_INTERSECTION

Returns a geometry representing an intersection of two geometries; that is, the section that is shared between the two input geometries. Performed by the GEOS module.

The output is currently limited to MULTIPOLYGON type, because HEAVY.AI does not support mixed geometry types within a geometry column, and ST_INTERSECTION can potentially return points, lines, and polygons from a single intersection operation. Lower-dimension intersecting features such as points and line strings are returned as very small buffers around those features. If needed, true points can be recovered by applying the ST_CENTROID method to point intersection results. In addition, ST_PERIMETER/2 of resulting line intersection polygons can be used to approximate line length. Empty/NULL geometry outputs are not currently supported.

Examples: SELECT ST_Intersection('POLYGON((0 0,3 0,3 3,0 3))', 'POLYGON((1 1,4 1,4 4,1 4))'); SELECT ST_Area(ST_Intersection(poly, 'POLYGON((1 1,3 1,3 3,1 3,1 1))')) FROM tbl;

ST_DIFFERENCE

Returns a geometry representing the portion of the first input geometry that does not intersect with the second input geometry. Performed by the GEOS module. Input order is important; the return geometry is always a section of the first input geometry.

The output is currently limited to MULTIPOLYGON type, for the same reasons described in ST_INTERSECTION. Similar post-processing methods can be applied if needed. Empty/NULL geometry outputs are not currently supported.

Examples: SELECT ST_Difference('POLYGON((0 0,3 0,3 3,0 3))', 'POLYGON((1 1,4 1,4 4,1 4))'); SELECT ST_Area(ST_Difference(poly, 'POLYGON((1 1,3 1,3 3,1 3,1 1))')) FROM tbl;

ST_UNION

Returns a geometry representing the union (or combination) of the two input geometries. Performed by the GEOS module.

The output is currently limited to MULTIPOLYGON type for the same reasons described in ST_INTERSECTION. Similar post-processing methods can be applied if needed. Empty/NULL geometry outputs are not currently supported.

Examples: SELECT ST_UNION('POLYGON((0 0,3 0,3 3,0 3))', 'POLYGON((1 1,4 1,4 4,1 4))'); SELECT ST_AREA(ST_UNION(poly, 'POLYGON((1 1,3 1,3 3,1 3,1 1))')) FROM tbl;

Spatial Relationships and Measurements

Function

Description

ST_DISTANCE

Returns shortest planar distance between geometries. For example: ST_DISTANCE(poly1, ST_GeomFromText('POINT(0 0)')) Returns shortest geodesic distance between two points, in meters, if given two point geographies. Point geographies can be specified through casts from point geometries or as literals. For example: ST_DISTANCE( CastToGeography(p2), ST_GeogFromText('POINT(2.5559 49.0083)', 4326) )

SELECT a.name, ST_DISTANCE( CAST(a.pt AS GEOGRAPHY), CAST(b.pt AS GEOGRAPHY) ) AS dist_meters FROM starting_point a, destination_points b;

You can also calculate the distance between a POLYGON and a POINT. If both fields use SRID 4326, then the calculated distance is in 4326 units (degrees). If both fields use SRID 4326, and both are transformed into 900913, then the results are in 900913 units (meters).

The following SQL code returns the names of polygons where the distance between the point and polygon is less than 1,000 meters.

SELECT a.poly_name FROM poly a, point b WHERE ST_DISTANCE( ST_TRANSFORM(b.location,900913), ST_TRANSFORM(a.heavyai_geo,900913) ) < 1000;

ST_EQUALS

Returns TRUE if the first input geometry and the second input geometry are spatially equal; that is, they occupy the same space. Different orderings of points can be accepted as equal if they represent the same geometry structure.

POINTs comparison is performed natively. All other geometry comparisons are performed by GEOS.

If input geometries are both uncompressed or compressed, all comparisons to identify equality are precise. For mixed combinations, the comparisons are performed with a compression-specific tolerance that allows recognition of equality despite subtle precision losses that the compression may introduce. Note: Geo columns and literals with SRID=4326 are compressed by default.

Examples: SELECT COUNT(*) FROM tbl WHERE ST_EQUALS('POINT(2 2)', pt); SELECT ST_EQUALS('POLYGON ((0 0,1 0,0 1))', 'POLYGON ((0 0,0 0.5,0 1,1 0,0 0))');

ST_MAXDISTANCE

Returns the longest planar distance between geometries; in effect, the diameter of a circle that encloses both geometries. Only certain combinations of argument types are currently supported.

ST_CONTAINS

Returns true if the first geometry object contains the second object. For example, you can use ST_CONTAINS to:

  • Return the count of polys that contain the point (here as WKT): SELECT count(*) FROM geo1 WHERE ST_CONTAINS(poly1, 'POINT(0 0)');

  • Return names from a polys table that contain points in a points table: SELECT a.name FROM polys a, points b WHERE ST_CONTAINS(a.heavyai_geo, b.location);

  • Return names from a polys table that contain points in a points table, using a single point in WKT instead of a field in another table: SELECT name FROM poly WHERE ST_CONTAINS( heavyai_geo, ST_GeomFromText('POINT(-98.4886935 29.4260508)', 4326) );

ST_INTERSECTS

Returns true if two geometries intersect spatially, false if they do not share space. For example:

SELECT ST_INTERSECTS( 'POLYGON((0 0, 2 0, 2 2, 0 2, 0 0))', 'POINT(1 1)' ) FROM tbl;

ST_AREA

Returns the area of planar areas covered by POLYGON and MULTIPOLYGON geometries. For example:

SELECT ST_AREA( 'POLYGON((1 0, 0 1, -1 0, 0 -1, 1 0),(0.1 0, 0 0.1, -0.1 0, 0 -0.1, 0.1 0))' ) FROM tbl;

ST_AREA does not support calculation of geographic areas, but rather uses planar coordinates. Geographies must first be projected in order to use ST_AREA. You can do this ahead of time before import or at runtime, ideally using an equal area projection (for example, a national equal-area Lambert projection). The area is calculated in the projection's units. For example, you might use Web Mercator runtime projection to get the area of a polygon in square meters:

ST_AREA( ST_TRANSFORM( ST_GeomFromText( 'POLYGON((-76.6168198439371 39.9703199555959, -80.5189990254673 40.6493554919257, -82.5189990254673 42.6493554919257, -76.6168198439371 39.9703199555959) )', 4326 ), 900913) )


Web Mercator is not an equal area projection, however. Unless compensated by a scaling factor, Web Mercator areas can vary considerably by latitude.

ST_PERIMETER

Returns the cartesian perimeter of POLYGON and MULTIPOLYGON geometries. For example: SELECT ST_PERIMETER('POLYGON( (1 0, 0 1, -1 0, 0 -1, 1 0), (0.1 0, 0 0.1, -0.1 0, 0 -0.1, 0.1 0) )' ) from tbl; It also returns the geodesic perimeter of POLYGON and MULTIPOLYGON geographies. For example:

SELECT ST_PERIMETER( ST_GeogFromText( 'POLYGON( (-76.6168198439371 39.9703199555959, -80.5189990254673 40.6493554919257, -82.5189990254673 42.6493554919257, -76.6168198439371 39.9703199555959) )', 4326) ) from tbl;

ST_LENGTH

Returns the cartesian length of LINESTRING geometries. For example: SELECT ST_LENGTH('LINESTRING(1 0, 0 1, -1 0, 0 -1, 1 0)') FROM tbl; It also returns the geodesic length of LINESTRING geographies. For example:

SELECT ST_LENGTH( ST_GeogFromText('LINESTRING( -76.6168198439371 39.9703199555959, -80.5189990254673 40.6493554919257, -82.5189990254673 42.6493554919257)', 4326) ) FROM tbl;

ST_WITHIN

Returns true if geometry A is completely within geometry B. For example the following SELECT statement returns true:

SELECT ST_WITHIN( 'POLYGON ((1 1, 1 2, 2 2, 2 1))', 'POLYGON ((0 0, 0 3, 3 3, 3 0))' ) FROM tbl;

ST_DWITHIN

Returns true if the geometries are within the specified distance of one another. Distance is specified in units defined by the spatial reference system of the geometries. For example: SELECT ST_DWITHIN( 'POINT(1 1)', 'LINESTRING (1 2,10 10,3 3)', 2.0 ) FROM tbl; ST_DWITHIN supports geodesic distances between geographies, currently limited to geographic points. For example, you can check whether Los Angeles and Paris, specified as WGS84 geographic point literals, are within 10,000km of one another.

SELECT ST_DWITHIN(

ST_GeogFromText( 'POINT(-118.4079 33.9434)', 4326), ST_GeogFromText('POINT(2.5559 49.0083)', 4326 ), 10000000.0) FROM tbl;

ST_DFULLYWITHIN

Returns true if the geometries are fully within the specified distance of one another. Distance is specified in units defined by the spatial reference system of the geometries. For example: SELECT ST_DFULLYWITHIN( 'POINT(1 1)', 'LINESTRING (1 2,10 10,3 3)', 10.0) FROM tbl; This function supports:

ST_DFULLYWITHIN(POINT, LINESTRING, distance) ST_DFULLYWITHIN(LINESTRING, POINT, distance)

ST_DISJOINT

Returns true if the geometries are spatially disjoint (that is, the geometries do not overlap or touch). For example:

SELECT ST_DISJOINT( 'POINT(1 1)', 'LINESTRING (0 0,3 3)' ) FROM tbl;

Additional Geo Notes

  • You can use SQL code similar to the examples in this topic as global filters in Immerse.

  • CREATE TABLE AS SELECT is not currently supported for geo data types in distributed mode.

  • GROUP BY is not supported for geo types (POINT, MULTIPOINT, LINESTRING, MULTILINESTRING, POLYGON, or MULTIPOLYGON).

  • You can use \d table_name to determine if the SRID is set for the geo field:

    heavysql> \d starting_point
    CREATE TABLE starting_point (
                                   name TEXT ENCODING DICT(32),
                                   myPoint GEOMETRY(POINT, 4326) ENCODING COMPRESSED(32)
                                 )

    If no SRID is returned, you can set the SRID using ST_SETSRID(column_name, SRID). For example, ST_SETSRID(myPoint, 4326).

Arrays

HEAVY.AI supports arrays in dictionary-encoded text and number fields (TINYINT, SMALLINT, INTEGER, BIGINT, FLOAT, and DOUBLE). Data stored in arrays is not normalized; for example, {green,yellow} is not the same as {yellow,green}. As with many SQL-based services, HEAVY.AI array indexes are 1-based.

HEAVY.AI supports NULL variable-length arrays for all integer and floating-point data types, including dictionary-encoded string arrays. For example, you can insert NULL into BIGINT[ ], DOUBLE[ ], or TEXT[ ] columns. HEAVY.AI supports NULL fixed-length arrays for all integer and floating-point data types, but not for dictionary-encoded string arrays. For example, you can insert NULL into BIGINT[2] or DOUBLE[3] columns, but not into TEXT[2] columns.

Expression
Description

ArrayCol[n] ...

Returns value(s) from specific location n in the array.

UNNEST(ArrayCol)

Extract the values in the array to a set of rows. Requires GROUP BY; projecting UNNEST is not currently supported.

test = ANY ArrayCol

ANY compares a scalar value with a single row or set of values in an array, returning results in which at least one item in the array matches. ANY must be preceded by a comparison operator.

test = ALL ArrayCol

ALL compares a scalar value with a single row or set of values in an array, returning results in which all records in the array field are compared to the scalar value. ALL must be preceded by a comparison operator.

CARDINALITY()

Returns the number of elements in an array. For example:

Examples

The following examples show query results based on the table test_array created with the following statement:

CREATE TABLE test_array (name TEXT ENCODING DICT(32),colors TEXT[] ENCODING DICT(32), qty INT[]);
omnisql> SELECT * FROM test_array;
name|colors|qty
Banana|{green, yellow}|{1, 2}
Cherry|{red, black}|{1, 1}
Olive|{green, black}|{1, 0}
Onion|{red, white}|{1, 1}
Pepper|{red, green, yellow}|{1, 2, 3}
Radish|{red, white}|{}
Rutabaga|NULL|{}
Zucchini|{green, yellow}|{NULL}
omnisql> SELECT UNNEST(colors) AS c FROM test_array;
Exception: UNNEST not supported in the projection list yet.
omnisql> SELECT UNNEST(colors) AS c, count(*) FROM test_array group by c;
c|EXPR$1
green|4
yellow|3
red|4
black|2
white|2
omnisql> SELECT name, colors [2] FROM test_array;
name|EXPR$1
Banana|yellow
Cherry|black
Olive|black
Onion|white
Pepper|green
Radish|white
Rutabaga|NULL
Zucchini|yellow
omnisql> SELECT name, colors FROM test_array WHERE colors[1]='green';
name|colors
Banana|{green, yellow}
Olive|{green, black}
Zucchini|{green, yellow}
omnisql> SELECT * FROM test_array WHERE colors IS NULL;
name|colors|qty
Rutabaga|NULL|{}

The following queries use arrays in an INTEGER field:

omnisql> SELECT name, qty FROM test_array WHERE qty[2] >1;
name|qty
Banana|{1, 2}
Pepper|{1, 2, 3}
omnisql> SELECT name, qty FROM test_array WHERE 15< ALL qty;
No rows returned.
omnisql> SELECT name, qty FROM test_array WHERE 2 = ANY qty;
name|qty
Banana|{1, 2}
Pepper|{1, 2, 3}
omnisql> SELECT COUNT(*) FROM test_array WHERE qty IS NOT NULL;
EXPR$0
8
omnisql> SELECT COUNT(*) FROM test_array WHERE CARDINALITY(qty)>0;
EXPR$0
6

Table Expression and Join Support

<table> , <table> WHERE <column> = <column>
<table> [ LEFT ] JOIN <table> ON <column> = <column>

If a join column name or alias is not unique, it must be prefixed by its table name.
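
For example, a minimal sketch assuming hypothetical orders and customers tables that both contain an id column, so the column must be prefixed with its table name or alias:

SELECT c.id, o.amount
FROM orders o
LEFT JOIN customers c ON c.id = o.customer_id;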

Geospatial Joins

When possible, joins involving a geospatial operator (such as ST_Contains) build a binned spatial hash table (overlaps hash join), falling back to a Cartesian loop join if a spatial hash join cannot be constructed.

The enable-overlaps-hashjoin flag controls whether the system attempts to use the overlaps spatial join strategy (true by default). If enable-overlaps-hashjoin is set to false, or if the system cannot build an overlaps hash join table for a geospatial join operator, the system attempts to fall back to a loop join. Loop joins can be performant in situations where one or both join tables have a small number of rows. When both tables grow large, loop join performance decreases.

Two flags control whether or not the system allows loop joins for a query (geospatial or not): allow-loop-joins and trivial-loop-join-threshold. By default, allow-loop-joins is set to false and trivial-loop-join-threshold to 1,000 (rows). If allow-loop-joins is set to true, the system allows any query with a loop join, regardless of table cardinalities (measured in number of rows). If left to the implicit default of false or set explicitly to false, the system allows loop join queries as long as the inner table (right-side table) has fewer rows than the threshold specified by trivial-loop-join-threshold.

For optimal performance, the system should utilize overlaps hash joins whenever possible. Use the following guidelines to maximize the use of the overlaps hash join framework and minimize fallback to loop joins when conducting geospatial joins:

  • The inner (right-side) table should always be the more complicated primitive. For example, for ST_Contains(polygon, point), the point table should be the outer (left) table and the polygon table should be the inner (right) table.

  • Currently, ST_CONTAINS and ST_INTERSECTS joins between point and polygons/multi-polygon tables, and ST_DISTANCE < {distance} between two point tables are supported for accelerated overlaps hash join queries.

  • For pointwise-distance joins, only the pattern WHERE ST_DISTANCE(table_a.point_col, table_b.point_col) < distance_in_degrees supports overlaps hash joins. Patterns like the following fall back to loop joins:

    • WHERE ST_DWITHIN(table_a.point_col, table_b.point_col, distance_in_degrees)

    • WHERE ST_DISTANCE(ST_TRANSFORM(table_a.point_col, 900913), ST_TRANSFORM(table_b.point_col, 900913)) < 100
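
As a sketch of the accelerated pattern described in the guidelines above (assuming hypothetical points and polys tables with SRID 4326 geo columns), the point table is the outer (left) table and the polygon table is the inner (right) table:

SELECT COUNT(*)
FROM points a, polys b
WHERE ST_Contains(b.poly_geom, a.pt_geom);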

Using Joins in a Distributed Environment

You can create joins in a distributed environment in two ways:

  • Replicate small dimension tables that are used in the join.

  • Create a shard key on the column used in the join (note that there is a limit of one shard key per table). If the column involved in the join is a TEXT ENCODED field, you must create a SHARED DICTIONARY that references the FACT table key you are using to make the join.

-- Table customers is very small
CREATE TABLE sales (
id INTEGER,
customerid TEXT ENCODING DICT(32),
saledate DATE ENCODING DAYS(32),
saleamt DOUBLE);

CREATE TABLE customers (
id TEXT ENCODING DICT(32),
someid INTEGER,
name TEXT ENCODING DICT(32))
WITH (partitions = 'replicated'); -- Replicates the entire contents of this table to each leaf node. Recommended only for small dimension tables.

SELECT c.id, c.name FROM sales s INNER JOIN customers c ON c.id = s.customerid LIMIT 10;
CREATE TABLE sales (
id INTEGER,
customerid BIGINT, -- Numeric data type, so a shared dictionary on the customers table is not needed
saledate DATE ENCODING DAYS(32),
saleamt DOUBLE,
SHARD KEY (customerid))
WITH (SHARD_COUNT = <num gpus in cluster>);

CREATE TABLE customers (
id BIGINT,
someid INTEGER,
name TEXT ENCODING DICT(32),
SHARD KEY (id))
WITH (SHARD_COUNT = <num gpus in cluster>);

SELECT c.id, c.name FROM sales s INNER JOIN customers c ON c.id = s.customerid LIMIT 10;
CREATE TABLE sales (
id INTEGER,
customerid TEXT ENCODING DICT(32),
saledate DATE ENCODING DAYS(32),
saleamt DOUBLE,
SHARD KEY (customerid))
WITH (SHARD_COUNT = <num gpus in cluster>);

-- Note the difference when customerid is a text-encoded field:

CREATE TABLE customers (
id TEXT,
someid INTEGER,
name TEXT ENCODING DICT(32),
SHARD KEY (id),
SHARED DICTIONARY (id) REFERENCES sales(customerid))
WITH (SHARD_COUNT = <num gpus in cluster>);

SELECT c.id, c.name FROM sales s INNER JOIN customers c ON c.id = s.customerid LIMIT 10;

The join order for one small table and one large table matters. If you swap the sales and customer tables on the join, it throws an exception stating that table "sales" must be replicated.

System Table Functions

To improve performance, table functions can be declared to enable filter pushdown optimization, which allows the Calcite optimizer to "push down" filters on the output(s) of a table function to its input(s) when the inputs and outputs are declared to be semantically equivalent (for example, a longitude variable that is input to and output from a table function). This can significantly increase performance in cases where only a small portion of one or more input tables is required to compute the filtered output of a table function.

Whether system- or user-provided, table functions can execute over one or more result sets specified by subqueries, and can also take any number of additional constant literal arguments specified in the function definition. SQL subquery inputs can consist of any SQL expression (including multiple subqueries, joins, and so on) allowed by HeavyDB, and the output can be filtered, grouped by, joined, and so on like a normal SQL subquery, including being input into additional table functions by wrapping it in a CURSOR argument. The number and types of input arguments, as well as the number and types of output arguments, are specified in the table function definition itself.

Table functions allow for the efficient execution of advanced algorithms that may be difficult or impossible to express in canonical SQL. By allowing execution of code directly over SQL result sets, leveraging the same hardware parallelism used for fast SQL execution and visualization rendering, HEAVY.AI provides orders-of-magnitude speed increases over the alternative of transporting large result sets to other systems for post-processing and then returning to HEAVY.AI for storage or downstream manipulation. You can easily invoke system-provided or user-defined algorithms directly inline with SQL and rendering calls, making prototyping and deployment of advanced analytics capabilities easier and more streamlined.

Concepts

CURSOR Subquery Inputs

Table functions can take as input arguments both constant literals (including scalar results of subqueries) as well as results of other SQL queries (consisting of one or more rows). The latter (SQL query inputs), per the SQL standard, must be wrapped in the keyword CURSOR. Depending on the table function, there can be 0, 1, or multiple CURSOR inputs. For example:

SELECT * FROM TABLE(my_table_function /* This is only an example! */ (
 CURSOR(SELECT arg1, arg2, arg3 FROM input_1 WHERE x > 10) /* First CURSOR
 argument consisting of 3 columns */,
 CURSOR(SELECT arg1, AVG(arg2) FROM input_2 WHERE y < 40 GROUP BY arg1)
 /* Second CURSOR argument consisting of 2 columns. This could be from the same
 table as the first CURSOR, or as is the case here, a completely different table
 (or even joined table or logical value expression) */,
 'Fred' /* TEXT constant literal argument */,
 true /* BOOLEAN constant literal argument */,
 (SELECT COUNT(*) FROM another_table) /* scalar subquery results do not need
 to be wrapped in a CURSOR */,
 27.3 /* FLOAT constant literal argument */))
WHERE output1 BETWEEN 32.2 AND 81.8;

ColumnList Inputs

Certain table functions can take 1 or more columns of a specified type or types as inputs, denoted as ColumnList<TYPE1 | Type2... TypeN>. Even if a function allows a ColumnList input of multiple types, the arguments must all be of one type; types cannot be mixed. For example, if a function allows ColumnList<INT | TEXT ENCODING DICT>, one or more columns of either INTEGER or TEXT ENCODING DICT can be used as inputs, but all must be either INT columns or TEXT ENCODING DICT columns.
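
For example, a minimal sketch assuming a hypothetical table function my_columnlist_function whose first argument is declared as ColumnList<INT | TEXT ENCODING DICT>:

-- Valid: all columns in the ColumnList argument are INTEGER
SELECT * FROM TABLE(my_columnlist_function(CURSOR(SELECT int_col1, int_col2 FROM t)));

-- Not valid: INTEGER and TEXT ENCODING DICT columns cannot be mixed in one ColumnList argument
SELECT * FROM TABLE(my_columnlist_function(CURSOR(SELECT int_col1, text_col FROM t)));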

Named Arguments

All HEAVY.AI system table functions allow you to specify arguments either in conventional comma-separated form in the order specified by the table function signature, or via a key-value map where input argument names are mapped to argument values using the => token. For example, the following two calls are equivalent:

/* The following two table function calls, the first with unnamed
 signature-ordered arguments, and the second with named arguments,
 are equivalent */

select
  *
from
  table(
    tf_compute_dwell_times(
      /* Without the use of named arguments, input arguments must
      be ordered as specified by the table function signature */
      cursor(
        select
          user_id,
          movie_id,
          ts
        from
          netflix_audience_behavior
      ),
      3,
      600,
      10800
    )
  )
order by
  num_dwell_points desc
limit
  10;


select
  *
from
  table(
    tf_compute_dwell_times(
     /* Using named arguments, input arguments can be
     ordered in any order, as long as all arguments are named */
      min_dwell_seconds => 600,
      max_inactive_seconds => 10800,
      data => cursor(
        select
          user_id,
          movie_id,
          ts
        from
          netflix_audience_behavior
      ),
      min_dwell_points => 3
    )
  )
order by
  num_dwell_points desc
limit
  10;

Filter Push-Down

For performance reasons, particularly when table functions are used as actual tables in a client like Heavy Immerse, many system table functions in HEAVY.AI automatically "push down" filters on certain output columns in the query onto the inputs. For example, if a table function does some computation over an x and y range such that x and y are in both the input and output of the table function, filter push-down would likely be enabled so that a query like the following automatically pushes down the filter on the x and y outputs to the x and y inputs. This potentially increases query performance significantly.

SELECT
  *
FROM
  TABLE(
    my_spatial_table_function(
      CURSOR(
        SELECT
          x,
          y
        from
          spatial_data_table
          /* Presuming filter push down is enabled for 
          my_spatial_table_function, the filter applied to 
          x and y will be applied here to the table function
          input CURSOR */
      )
    )
  )
WHERE
  x BETWEEN 38.2
  AND 39.1
  and Y BETWEEN -121.4
  and -120.1;

To determine whether filter push-down is used, you can check the Boolean value of the filter_table_transpose column from the query:

SHOW TABLE FUNCTIONS DETAILS <table_function_name>;

Currently for system table functions, you cannot change push-down behavior.

Querying Registered Table Functions

You can query which table functions are available using SHOW TABLE FUNCTIONS:

SHOW TABLE FUNCTIONS;

Table UDF

tf_feature_similarity
tf_feature_self_similarity
tf_geo_rasterize_slope
...

Query Metadata for a Specific Table Function

Information about the expected input and output argument names and types, as well as other info such as whether the function can run on CPU, GPU or both, and whether filter push-down is enabled, can be queried via SHOW TABLE FUNCTIONS DETAILS <table_function_name>;

SHOW TABLE FUNCTIONS DETAILS <table_function_name>;

name|signature|input_names|input_types|output_names|output_types|CPU|GPU|Runtime|filter_table_transpose
generate_series|(i64 series_start, i64 series_stop, i64 series_step) -> Column<i64>|[series_start, series_stop, series_step]|[i64, i64, i64]|[generate_series]|[Column<i64>]|true|false|false|false
generate_series|(i64 series_start, i64 series_stop) -> Column<i64>|[series_start, series_stop]|[i64, i64]|[generate_series]|[Column<i64>]|true|false|false|false

System Table Functions

The following system table functions are available in HEAVY.AI. The table provides a summary and links to more information about each function.

Function
Purpose

Generates random string data.

Generates a series of integer values.

Generates a series of timestamp values from start_timestamp to end_timestamp.

Given a query input with entity keys and timestamps, and parameters specifying the minimum session time, the minimum number of session records, and the max inactive seconds, outputs all unique sessions found in the data with the duration of the session.

Given a query input of entity keys/IDs, a set of feature columns, and a metric column, scores each pair of entities based on their similarity. The score is computed as the cosine similarity of the feature column(s) between each entity pair, which can optionally be TF/IDF weighted.

Given a query input of entity keys, feature columns, and a metric column, and a second query input specifying a search vector of feature columns and metric, computes the similarity of each entity in the first input to the search vector based on their similarity. The score is computed as the cosine similarity of the feature column(s) for each entity with the feature column(s) for the search vector, which can optionally be TF/IDF weighted.

Aggregate point data into x/y bins of a given size in meters to form a dense spatial grid, taking the maximum z value across all points in each bin as the output value for the bin. The aggregate performed to compute the value for each bin is specified by agg_type, with allowed aggregate types of AVG, COUNT, SUM, MIN, and MAX.

Similar to tf_geo_rasterize, but also computes the slope and aspect per output bin. Aggregates point data into x/y bins of a given size in meters to form a dense spatial grid, computing the specified aggregate (using agg_type) across all points in each bin as the output value for the bin.

Given a distance-weighted directed graph, specified as a query CURSOR input consisting of the starting and ending node for each edge and a distance, and a specified origin and destination node, computes the shortest distance-weighted path through the graph between origin_node and destination_node.

Given a distance-weighted directed graph, specified as a query CURSOR input consisting of the starting and ending node for each edge and a distance, and a specified origin node, computes the shortest distance-weighted path distance between origin_node and every other node in the graph.

Loads one or more las or laz point cloud/LiDAR files from a local file or directory source, optionally transforming the output SRID to out_srs. If not specified, output points are automatically transformed to EPSG:4326 lon/lat pairs.

Returns metadata for one or more las or laz point cloud/LiDAR files from a local file or directory source, optionally constraining the bounding box for metadata retrieved to the lon/lat bounding box specified by the x_min, x_max, y_min, y_max arguments.

Process a raster input to derive contour lines or regions and output as LINESTRING or POLYGON for rendering or further processing.

Aggregate point data into x/y bins of a given size in meters to form a dense spatial grid, computing the specified aggregate (using agg_type) across all points in each bin as the output value for the bin.

Used for generating top-k signals where 'k' represents the maximum number of antennas to consider at each geographic location. The full relevant parameter name is strongest_k_sources_per_terrain_bin.

Taking a set of point elevations and a set of signal source locations as input, tf_rf_prop_max_signal executes line-of-sight 2.5D RF signal propagation from the provided sources over a binned 2.5D elevation grid derived from the provided point locations, calculating the max signal in dBm at each grid cell, using the formula for free-space power loss.

The TABLE command is required to wrap a table function clause; for example: select * from TABLE(generate_series(1, 10));

The CURSOR command is required to wrap any subquery inputs.

Functions and Operators

Functions and Operators (DML)

Basic Mathematical Operators

Operator

Description

+numeric

Returns numeric

-numeric

Returns negative value of numeric

numeric1 + numeric2

Sum of numeric1 and numeric2

numeric1 - numeric2

Difference of numeric1 and numeric2

numeric1 * numeric2

Product of numeric1 and numeric2

numeric1 / numeric2

Quotient (numeric1 divided by numeric2)

Mathematical Operator Precedence

  1. Parenthesization

  2. Multiplication and division

  3. Addition and subtraction
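For example, assuming a hypothetical single-row table t:

SELECT 2 + 3 * 4 FROM t;    /* multiplication is evaluated first: returns 14 */
SELECT (2 + 3) * 4 FROM t;  /* parentheses are evaluated first: returns 20 */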

Comparison Operators

Operator

Description

=

Equals

<>

Not equals

>

Greater than

>=

Greater than or equal to

<

Less than

<=

Less than or equal to

BETWEEN x AND y

Is a value within a range

NOT BETWEEN x AND y

Is a value not within a range

IS NULL

Is a value that is null

IS NOT NULL

Is a value that is not null

NULLIF(x, y)

Compare expressions x and y. If different, return x. If they are the same, return null. For example, if a dataset uses ‘NA’ for null values, you can use this statement to return null using SELECT NULLIF(field_name,'NA').

IS TRUE

True if a value resolves to TRUE.

IS NOT TRUE

True if a value resolves to FALSE.

Mathematical Functions

Function

Description

ABS(x)

Returns the absolute value of x

CEIL(x)

Returns the smallest integer not less than the argument

DEGREES(x)

Converts radians to degrees

EXP(x)

Returns the value of e to the power of x

FLOOR(x)

Returns the largest integer not greater than the argument

LN(x)

Returns the natural logarithm of x

LOG(x)

Returns the natural logarithm of x

LOG10(x)

Returns the base-10 logarithm of the specified float expression x

MOD(x,y)

Returns the remainder of int x divided by int y

PI()

Returns the value of pi

POWER(x,y)

Returns the value of x raised to the power of y

RADIANS(x)

Converts degrees to radians

ROUND(x)

Rounds x to the nearest integer value, but does not change the data type. For example, the double value 4.1 rounds to the double value 4.

ROUND_TO_DIGIT(x,y)

Rounds x to y decimal places

SIGN(x)

Returns the sign of x as -1, 0, 1 if x is negative, zero, or positive

SQRT(x)

Returns the square root of x.

TRUNCATE(x,y)

Truncates x to y decimal places

WIDTH_BUCKET(target,lower-boundary,upper-boundary,bucket-count)

Defines equal-width intervals (buckets) in a range between the lower boundary and the upper boundary, and returns the bucket number to which the target expression is assigned.

  • target - A constant, column variable, or general expression for which a bucket number is returned.

  • lower-boundary - Lower boundary for the range of values to be partitioned equally.

  • upper-boundary - Upper boundary for the range of values to be partitioned equally.

  • bucket-count - Number of equal-width buckets in the range defined by the lower and upper boundaries.

Expressions can be constants, column variables, or general expressions.

Example: Create 10 age buckets of equal size, with lower bound 0 and upper bound 100 ([0,10], [10,20]... [90,100]), and classify the age of a customer accordingly:

SELECT WIDTH_BUCKET(age, 0, 100, 10) FROM customer;

For example, a customer of age 34 is assigned to bucket 3 ([30,40]) and the function returns the value 3.

Trigonometric Functions

Function

Description

ACOS(x)

Returns the arc cosine of x

ASIN(x)

Returns the arc sine of x

ATAN(x)

Returns the arc tangent of x

ATAN2(y,x)

Returns the arc tangent of (x, y) in the range (-π,π]. Equal to ATAN(y/x) for x > 0.

COS(x)

Returns the cosine of x

COT(x)

Returns the cotangent of x

SIN(x)

Returns the sine of x

TAN(x)

Returns the tangent of x

Geometric Functions

Function

Description

DISTANCE_IN_METERS(fromLon, fromLat, toLon, toLat)

Calculates distance in meters between two WGS84 positions.

CONV_4326_900913_X(x)

Converts WGS84 longitude to WGS84 Web Mercator x coordinate.

CONV_4326_900913_Y(y)

Converts WGS84 latitude to WGS84 Web Mercator y coordinate.
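Example (assuming hypothetical tables routes and points with longitude/latitude columns):

/* Distance in meters between hypothetical origin and destination lon/lat columns */
SELECT DISTANCE_IN_METERS(origin_lon, origin_lat, dest_lon, dest_lat) AS meters
FROM routes;

/* Project lon/lat to Web Mercator coordinates for rendering */
SELECT CONV_4326_900913_X(lon) AS merc_x, CONV_4326_900913_Y(lat) AS merc_y
FROM points;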

String Functions

Function

Description

BASE64_DECODE(str)

Decodes a BASE64-encoded string.

BASE64_ENCODE(str)

Encodes a string to a BASE64-encoded string.

CHAR_LENGTH(str)

Returns the number of characters in a string. Only works with unencoded fields (ENCODING set to none).

str1 || str2 [ || str3... ]

Returns the string that results from concatenating the strings specified. Numeric, date, timestamp, and time types are implicitly cast to strings as necessary, so explicit casts of non-string types to string types are not required for inputs to the concatenation operator. Note that concatenating a variable string with a string literal, i.e. county_name || ' County', is significantly more performant than concatenating two or more variable strings, i.e. county_name || ', ' || state_name. Hence, for multi-variable string concatenation, it is recommended to use an update statement to materialize the concatenated output rather than performing it inline when such operations are expected to be routinely repeated.

ENCODE_TEXT(none_encoded_str)

Converts a none-encoded string to a transient dictionary-encoded string to allow for operations like group-by on top. When the watchdog is enabled, the number of strings that can be cast using this operator is capped by the value set with the watchdog-none-encoded-string-translation-limit flag (1,000,000 by default).

HASH(str)

Deterministically hashes a string input to a BIGINT output using a pseudo-random function. Can be useful for bucketing string values or deterministically coloring by string values for a high-cardinality TEXT column. Note that currently HASH only accepts TEXT inputs, but in the future may also accept other data types. Note also that NULL values always hash to NULL outputs.

INITCAP(str)

Returns the string with initial caps after any of the defined delimiter characters, with the remainder of the characters lowercased. Valid delimiter characters are !, ?, @, ", ^, #, $, &, ~, _, ,, ., :, ;, +, -, *, %, /, |, \, [, ], (, ), {, }, <, >.

JAROWINKLER_SIMILARITY( str1, str2 )

Computes the Jaro-Winkler similarity score between two input strings. The output will be an integer between 0 and 100, with 0 representing completely dissimilar strings, and 100 representing exactly matching strings.

JSON_VALUE(json_str, path)

Returns the string of a field given by path in str. Paths start with the $ character, with sub-fields split by . and array members indexed by [], with array indices starting at 0. For example, JSON_VALUE('{"name": "Brenda", "scores": [89, 98, 94]}', '$.scores[1]') would yield a TEXT return field of '98'. Note that currently LAX parsing mode (any unmatched path returns null rather than errors) is the default, and STRICT parsing mode is not supported.

KEY_FOR_STRING(str)

Returns the dictionary key of a dictionary-encoded string column.

LCASE(str)

Returns the string in all lower case. Only ASCII character set is currently supported. Same as LOWER.

LEFT(str, num)

Returns the left-most number (num) of characters in the string (str).

LENGTH(str)

Returns the length of a string in bytes. Only works with unencoded fields (ENCODING set to none).

LEVENSHTEIN_DISTANCE( str1, str2 )

Computes the edit distance, or number of single-character insertions, deletions, or substitutions, that must be made to make the first string equal the second. It returns an integer greater than or equal to 0, with 0 meaning the strings are equal. The higher the return value, the more the two strings can be thought of as dissimilar.
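Example (assuming a hypothetical table words with a TEXT column word); comparing 'sitting' to 'kitten' returns 3 (two substitutions and one insertion):

SELECT word, LEVENSHTEIN_DISTANCE(word, 'kitten') AS edit_dist FROM words;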

LOWER(str)

Returns the string in all lower case. Only ASCII character set is currently supported. Same as LCASE.

LPAD(str, len, [lpad_str ])

Left-pads the string with the string defined in lpad_str to a total length of len. If the optional lpad_str is not specified, the space character is used to pad. If the length of str is greater than len, then characters from the end of str are truncated to the length of len. Characters are added from lpad_str successively until the target length len is met. If lpad_str concatenated with str is not long enough to equal the target len, lpad_str is repeated, partially if necessary, until the target length is met.

LTRIM(str, chars)

Removes any leading characters specified in chars from the string. Alias for TRIM.

OVERLAY(str PLACING replacement_str FROM start [FOR len])

Replaces in str the number of characters defined in len with characters defined in replacement_str at the location start. Regardless of the length of replacement_str, len characters are removed from str unless start + replacement_str is greater than the length of str, in which case all characters from start to the end of str are replaced. If start is negative, it specifies the number of characters from the end of str.

POSITION ( search_str IN str [FROM start_position])

Returns the position of the first character in search_str if found in str, optionally starting the search at start_position. If search_str is not found, 0 is returned. If search_str or str are null, null is returned.

REGEXP_COUNT(str, pattern [, position, [flags]])

Returns the number of times that the provided pattern occurs in the search string str. position specifies the starting position in str for which the search for pattern will start (all matches before position are ignored). If position is negative, the search will start that many characters from the end of the string str. Use the following optional flags to control the matching behavior: c - Case-sensitive matching. i - Case-insensitive matching.

REGEXP_REPLACE(str, pattern [, new_str, position, occurrence, [flags]])

Replace one or all matches of a substring in string str that matches pattern , which is a regular expression in POSIX regex syntax.

new_str (optional) is the string that replaces the string matching the pattern. If new_str is empty or not supplied, all found matches are removed.

The occurrence integer argument (optional) specifies the single match occurrence of the pattern to replace, starting from the beginning of str; 0 (replace all) is the default. Use a negative occurrence argument to signify the nth-to-last occurrence to be replaced.

Use a positive position argument to indicate the number of characters from the beginning of str. Use a negative position argument to indicate the number of characters from the end of str.

Back-references/capture groups can be used to capture and replace specific sub-expressions.

Use the following optional flags to control the matching behavior: c - Case-sensitive matching. i - Case-insensitive matching.

If not specified, REGEXP_REPLACE defaults to case sensitive search.
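Example (assuming a hypothetical table contacts with a TEXT column phone); the default occurrence of 0 replaces every match:

SELECT REGEXP_REPLACE(phone, '[0-9]', '#') AS masked_phone FROM contacts;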

REGEXP_SUBSTR(str, pattern [, position, occurrence, flags, group_num])

Use position to set the character position to begin searching. Use occurrence to specify the occurrence of the pattern to match.

Use a positive position argument to indicate the number of characters from the beginning of str. Use a negative position argument to indicate the number of characters from the end of str.

The occurrence integer argument (optional) specifies the single match occurrence of the pattern to return, with 0 being mapped to the first (1) occurrence. Use a negative occurrence argument to signify that the nth-to-last occurrence of the pattern is returned.

Use optional flags to control the matching behavior: c - Case-sensitive matching.

e - Extract submatches. i - Case-insensitive matching.

The c and i flags cannot be used together; e can be used with either. If neither c nor i are specified, or if pattern is not provided, REGEXP_SUBSTR defaults to case-sensitive search.

If the e flag is used, REGEXP_SUBSTR returns the capture group group_num of pattern matched in str. If the e flag is used, but no capture groups are provided in pattern, REGEXP_SUBSTR returns the entire matching pattern, regardless of group_num. If the e flag is used but no group_num is provided, a value of 1 for group_num is assumed, so the first capture group is returned.
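Example (assuming a hypothetical table users with a TEXT column email); the e flag with group_num 1 returns the first capture group, here the domain portion of the address:

SELECT REGEXP_SUBSTR(email, '@([^@]+)', 1, 1, 'e', 1) AS domain FROM users;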

REPEAT(str, num)

Repeats the string the number of times defined in num.

REPLACE(str, from_str, new_str)

Replaces all occurrences of substring from_str within a string, with a new substring new_str.

REVERSE(str)

Reverses the string.

RIGHT(str, num)

Returns the right-most number (num) of characters in the string (str).

RPAD(str, len, rpad_str)

Right-pads the string with the string defined in rpad_str to a total length of len. If the optional rpad_str is not specified, the space character is used to pad. If the length of str is greater than len, then characters from the beginning of str are truncated to the length of len. Characters are added from rpad_str successively until the target length len is met. If rpad_str concatenated with str is not long enough to equal the target len, rpad_str is repeated, partially if necessary, until the target length is met.

RTRIM(str)

Removes any trailing spaces from the string.

SPLIT_PART(str, delim, field_num)

Split the string based on a delimiter delim and return the field identified by field_num. Fields are numbered from left to right.

STRTOK_TO_ARRAY(str, [delim])

Tokenizes the string str using optional delimiter(s) delim and returns an array of tokens. An empty array is returned if no tokens are produced in tokenization. NULL is returned if either parameter is a NULL.

SUBSTR(str, start, [len])

Alias for SUBSTRING.

SUBSTRING(str FROM start [ FOR len])

Returns a substring of str starting at index start for len characters.

The start position is 1-based (that is, the first character of str is at index 1, not 0). However, start 0 aliases to start 1.

If start is negative, it is considered to be |start| characters from the end of the string.

If len is not specified, then the substring from start to the end of str is returned.

If start + len is greater than the length of str, then the characters in str from start to the end of the string are returned.
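Example (assuming a hypothetical table airports with a TEXT column city); returns the first three characters of each city name (the start index is 1-based):

SELECT SUBSTRING(city FROM 1 FOR 3) AS city_prefix FROM airports;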

TRIM([BOTH | LEADING | TRAILING] [trim_str FROM str])

Removes characters defined in trim_str from the beginning, end, or both of str. If trim_str is not specified, the space character is the default. If the trim location is not specified, defined characters are trimmed from both the beginning and end of str.

TRY_CAST( str AS type)

Attempts to cast/convert a string type to any valid numeric, timestamp, date, or time type. If the conversion cannot be performed, null is returned. Note that TRY_CAST is not valid for non-string input types.
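Example (assuming a hypothetical staging table raw_orders with a TEXT column raw_price); rows that do not parse as DOUBLE yield NULL and are filtered out:

SELECT TRY_CAST(raw_price AS DOUBLE) AS price
FROM raw_orders
WHERE TRY_CAST(raw_price AS DOUBLE) IS NOT NULL;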

UCASE(str)

Returns the string in uppercase format. Only ASCII character set is currently supported. Same as UPPER.

UPPER(str)

Returns the string in uppercase format. Only ASCII character set is currently supported. Same as UCASE.

URL_DECODE( str )

Decode a url-encoded string. This is the inverse of the URL_ENCODE function.

URL_ENCODE( str )

Url-encode a string. Alphanumeric and the 4 characters: _-.~ are untranslated. The space character is translated to +. All other characters are translated into a 3-character sequence %XX where XX is the 2-digit hexadecimal ASCII value of the character.

Pattern-Matching Functions

Name

Example

Description

str LIKE pattern

'ab' LIKE 'ab'

Returns true if the string matches the pattern (case-sensitive)

str NOT LIKE pattern

'ab' NOT LIKE 'cd'

Returns true if the string does not match the pattern

str ILIKE pattern

'AB' ILIKE 'ab'

Returns true if the string matches the pattern (case-insensitive). Supported only when the right side is a string literal; for example, colors.name ILIKE 'b%'

str REGEXP POSIX pattern

'^[a-z]+r$'

Lowercase string ending with r

REGEXP_LIKE ( str , POSIX pattern )

'^[hc]at'

cat or hat

Usage Notes

The following wildcard characters are supported by LIKE and ILIKE:

  • % matches any number of characters, including zero characters.

  • _ matches exactly one character.
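For example, assuming a hypothetical table cities with a TEXT column name:

SELECT name FROM cities WHERE name LIKE 'San%';      /* matches 'San Jose', 'Santa Fe', ... */
SELECT name FROM cities WHERE name ILIKE 'san _ose'; /* '_' matches exactly one character */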

Date/Time Functions

Function

Description

CURRENT_DATE

CURRENT_DATE()

Returns the current date in the GMT time zone.

Example:

SELECT CURRENT_DATE();

CURRENT_TIME

CURRENT_TIME()

Returns the current time of day in the GMT time zone.

Example:

SELECT CURRENT_TIME();

CURRENT_TIMESTAMP

CURRENT_TIMESTAMP()

Return the current timestamp in the GMT time zone. Same as NOW().

Example:

SELECT CURRENT_TIMESTAMP();

DATEADD('date_part', interval, date | timestamp)

Returns a date after a specified time/date interval has been added.

Example:

SELECT DATEADD('MINUTE', 6000, dep_timestamp) Arrival_Estimate FROM flights_2008_10k LIMIT 10;

DATEDIFF('date_part', date, date)

Returns the difference between two dates, calculated to the lowest level of the date_part you specify. For example, if you set the date_part as DAY, only the year, month, and day are used to calculate the result. Other fields, such as hour and minute, are ignored.

Example:

SELECT DATEDIFF('YEAR', plane_issue_date, now()) Years_In_Service FROM flights_2008_10k LIMIT 10;

DATEPART('interval', date | timestamp)

Returns a specified part of a given date or timestamp as an integer value. Note that 'interval' must be enclosed in single quotes.

Example:

SELECT DATEPART('YEAR', plane_issue_date) Year_Issued FROM flights_2008_10k LIMIT 10;

DATE_TRUNC(date_part, timestamp)

Truncates the timestamp to the specified date_part. DATE_TRUNC(week,...) starts on Monday (ISO), which is different than EXTRACT(dow,...), which starts on Sunday.

Example:

SELECT DATE_TRUNC(MINUTE, arr_timestamp) Arrival FROM flights_2008_10k LIMIT 10;

EXTRACT(date_part FROM timestamp)

Returns the specified date_part from timestamp.

Example:

SELECT EXTRACT(HOUR FROM arr_timestamp) Arrival_Hour FROM flights_2008_10k LIMIT 10;

INTERVAL 'count' date_part

Adds or subtracts count date_part units from a timestamp. Note that 'count' is enclosed in single quotes.

Example:

SELECT arr_timestamp + INTERVAL '10' YEAR FROM flights_2008_10k LIMIT 10;

NOW()

Return the current timestamp in the GMT time zone. Same as CURRENT_TIMESTAMP().

Example:

NOW();

TIMESTAMPADD(date_part, count, timestamp | date)

Adds an interval of count date_part units to a timestamp or date and returns the resulting timestamp or date.

Example:

SELECT TIMESTAMPADD(DAY, 14, arr_timestamp) Fortnight FROM flights_2008_10k LIMIT 10;

TIMESTAMPDIFF(date_part, timestamp1, timestamp2)

Subtracts timestamp1 from timestamp2 and returns the result in signed date_part units.

Example:

SELECT TIMESTAMPDIFF(MINUTE, arr_timestamp, dep_timestamp) Flight_Time FROM flights_2008_10k LIMIT 10;

Supported Types

Supported date_part types:

DATE_TRUNC [YEAR, QUARTER, MONTH, DAY, HOUR, MINUTE, SECOND, MILLISECOND, 
            MICROSECOND, NANOSECOND, MILLENNIUM, CENTURY, DECADE, WEEK, 
            WEEK_SUNDAY, QUARTERDAY]
EXTRACT    [YEAR, QUARTER, MONTH, DAY, HOUR, MINUTE, SECOND, MILLISECOND, 
            MICROSECOND, NANOSECOND, DOW, ISODOW, DOY, EPOCH, QUARTERDAY, 
            WEEK, WEEK_SUNDAY, DATEEPOCH]
DATEDIFF   [YEAR, QUARTER, MONTH, DAY, HOUR, MINUTE, SECOND, MILLISECOND, 
            MICROSECOND, NANOSECOND, WEEK]

Supported interval types:

DATEADD       [DECADE, YEAR, QUARTER, MONTH, WEEK, WEEKDAY, DAY, 
               HOUR, MINUTE, SECOND, MILLISECOND, MICROSECOND, NANOSECOND]
TIMESTAMPADD  [YEAR, QUARTER, MONTH, WEEKDAY, DAY, HOUR, MINUTE,
               SECOND, MILLISECOND, MICROSECOND, NANOSECOND]
DATEPART      [YEAR, QUARTER, MONTH, DAYOFYEAR, QUARTERDAY, WEEKDAY, DAY, HOUR,
               MINUTE, SECOND, MILLISECOND, MICROSECOND, NANOSECOND]

Accepted Date, Time, and Timestamp Formats

Datatype

Formats

Examples

DATE

YYYY-MM-DD

2013-10-31

DATE

MM/DD/YYYY

10/31/2013

DATE

DD-MON-YY

31-Oct-13

DATE

DD/Mon/YYYY

31/Oct/2013

DATE

EPOCH

1383262225

TIME

HH:MM

23:49

TIME

HHMMSS

234901

TIME

HH:MM:SS

23:49:01

TIMESTAMP

DATE TIME

31-Oct-13 23:49:01

TIMESTAMP

DATETTIME

31-Oct-13T23:49:01

TIMESTAMP

DATE:TIME

11/31/2013:234901

TIMESTAMP

DATE TIME ZONE

31-Oct-13 11:30:25 -0800

TIMESTAMP

DATE HH.MM.SS PM

31-Oct-13 11.30.25pm

TIMESTAMP

DATE HH:MM:SS PM

31-Oct-13 11:30:25pm

TIMESTAMP

EPOCH

1383262225

Usage Notes

  • For two-digit years, years 69-99 are assumed to be previous century (for example, 1969), and 0-68 are assumed to be current century (for example, 2016).

  • For four-digit years, negative years (BC) are not supported.

  • Hours are expressed in 24-hour format.

  • When time components are separated by colons, you can write them as one or two digits.

  • Months are case insensitive. You can spell them out or abbreviate to three characters.

  • For timestamps, decimal seconds are ignored. Time zone offsets are written as +/-HHMM.

  • For timestamps, a numeric string is converted to +/- seconds since January 1, 1970. Supported timestamps range from -30610224000 (January 1, 1000) through 29379456000 (December 31, 2900).

  • On output, dates are formatted as YYYY-MM-DD. Times are formatted as HH:MM:SS.

  • Linux EPOCH values range from -30610224000 (1/1/1000) through 185542587100800 (1/1/5885487). Complete range in years: +/-5,883,517 around epoch.

Statistical and Aggregate Functions

Both double-precision (standard) and single-precision floating point statistical functions are provided. Single-precision functions run faster on GPUs but might cause overflow errors.

Double-precision FP Function

Single-precision FP Function

Description

AVG(x)

Returns the average value of x

COUNT()

Returns the count of the number of rows returned

COUNT(DISTINCT x)

Returns the count of distinct values of x

APPROX_COUNT_DISTINCT(x, e)

Returns the approximate count of distinct values of x with defined expected error rate e, where e is an integer from 1 to 100. If no value is set for e, the approximate count is calculated using the system-wide hll-precision-bits configuration parameter.

APPROX_MEDIAN(x)

Returns the approximate median of x. Two server configuration parameters affect memory usage:

APPROX_PERCENTILE(x,y)

Returns the approximate quantile of x, where y is the value between 0 and 1.

For example, y=0 returns MIN(x), y=1 returns MAX(x), and y=0.5 returns APPROX_MEDIAN(x).

MAX(x)

Returns the maximum value of x

MIN(x)

Returns the minimum value of x

SINGLE_VALUE

Returns the input value if there is only one distinct value in the input; otherwise, the query fails.

SUM(x)

Returns the sum of the values of x

SAMPLE(x)

Returns one sample value from aggregated column x. For example, the following query returns population grouped by city, along with one value from the state column for each group:
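A minimal sketch of such a query, assuming a hypothetical census table with city, state, and population columns:

SELECT city, SAMPLE(state), SUM(population) AS total_population
FROM census
GROUP BY city;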

Note: This was previously LAST_SAMPLE, which is now deprecated.

CORRELATION(x, y)

CORRELATION_FLOAT(x, y)

Alias of CORR. Returns the coefficient of correlation of a set of number pairs.

CORR(x, y)

CORR_FLOAT(x, y)

Returns the coefficient of correlation of a set of number pairs.

COUNT_IF(conditional_expr)

Returns the number of rows satisfying the given condition_expr.
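Example (assuming the flights_2008_10k table used in earlier examples, with hypothetical carrier_name and arrdelay columns):

SELECT carrier_name, COUNT_IF(arrdelay > 15) AS delayed_flights
FROM flights_2008_10k
GROUP BY carrier_name;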

COVAR_POP(x, y)

COVAR_POP_FLOAT(x, y)

Returns the population covariance of a set of number pairs.

COVAR_SAMP(x, y)

COVAR_SAMP_FLOAT(x, y)

Returns the sample covariance of a set of number pairs.

STDDEV(x)

STDDEV_FLOAT(x)

Alias of STDDEV_SAMP. Returns sample standard deviation of the value.

STDDEV_POP(x)

STDDEV_POP_FLOAT(x)

Returns the population standard deviation of the value.

STDDEV_SAMP(x)

STDDEV_SAMP_FLOAT(x)

Returns the sample standard deviation of the value.

SUM_IF(conditional_expr)

Returns the sum of all expression values satisfying the given condition_expr.

VARIANCE(x)

VARIANCE_FLOAT(x)

Alias of VAR_SAMP. Returns the sample variance of the value.

VAR_POP(x)

VAR_POP_FLOAT(x)

Returns the population variance of the value.

VAR_SAMP(x)

VAR_SAMP_FLOAT(x)

Returns the sample variance of the value.

Usage Notes

  • COUNT(DISTINCT x), especially when used in conjunction with GROUP BY, can require a very large amount of memory to keep track of all distinct values in large tables with large cardinalities. To avoid this large overhead, use APPROX_COUNT_DISTINCT.

  • APPROX_COUNT_DISTINCT(x, e) gives an approximate count of the value x, based on an expected error rate defined in e. The error rate is an integer value from 1 to 100. The lower the value of e, the higher the precision, and the higher the memory cost. Select a value for e based on the level of precision required. On large tables with large cardinalities, consider using APPROX_COUNT_DISTINCT when possible to preserve memory. When data cardinalities permit, OmniSci uses the precise implementation of COUNT(DISTINCT x) for APPROX_COUNT_DISTINCT. Set the default error rate using the -hll-precision-bits configuration parameter.

  • The accuracy of APPROX_MEDIAN(x) depends upon the distribution of data. For example:

    • For 100,000,000 integers (1, 2, 3, ... 100M) in random order, APPROX_MEDIAN can provide a highly accurate answer to 5+ significant digits.

    • For 100,000,001 integers, where 50,000,000 have value of 0 and 50,000,001 have value of 1, APPROX_MEDIAN returns a value close to 0.5, even though the median is 1.

  • Currently, OmniSci does not support grouping by non-dictionary-encoded strings. However, with the SAMPLE aggregate function, you can select non-dictionary-encoded strings that are presumed to be unique in a group. For example:

    SELECT user_name, SAMPLE(user_description) FROM tweets GROUP BY user_name;

    If the aggregated column (user_description in the example above) is not unique within a group, SAMPLE selects a value that might be nondeterministic because of the parallel nature of OmniSci query execution.

Miscellaneous Functions

Function

Description

SAMPLE_RATIO(x)

Returns a Boolean value, with the probability of True being returned for a row equal to the input argument. The input argument is a numeric value between 0.0 and 1.0. Negative input values return False, input values greater than 1.0 return True, and null input values return False.

The result of the function is deterministic per row; that is, all calls of the operator for a given row return the same result. The sample ratio is probabilistic, but is generally within a thousandth of a percentile of the actual range when the underlying dataset is millions of records or larger.

The following example filters approximately 50% of the rows from t and returns a count that is approximately half the number of rows in t:

SELECT COUNT(*) FROM t WHERE SAMPLE_RATIO(0.5)

User-Defined Functions

You can create your own C++ functions and use them in your SQL queries.

  • User-defined Functions (UDFs) require clang++ version 9. You can verify the version installed using the command clang++ --version.

  • UDFs currently allow any authenticated user to register and execute a runtime function. By default, runtime UDFs are globally disabled but can be enabled with the runtime flag enable-runtime-udf.

  1. Create your function and save it in a .cpp file; for example, /var/lib/omnisci/udf_myFunction.cpp.

  2. Add the UDF configuration flag to omnisci.conf. For example:

    udf = "/var/lib/omnisci/udf_myFunction.cpp"
  3. Use your function in a SQL query. For example:

    SELECT udf_myFunction FROM myTable

Sample User-Defined Function

This function, udf_diff.cpp, returns the difference of two values from a table.

#include <cstdint>
#if defined(__CUDA_ARCH__) && defined(__CUDACC__) && defined(__clang__)
#define DEVICE __device__
#define NEVER_INLINE
#define ALWAYS_INLINE
#else
#define DEVICE
#define NEVER_INLINE __attribute__((noinline))
#define ALWAYS_INLINE __attribute__((always_inline))
#endif
#define EXTENSION_NOINLINE extern "C" NEVER_INLINE DEVICE
EXTENSION_NOINLINE int32_t udf_diff(const int32_t x, const int32_t y) { return x - y; }

Code Commentary

Include the standard integer library, which supports the following datatypes:

  • bool

  • int8_t (cstdint), char

  • int16_t (cstdint), short

  • int32_t (cstdint), int

  • int64_t (cstdint), size_t

  • float

  • double

  • void

#include <cstdint>

The next block of lines is boilerplate code that allows OmniSci to determine whether the server is running with GPUs. OmniSci chooses whether it should compile the function inline to achieve the best possible performance.

#include <cstdint>
#if defined(__CUDA_ARCH__) && defined(__CUDACC__) && defined(__clang__)
#define DEVICE __device__
#define NEVER_INLINE
#define ALWAYS_INLINE
#else
#define DEVICE
#define NEVER_INLINE __attribute__((noinline))
#define ALWAYS_INLINE __attribute__((always_inline))
#endif
#define EXTENSION_NOINLINE extern "C" NEVER_INLINE DEVICE

The next line is the actual user-defined function, which returns the difference between INTEGER values x and y.

EXTENSION_NOINLINE int32_t udf_diff(const int32_t x, const int32_t y) { return x - y; }

To run the udf_diff function, add this line to your /var/lib/omnisci/omnisci.conf file (in this example, the .cpp file is stored at /var/lib/omnisci/udf_diff.cpp):

udf = "/var/lib/omnisci/udf_diff.cpp"

Restart the OmniSci server.

Use your command from an OmniSci SQL client to query, for example, a table named myTable that contains the INTEGER columns myInt1 and myInt2.

SELECT udf_diff(myInt1, myInt2) FROM myTable LIMIT 1;

OmniSci returns the difference as an INTEGER value.

tf_graph_shortest_paths_distances

Given a distance-weighted directed graph, consisting of a query CURSOR input with the starting and ending node for each edge and a distance, and a specified origin node, tf_graph_shortest_paths_distances computes the shortest distance-weighted path distance between the origin_node and every other node in the graph. It returns a row for each node in the graph, with output columns consisting of the input origin_node, the given destination_node, the distance for the shortest path between the two nodes, and the number of edges or graph "hops" between the two nodes. If origin_node does not exist in the node1 column of the edge_list CURSOR, an error is returned.

Input Arguments

Output Columns

Example A

Example B

tf_geo_rasterize

Aggregates point data into x/y bins of a given size in meters to form a dense spatial grid, computing the specified aggregate (using agg_type) across all points in each bin as the output value for the bin. Allowed aggregate types are AVG, COUNT, SUM, MIN, and MAX. If neighborhood_fill_radius is set greater than 0, a blur pass/kernel is computed on top of the results according to the optionally specified fill_agg_type, with allowed types of GAUSS_AVG, BOX_AVG, COUNT, SUM, MIN, and MAX (if not specified, defaults to GAUSS_AVG, a Gaussian-average kernel). If fill_only_nulls is set to true, only null bins from the first aggregate step have final output values computed from the blur pass; otherwise, all values are affected by the blur pass.

Note that the arguments to bound the spatial output grid (x_min, x_max, y_min, y_max) are optional; however, either all or none of these arguments must be supplied. If the arguments are not supplied, the spatial output grid is bounded by the x/y range of the input query, and if SQL filters are applied on the output of the tf_geo_rasterize table function, these filters also constrain the output range.

Input Arguments

Output Columns

Example

tf_load_point_cloud

Loads one or more las or laz point cloud/LiDAR files from a local file or directory source, optionally transforming the output SRID to out_srs (if not specified, output points are automatically transformed to EPSG:4326 lon/lat pairs).

If use_cache is set to true, an internal point cloud-specific cache is used to hold the results per input file; if the same file is queried again, this significantly speeds up the query, allowing for interactive querying of a point cloud source. If the results of tf_load_point_cloud will only be consumed once (for example, as part of a CREATE TABLE statement), it is highly recommended that use_cache be set to false or left unspecified (it defaults to false) to avoid the performance and memory overhead incurred by use of the cache.

The bounds of the data retrieved can be optionally specified with the x_min, x_max, y_min, y_max arguments. These arguments can be useful when the user desires to retrieve a small geographic area from a large point-cloud file set, as files containing data outside the bounds of the specified bounding box will be quickly skipped by tf_load_point_cloud, only requiring a quick read of the spatial metadata for the file.

Input Arguments

Output Columns

Example A

Example B

tf_feature_similarity

Given a query input of entity keys, feature columns, and a metric column, and a second query input specifying a search vector of feature columns and a metric, scores each entity in the first input against the search vector. The score is computed as the cosine similarity of the feature column(s) for each entity with the feature column(s) for the search vector, which can optionally be TF/IDF weighted.

Input Arguments

Output Columns

Example

tf_feature_self_similarity

Given a query input of entity keys/IDs (for example, airplane tail numbers), a set of feature columns (for example, airports visited), and a metric column (for example number of times each airport was visited), scores each pair of entities based on their similarity. The score is computed as the cosine similarity of the feature column(s) between each entity pair, which can optionally be TF/IDF weighted.

Input Arguments

Output Columns

Example

tf_graph_shortest_path

Given a distance-weighted directed graph, consisting of a query CURSOR input with the starting and ending node for each edge and a distance, and a specified origin and destination node, tf_graph_shortest_path computes the shortest distance-weighted path through the graph between origin_node and destination_node, returning a row for each node along the computed shortest path, with the traversal-ordered index of that node and the cumulative distance from the origin_node to that node. If either origin_node or destination_node do not exist, an error is returned.

Input Arguments

Output Columns

Example A

Example B

tf_geo_rasterize_slope

Similar to tf_geo_rasterize, but also computes the slope and aspect per output bin. Aggregates point data into x/y bins of a given size in meters to form a dense spatial grid, computing the specified aggregate (using agg_type) across all points in each bin as the output value for the bin. A Gaussian average is then taken over the neighboring bins, with the number of bins specified by neighborhood_fill_radius, optionally only filling in null-valued bins if fill_only_nulls is set to true. The slope and aspect are then computed for every bin, based on the z values of that bin and its neighboring bins. The slope can be returned in degrees or as a fraction between 0 and 1, depending on the boolean argument to compute_slope_in_degrees.

Note that the bounds of the spatial output grid will be bounded by the x/y range of the input query, and if SQL filters are applied on the output of the tf_geo_rasterize_slope table function, these filters will also constrain the output range.

Input Arguments

Output Columns

Example

tf_mandelbrot*

tf_mandelbrot

Example

tf_mandelbrot_cuda

tf_mandelbrot_float

tf_mandelbrot_cuda_float

Expression representing a set of window aggregates. See Window Functions.

For information about geospatial datatype sizes, see Storage and Compression in Datatypes.

For more information on WKT primitives, see Wikipedia: Well-known Text: Geometric objects.

HEAVY.AI supports SRID 4326 (WGS 84), 900913 (Google Web Mercator), and 32601-32660 / 32701-32760 (Universal Transverse Mercator (UTM) zones). When using geospatial fields, you set the SRID to determine which reference system to use. HEAVY.AI does not assign a default SRID.

For information about importing data, see Importing Geospatial Data.

See the tables in Geospatial Functions below for examples.

Set the SRID to a specific integer value. For example:
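A minimal sketch, using a hypothetical table and column name:

/* Declare a geospatial column with SRID 4326 (lon/lat) */
CREATE TABLE geo_points (pt GEOMETRY(POINT, 4326));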

You can use BIGINT, INTEGER, SMALLINT, TINYINT, DATE, TIME, TIMESTAMP, or TEXT ENCODING DICT data types. TEXT ENCODING DICT is the most efficient because corresponding dictionary IDs are sequential and span a smaller range than, for example, the 65,535 values supported in a SMALLINT field. Depending on the number of values in your field, you can use TEXT ENCODING DICT(32) (up to approximately 2,150,000,000 distinct values), TEXT ENCODING DICT(16) (up to 64,000 distinct values), or TEXT ENCODING DICT(8) (up to 255 distinct values). For more information, see Data Types and Fixed Encoding.
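For instance, a minimal sketch assuming a hypothetical visits table whose state column holds at most 255 distinct values, so an 8-bit dictionary is sufficient:

CREATE TABLE visits (
  visit_ts TIMESTAMP,
  state TEXT ENCODING DICT(8)
);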

HEAVY.AI provides access to a set of system-provided table functions, also known as table-valued functions (TVFs). System table functions, like user-defined table functions, support execution of queries on both CPU and GPU over one or more SQL result-set inputs. Table function support in HEAVY.AI can be split into two broad categories: system table functions and user-defined table functions (UDTFs). System table functions are built in to the HEAVY.AI server, while UDTFs can be declared dynamically at run time by specifying them in Numba, a subset of the Python language. For more information on UDTFs, see User-Defined Table Functions.

Computes the Mandelbrot set over the complex domain [x_min, x_max), [y_min, y_max), discretizing the xy-space into an output of dimensions x_pixels X y_pixels.

For information about the HeavyRF radio frequency propagation simulation and HeavyRF table functions, see HeavyRF.

pattern uses POSIX regular expression syntax.

Search string str for pattern, which is a regular expression in POSIX regex syntax, and return the matching substring.


Accuracy of APPROX_MEDIAN depends on the distribution of data.


Computes the Mandelbrot set over the complex domain [x_min, x_max), [y_min, y_max), discretizing the xy-space into an output of dimensions x_pixels X y_pixels. The output for each cell is the number of iterations needed to escape to infinity, up to and including the specified max_iterations.

heavysql> \d arr
CREATE TABLE arr (
sia SMALLINT[])
omnisql> select sia, CARDINALITY(sia) from arr;
sia|EXPR$0
NULL|NULL
{}|0
{NULL}|1
{1}|1
{2,2}|2
{3,3,3}|3
SELECT * FROM TABLE(
    tf_graph_shortest_paths_distances(
        edge_list => CURSOR(
            SELECT node1, node2, distance FROM table
        ),
        origin_node => <origin node>
    )
)

node1

Origin node column in directed edge list CURSOR

Column<INT | BIGINT | TEXT ENCODED DICT>

node2

Destination node column in directed edge list CURSOR

Column<INT | BIGINT | TEXT ENCODED DICT> (must be the same type as node1)

distance

Distance between origin and destination node in directed edge list CURSOR

Column<INT | BIGINT | FLOAT | DOUBLE>

origin_node

The origin node from which to start graph traversal. If the value is not present in edge_list.node1, an empty result set is returned.

BIGINT | TEXT ENCODED DICT

origin_node

Starting node in graph traversal. Always equal to input origin_node.

Column <INT | BIGINT | TEXT ENCODED DICT> (same type as the node1 and node2 input columns)

destination_node

Final node in graph traversal. Equal to one of the values of the node2 input column.

Column <INT | BIGINT | TEXT ENCODED DICT> (same type as the node1 and node2 input columns)

distance

Cumulative distance between origin and destination node for shortest path graph traversal.

Column<INT | BIGINT | FLOAT | DOUBLE> (same type as the distance input column)

num_edges_traversed

Number of edges (or "hops") traversed in the graph to arrive at destination_node from origin_node for the shortest path graph traversal between these two nodes.

Column <INT>

/* Compute the 10 furthest destination airports as measured by average travel-time
when departing origin airport 'RDU' (Raleigh-Durham, NC) on United Airlines for the
year 2008, adding 60 minutes for each leg to account for boarding/plane change time
costs. */

SELECT
  *
FROM
  TABLE(
    tf_graph_shortest_paths_distances(
      edge_list => CURSOR(
        SELECT
          origin,
          dest,
          /* Add 60 minutes to each leg to account for boarding/plane change costs */
          AVG(airtime) + 60 as avg_airtime
        FROM
          flights_2008
        WHERE
          carrier_name = 'United Air Lines'
        GROUP by
          origin,
          dest
      ),
      origin_node => 'RDU'
    )
  )
ORDER BY
  distance DESC
LIMIT
  10;
  
origin_node|destination_node|distance|num_edges_traversed
RDU|JFK|803|3
RDU|LIH|757|2
RDU|KOA|746|2
RDU|HNL|735|2
RDU|OGG|728|2
RDU|EUG|595|3
RDU|ANC|586|2
RDU|SJC|468|2
RDU|SFO|468|2
RDU|OAK|468|2
/* Compute the all-destinations path distances along a time-traversal weighted
edge graph of roads in the Eastern United States from a location in North Carolina,
joining to a node locations table to output the lon/lat pairs of each destination node. */

select
  destination_node,
  lon,
  lat,
  distance,
  num_edges_traversed
from
  table(
    tf_graph_shortest_paths_distances(
      cursor(
        select
          node1,
          node2,
          traversal_time
        from
          usa_roads_east_time
      ),
      1561955
    )
  ),
  USA_roads_east_coords
where
  destination_node = node_id
order by
  distance desc
limit
  20;
  
destination_node|lon|lat|distance|num_edges_traversed
2228153|-69.74701|46.941648|22021532|5387
324156|-69.67822799999999|46.990543|21916494|5386
324151|-69.687833|46.933106|21906798|5386
1372661|-69.64962799999999|46.942144|21830101|5385
320610|-69.47672399999999|46.967413|21807384|5379
324152|-69.637714|46.958516|21798959|5385
1372667|-69.633437|46.95189999999999|21793379|5385
1372662|-69.63483099999999|46.954334|21786119|5384
2228156|-69.622767|46.949534|21768541|5383
1372670|-69.58720599999999|46.942504|21759257|5382
1372663|-69.62387099999999|46.968569|21741445|5383
2226724|-69.557773|46.969276|21714682|5381
324159|-69.607209|46.967823|21709789|5382
324160|-69.59385999999999|46.967445|21691648|5382
2228155|-69.59575599999999|46.967461|21688053|5381
320578|-69.57176699999999|47.067628|21683322|5377
1372669|-69.58906999999999|46.977104|21675010|5382
2226740|-69.582106|46.991048|21673764|5379
320609|-69.55000199999999|46.966089|21668411|5378
324158|-69.585776|46.973521|21663260|5381
SELECT * FROM TABLE(
  tf_geo_rasterize(
      raster => CURSOR(
        SELECT 
           x, y, z FROM table
      ),
      agg_type => <'AVG'|'COUNT'|'SUM'|'MIN'|'MAX'>,
      /* fill_agg_type is optional */
      [<fill_agg_type> => <'AVG'|'COUNT'|'SUM'|'MIN'|'MAX'|'GAUSS_AVG'|'BOX_AVG'>,] 
      bin_dim_meters => <meters>, 
      geographic_coords => <true/false>, 
      neighborhood_fill_radius => <radius in bins>,
      fill_only_nulls => <true/false> [,
      <x_min> => <minimum output x-coordinate>,
      <x_max> => <maximum output x-coordinate>,
      <y_min> => <minimum output y-coordinate>,
      <y_max> => <maximum output y-coordinate>]
    ) 
  )...

x

X-coordinate column or expression

Column<FLOAT | DOUBLE>

y

Y-coordinate column or expression

Column<FLOAT | DOUBLE>

z

Z-coordinate column or expression. The value for each output bin is computed from the z-values of all points falling in that bin, using the aggregate specified by agg_type.

Column<FLOAT | DOUBLE>

agg_type

The aggregate to be performed to compute the output z-column. Should be one of 'AVG', 'COUNT', 'SUM', 'MIN', or 'MAX'.

TEXT ENCODING NONE

fill_agg_type (optional)

The aggregate to be performed when computing the blur pass on the output bins. Should be one of 'AVG', 'COUNT', 'SUM', 'MIN', 'MAX', 'GAUSS_AVG', or 'BOX_AVG'. Note that AVG is synonymous with GAUSS_AVG in this context, and the default fill_agg_type if not specified is GAUSS_AVG.

TEXT ENCODING NONE

bin_dim_meters

The width and height of each x/y bin in meters. If geographic_coords is not set to true, the input x/y units are already assumed to be in meters.

DOUBLE

geographic_coords

If true, specifies that the input x/y coordinates are in lon/lat degrees. The function will then compute a mapping of degrees to meters based on the center coordinate between x_min/x_max and y_min/y_max.

BOOLEAN

neighborhood_fill_radius

The radius in bins to compute the box blur/filter over, such that each output bin will be the average value of all bins within neighborhood_fill_radius bins.

DOUBLE

fill_only_nulls

Specifies that the box blur should only be used to provide output values for null output bins (i.e. bins that contained no data points or had only data points with null Z-values).

BOOLEAN

x_min (optional)

Min x-coordinate value (in input units) for the spatial output grid.

DOUBLE

x_max (optional)

Max x-coordinate value (in input units) for the spatial output grid.

DOUBLE

y_min (optional)

Min y-coordinate value (in input units) for the spatial output grid.

DOUBLE

y_max (optional)

Max y-coordinate value (in input units) for the spatial output grid.

DOUBLE

x

The x-coordinates for the centroids of the output spatial bins.

Column<FLOAT | DOUBLE> (same as input x-coordinate column/expression)

y

The y-coordinates for the centroids of the output spatial bins.

Column<FLOAT | DOUBLE> (same as input y-coordinate column/expression)

z

The computed aggregate z-value of all input data assigned to a given spatial bin.

Column<FLOAT | DOUBLE> (same as input z-coordinate column/expression)

/* Bin 10cm USGS LiDAR from Tallahassee to 1 meter, taking the minimum z-value
for each xy-bin. Then for each xy-bin, perform a Gaussian-average over the neighboring
100 xy-bins. This query yields the approximate terrain for an area after removing human-made
structures (due to the wide 100-bin Gaussian-average window), as can be seen in the 
right-hand render result in the screenshot below. Note that the LIMIT was only
applied to this SQL query and is not used in the rendered-screenshot below. */

SELECT
  x,
  y,
  z
FROM
  TABLE(
    tf_geo_rasterize(
      raster => CURSOR(
        SELECT
          ST_X(pt),
          ST_Y(pt),
          z
        FROM
          USGS_LPC_FL_LeonCo_2018_049377_N_LAS_2019
      ),
      bin_dim_meters => 1,
      geographic_coords => TRUE,
      neighborhood_fill_radius => 100,
      fill_only_nulls => FALSE,
      agg_type => 'MIN',
      fill_agg_type => 'GAUSS_AVG'
    )
  ) limit 20;
  
x|y|z
-84.29857764791747|30.40240526206634|-15.30264
-84.29086331121893|30.40264801040913|-17.25718
-84.29856722313815|30.40240526206634|-15.31047
-84.29855679835883|30.40240526206634|-15.31835
-84.29085288643959|30.40264801040913|-17.25859
-84.2985463735795|30.40240526206634|-15.32627
-84.30278925876371|30.402198476441|-17.09047
-84.29084246166028|30.40264801040913|-17.25993
-84.30277883398438|30.402198476441|-17.10194
-84.29853594880018|30.40240526206634|-15.33422
-84.30276840920506|30.402198476441|-17.11329
-84.29083203688096|30.40264801040913|-17.26122
-84.30275798442574|30.402198476441|-17.12446
-84.29852552402086|30.40240526206634|-15.34223
-84.30274755964642|30.402198476441|-17.1354
-84.29878614350392|30.40263002905041|-14.74146
-84.29119690415723|30.40236030866953|-17.22919
-84.30449892257258|30.40238728070761|-15.9867
-84.29328186002171|30.40223443915845|-17.63177
-84.29432433795395|30.40263901972977|-17.85748  
SELECT * FROM TABLE(
    tf_load_point_cloud(
        path => <path>,
        [out_srs => <out_srs>,
        use_cache => <use_cache>,
        x_min => <x_min>,
        x_max => <x_max>,
        y_min => <y_min>,
        y_max => <y_max>]
    )
)    

path

The path of the file or directory containing the las/laz file or files. Can contain globs. Path must be in allowed-import-paths.

TEXT ENCODING NONE

out_srs (optional)

EPSG code of the output SRID. If not specified, output points are automatically converted to lon/lat (EPSG 4326).

TEXT ENCODING NONE

use_cache (optional)

If true, use the internal point cloud cache. Useful for inline querying of the output of tf_load_point_cloud. Turn off for one-shot queries or when creating a table from the output, as adding data to the cache incurs performance and memory-usage overhead. If not specified, defaults to false (off).

BOOLEAN

x_min (optional)

Min x-coordinate value (in degrees) for the output data.

DOUBLE

x_max (optional)

Max x-coordinate value (in degrees) for the output data.

DOUBLE

y_min(optional)

Min y-coordinate value (in degrees) for the output data.

DOUBLE

y_max (optional)

Max y-coordinate value (in degrees) for the output data.

DOUBLE

CREATE TABLE wake_co_lidar_test AS
SELECT
  *
FROM
  TABLE(
    tf_load_point_cloud(
      path => '/path/to/20150118_LA_37_20066601.laz'
    )
  );
SELECT
  x, y, z, classification
FROM
  TABLE(
    tf_load_point_cloud(
      path => '/path/to/las_files/*.las',
      out_srs => 'EPSG:4326',
      use_cache => true,
      y_min => 37.0,
      y_max => 38.0,
      x_min => -123.0,
      x_max => -122.0
    )
  )
SELECT
  *
FROM
  TABLE(
    tf_feature_similarity(
      primary_features => CURSOR(
        SELECT
          primary_key,
          pivot_features,
          metric
        from
          table
        where
          ...
        group by
          primary_key,
          pivot_features
      ),
      comparison_features => CURSOR(
        SELECT
          comparison_metric
        from
          table
        where
          ...
        group by <column>
      ),
      use_tf_idf => <boolean>
    )
  )

class

ID of the primary key being compared against the search vector.

Column<TEXT ENCODING DICT | INT | BIGINT> (type will be the same as the primary_key input column)

similarity_score

Computed cosine similarity score between each primary_key pair, with values falling between 0 (completely dissimilar) and 1 (completely similar).

Column<FLOAT>

/* Compute the similarity of US airline flight nums to a particular
Delta flight (DAL795) based on the cosine similarity of the overlap of
flight paths binned to a H3 Hex at zoom level 7 (roughly 5 sq km),  
and return the top 10 most similar flight nums */

SELECT
  *
FROM
  TABLE(
    tf_feature_similarity(
      primary_features => CURSOR(
        SELECT
          callsign,
          geotoh3(st_x(location), st_y(location), 7) as h3,
          count(*) as n
        from
          adsb_2021_03_01
        where
          operator in (
            'Delta Air Lines',
            'Alaska Airlines',
            'Southwest Airlines',
            'American Airlines',
            'United Airlines'
          )
          and altitude >= 1000
        group by
          callsign,
          h3
      ),
      comparison_features => CURSOR(
        SELECT
          geotoh3(st_x(location), st_y(location), 7) as h3,
          COUNT(*) as n
        from
          adsb_2021_03_01
        where
          callsign = 'DAL795'
          and altitude >= 1000
        group by
          h3
      ),
      use_tf_idf => false
    )
  )
ORDER BY
  similarity_score desc
limit
  10;
  
class|similarity_score
DAL795|1
DAL538|0.610889
DAL1192|0.3419932
DAL1185|0.3391671
SWA4346|0.3206964
DAL365|0.3037131
SWA953|0.2912168
UAL1559|0.2747431
SWA2098|0.2511763
DAL526|0.2473387
select * from table(
  tf_feature_self_similarity(
    primary_features => cursor(
      select
        primary_key,
        pivot_features,
        metric
      from
        table
      group by
        primary_key,
        pivot_features
    ),
    use_tf_idf => <boolean>))

class1

ID of the first primary key in the pair-wise comparison.

Column<TEXT ENCODING DICT | INT | BIGINT> (type is the same as the primary_key input column)

class2

ID of the second primary key in the pair-wise comparison. Because the computed similarity score for a pair of primary keys is order-invariant, results are output only for ordering such that class1 <= class2. For primary keys of type TextEncodingDict, the order is based on the internal integer IDs for each string value and not lexicographic ordering.

Column<TEXT ENCODING DICT | INT | BIGINT> (type is the same as the primary_key input column)

similarity_score

Computed cosine similarity score between each primary_key pair, with values falling between 0 (completely dissimilar) and 1 (completely similar).

Column<Float>

/* Compute similarity of airlines by the airports they fly from */

select
  *
from
  table(
    tf_feature_self_similarity(
      primary_features => cursor(
        select
          carrier_name,
          origin,
          count(*) as num_flights
        from
          flights_2008
        group by
          carrier_name,
          origin
      ),
      use_tf_idf => false
    )
  )
where
  similarity_score <= 0.99
order by
  similarity_score desc
limit
  20;
  
class1|class2|similarity_score
Expressjet Airlines|Continental Air Lines|0.9564615
Delta Air Lines|Atlantic Southeast Airlines|0.9436753
Delta Air Lines|AirTran Airways Corporation|0.9379856
Atlantic Southeast Airlines|AirTran Airways Corporation|0.9326661
American Eagle Airlines|American Airlines|0.8906327
Northwest Airlines|Pinnacle Airlines|0.8222722
Skywest Airlines|United Air Lines|0.6857293
Mesa Airlines|US Airways|0.6116939
United Air Lines|Frontier Airlines|0.5921053
Mesa Airlines|United Air Lines|0.5686765
United Air Lines|American Eagle Airlines|0.5272493
Skywest Airlines|Frontier Airlines|0.4684323
Southwest Airlines|US Airways|0.4166781
United Air Lines|American Airlines|0.397027
Comair|JetBlue Airways|0.3631534
Mesa Airlines|American Eagle Airlines|0.3379275
Skywest Airlines|American Eagle Airlines|0.3331468
Mesa Airlines|Skywest Airlines|0.3235496
Comair|Delta Air Lines|0.3075919
Southwest Airlines|Mesa Airlines|0.2901711

/* Compute the similarity of US States by the TF-IDF
 weighted cosine similarity of the words tweeted in each state */
 
 select
  *
from
  table(
    tf_feature_self_similarity(
      primary_features => cursor(
        select
          state_abbr,
          unnest(tweet_tokens),
          count(*)
        from
          tweets_2022_06
        where country = 'US'
        group by
          state_abbr,
          unnest(tweet_tokens)
      ),
      use_tf_idf => TRUE
    )
  )
where
  class1 <> class2
order by
  similarity_score desc;
  
TX|GA|0.9928479
IL|TN|0.9920474
IL|NC|0.9920027
TX|IL|0.9917723
IN|OH|0.9916649
TN|NC|0.9915619
CA|TX|0.9910875
IN|VA|0.9909871
CA|IL|0.9909689
IL|OH|0.9909481
TX|NC|0.9908867
IL|MO|0.9907863
IN|MI|0.990751
TN|OH|0.9907123
IL|MD|0.9907106
OH|NC|0.9905779
VA|OH|0.990536
IN|IL|0.9904549
IN|MO|0.9903805
TX|TN|0.9903381
SELECT * FROM TABLE(
    tf_graph_shortest_path(
        edge_list => CURSOR(
            SELECT node1, node2, distance FROM table
        ),
        origin_node => <origin node>,
        destination_node => <destination node>
    )
)

node1

Origin node column in directed edge list CURSOR

Column< INT | BIGINT | TEXT ENCODED DICT>

node2

Destination node column in directed edge list CURSOR

Column< INT | BIGINT | TEXT ENCODED DICT> (must be the same type as node1)

distance

Distance between origin and destination node in directed edge list CURSOR

Column< INT | BIGINT | FLOAT | DOUBLE >

origin_node

The origin node from which to start graph traversal. If the value is not present in edge_list.node1, an empty result set is returned.

BIGINT | TEXT ENCODED DICT

destination_node

The destination node at which to finish the graph traversal. If the value is not present in edge_list.node1, an empty result set is returned.

BIGINT | TEXT ENCODED DICT

path_step

The index of this node along the path traversal from origin_node to destination_node, with the first node (the origin_node) indexed as 1.

Column< INT >

node

The current node along the path traversal from origin_node to destination_node. The first node (as denoted by path_step = 1) will always be the input origin_node, and the final node (as denoted by MAX(path_step)) will always be the input destination_node.

Column < INT | BIGINT | TEXT ENCODED DICT> (same type as the node1 and node2 input columns)

cume_distance

The cumulative distance adding all input distance values from the origin_node to the current node.

Column < INT | BIGINT | FLOAT | DOUBLE> (same type as the distance input column)

/* Compute the shortest flight route on United Airlines for the year 2008 as measured
by flight time between origin airport 'RDU' (Raleigh-Durham, NC) and destination 
airport 'SAT' (San Antonio, TX), adding 60 minutes for each leg to account for 
boarding/plane change time costs, and only counting routes that were flown at least
300 times during the year. */
 
SELECT
  *
FROM
  TABLE(
    tf_graph_shortest_path(
      edge_list => CURSOR(
        SELECT
          origin,
          dest,
          /* Add 60 minutes to each leg to account
          for boarding/plane change costs */
          AVG(airtime) + 60 as avg_airtime
        FROM
          flights_2008
        WHERE
          carrier_name = 'United Air Lines'
        GROUP by
          origin,
          dest
        HAVING
          COUNT(*) > 300
      ),
      origin_node => 'RDU',
      destination_node => 'SAT'
    )
  )
ORDER BY
  path_step
 
path_step|node|cume_distance
1|RDU|0
2|ORD|167
3|DEN|354
4|SAT|519
/* Compute the shortest path along a time-traversal-weighted edge graph
of roads in the Eastern United States between a location in North Carolina and
a location in Maine, joining to a node locations table to output the lon/lat pairs 
of each node. */

select
  path_step,
  node,
  lon,
  lat,
  cume_distance
from
  table(
    tf_graph_shortest_path(
      cursor(
        select
          node1,
          node2,
          traversal_time
        from
          usa_roads_east_time
      ),
      1561955,
      1591319
    )
  ),
  USA_roads_east_coords
where
  node = node_id 
order by 
  cume_distance desc
limit 20;

path_step|node|lon|lat|cume_distance
4380|1591319|-71.55136299999999|43.75256|13442017
4379|1591989|-71.55174099999999|43.75245|13441199
4378|1589348|-71.554147|43.752464|13436371
4377|2315795|-71.554867|43.752489|13434924
4376|1589286|-71.55497099999999|43.752113|13434214
4375|1589285|-71.555049|43.751833|13433685
4374|2315785|-71.555999|43.750704|13431238
4373|2315973|-71.55798799999999|43.748622|13426553
4372|2315950|-71.56366299999999|43.746268|13417798
4371|1589788|-71.56476599999999|43.745765|13416053
4370|1591997|-71.56484|43.745691|13415884
4369|1589787|-71.564886|43.745645|13415779
4368|2315951|-71.56517599999999|43.745353|13415113
4367|2315952|-71.56659499999999|43.744599|13412756
4366|1591999|-71.56685899999999|43.744565|13412397
4365|543394|-71.567357|43.744335|13411606
4364|543393|-71.567832|43.744116|13410852
4363|543392|-71.571827|43.743673|13405444
4362|541181|-71.57268499999999|43.743802|13404271
4361|1589786|-71.572964|43.743844|13403890
SELECT * FROM TABLE(
  tf_geo_rasterize_slope(
      raster => CURSOR(
        SELECT 
           x, y, z FROM table
      ),
      agg_type => <'AVG'|'COUNT'|'SUM'|'MIN'|'MAX'>,
      bin_dim_meters => <meters>, 
      geographic_coords => <true/false>, 
      neighborhood_fill_radius => <radius in bins>,
      fill_only_nulls => <true/false>,
      compute_slope_in_degrees => <true/false>
    )
 ) 

x

Input x-coordinate column or expression.

Column<FLOAT | DOUBLE>

y

Input y-coordinate column or expression.

Column<FLOAT | DOUBLE>

z

Input z-coordinate column or expression. The output bin value is computed by applying the specified agg_type aggregate to the z-values of all points falling in each bin.

Column<FLOAT | DOUBLE>

agg_type

The aggregate to be performed to compute the output z-column. Should be one of 'AVG', 'COUNT', 'SUM', 'MIN', or 'MAX'.

TEXT ENCODING NONE

bin_dim_meters

The width and height of each x/y bin in meters. If geographic_coords is not set to true, the input x/y units are already assumed to be in meters.

DOUBLE

geographic_coords

If true, specifies that the input x/y coordinates are in lon/lat degrees. The function will then compute a mapping of degrees to meters based on the center coordinate between x_min/x_max and y_min/y_max.

BOOLEAN

neighborhood_fill_radius

The radius in bins to compute the box blur/filter over, such that each output bin will be the average value of all bins within neighborhood_fill_radius bins.

BIGINT

fill_only_nulls

Specifies that the box blur should only be used to provide output values for null output bins (i.e. bins that contained no data points or had only data points with null Z-values).

BOOLEAN

compute_slope_in_degrees

If true, specifies the slope should be computed in degrees (with 0 degrees perfectly flat and 90 degrees perfectly vertical). If false, specifies the slope should be computed as a fraction from 0 (flat) to 1 (vertical). In a future release, we are planning to move the default output to percentage slope.

BOOLEAN

x

The x-coordinates for the centroids of the output spatial bins.

Column<FLOAT | DOUBLE> (same as input x column/expression)

y

The y-coordinates for the centroids of the output spatial bins.

Column<FLOAT | DOUBLE> (same as input y column/expression)

z

The aggregated z-value (per agg_type) of all input data assigned to a given spatial bin.

Column<FLOAT | DOUBLE> (same as input z column/expression)

slope

The average slope of an output grid cell (in degrees or a fraction between 0 and 1, depending on the argument to compute_slope_in_degrees).

Column<FLOAT | DOUBLE> (same as input z column/expression)

aspect

The direction from 0 to 360 degrees pointing towards the maximum downhill gradient, with 0 degrees being due north and moving clockwise from N (0°) -> NE (45°) -> E (90°) -> SE (135°) -> S (180°) -> SW (225°) -> W (270°) -> NW (315°).

Column<FLOAT | DOUBLE> (same as input z column/expression)

/* Compute the slope and aspect for a 30-meter Copernicus
Digital Elevation Model (DEM) raster, binned to 90 meters */

select
  *
from
  table(
    tf_geo_rasterize_slope(
      raster => cursor(
        select
          st_x(raster_point),
          st_y(raster_point),
          CAST(z AS float)
        from
          copernicus_30m_mt_everest
      ),
      agg_type => 'AVG',
      bin_dim_meters => 90.0,
      geographic_coords => true,
      neighborhood_fill_radius => 1,
      fill_only_nulls => false,
      compute_slope_in_degrees => true
    )
  )
order by
  slope desc nulls last
limit
  20;
  
x|y|z|slope|aspect
86.96533511629579|27.96534132281817|6212.096|78.37033|18.09232
87.23751907091268|27.78489838800869|3793.584|78.17864|125.03
87.23660262662104|27.78408922686605|3929.989|78.06877|127.629
86.96625156058742|27.96534132281817|6041.277|78.00574|19.00616
87.2356861823294|27.78328006572341|3981.662|77.53327|127.3175
86.96441867200414|27.96615048396082|5869.373|77.3751|20.82031
86.95800356196267|27.96857796738875|6083.791|77.13709|29.89468
86.96350222771251|27.96615048396082|6081.35|77.08266|21.6792
87.23843551520432|27.78570754915134|3630.32|77.04676|125.2154
86.96441867200414|27.96534132281817|6378.94|76.95021|17.77107
87.22468885082972|27.81321902800121|4771.554|76.71017|253.2764
87.2356861823294|27.78247090458076|3520.049|76.63997|113.6511
87.23660262662104|27.78328006572341|3445.282|76.38319|127.2889
86.96716800487906|27.96534132281817|5864.711|76.16835|19.27573
87.23476973803776|27.78166174343812|3945.683|76.13519|102.7789
86.95708711767104|27.96857796738875|6336.072|76.13168|24.90349
87.22468885082972|27.81240986685857|4732.937|76.07494|264.7046
87.23751907091268|27.78408922686605|3367.659|76.0099|126.7463
86.9589200062543|27.9677688062461|6223.083|75.46346|26.85898
87.22377240653809|27.81402818914385|4704.619|75.41299|205.3219
SELECT * FROM TABLE(
  tf_mandelbrot( 
    x_pixels => <x_pixels>,
    y_pixels => <y_pixels>,
    x_min => <x_min>,
    x_max => <x_max>,
    y_min => <y_min>,
    y_max => <y_max>,
    max_iterations => <max_iterations>
  )
)  

x_pixels

32-bit integer

y_pixels

32-bit integer

x_min

DOUBLE

x_max

DOUBLE

y_min

DOUBLE

y_max

DOUBLE

max_iterations

32-bit integer
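
A minimal example, with illustrative parameter values covering the classic Mandelbrot domain on a 1024 x 1024 grid:

SELECT
  *
FROM
  TABLE(
    tf_mandelbrot(
      x_pixels => 1024,
      y_pixels => 1024,
      x_min => -2.5,
      x_max => 1.0,
      y_min => -1.25,
      y_max => 1.25,
      max_iterations => 256
    )
  );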

SELECT * FROM TABLE(
  tf_mandelbrot_cuda( <x_pixels>, <y_pixels>, <x_min>, <x_max>, <y_min>, <y_max>, <max_iterations>
  )
)

x_pixels

32-bit integer

y_pixels

32-bit integer

x_min

DOUBLE

x_max

DOUBLE

y_min

DOUBLE

y_max

DOUBLE

max_iterations

32-bit integer

SELECT * FROM TABLE(
  tf_mandelbrot_float(<x_pixels>, <y_pixels>, <x_min>, <x_max>, <y_min>, <y_max>, <max_iterations>
  )
)

x_pixels

32-bit integer

y_pixels

32-bit integer

x_min

DOUBLE

x_max

DOUBLE

y_min

DOUBLE

y_max

DOUBLE

max_iterations

32-bit integer

SELECT * FROM TABLE(
  tf_mandelbrot_cuda_float( <x_pixels>, <y_pixels>, <x_min>, <x_max>, <y_min>, <y_max>, <max_iterations>
  )
)

x_pixels

32-bit integer

y_pixels

32-bit integer

x_min

DOUBLE

x_max

DOUBLE

y_min

DOUBLE

y_max

DOUBLE

max_iterations

32-bit integer

tf_point_cloud_metadata

Returns metadata for one or more las or laz point cloud/LiDAR files from a local file or directory source, optionally constraining the bounding box for metadata retrieved to the lon/lat bounding box specified by the x_min, x_max, y_min, y_max arguments.

Note: The specified path must be contained in the global allowed-import-paths configuration; otherwise, an error is returned.

SELECT * FROM TABLE(
    tf_point_cloud_metadata(
        path => <path>,
        [x_min => <x_min>,
        x_max => <x_max>,
        y_min => <y_min>,
        y_max => <y_max>]
    )
)

Input Arguments

Parameter
Description
Data Types

path

The path of the file or directory containing the las/laz file or files. Can contain globs. Path must be in allowed-import-paths.

TEXT ENCODING NONE

x_min (optional)

Min x-coordinate value for point cloud files to retrieve metadata from.

DOUBLE

x_max (optional)

Max x-coordinate value for point cloud files to retrieve metadata from.

DOUBLE

y_min (optional)

Min y-coordinate value for point cloud files to retrieve metadata from.

DOUBLE

y_max (optional)

Max y-coordinate value for point cloud files to retrieve metadata from.

DOUBLE

Output Columns

Name
Description
Data Types

file_path

Full path for the las or laz file

Column<TEXT ENCODING DICT>

file_name

Filename for the las or laz file

Column<TEXT ENCODING DICT>

file_source_id

File source id per file metadata

Column<SMALLINT>

version_major

LAS version major number

Column<SMALLINT>

version_minor

LAS version minor number

Column<SMALLINT>

creation_year

Data creation year

Column<SMALLINT>

is_compressed

Whether data is compressed, i.e. LAZ format

Column<BOOLEAN>

num_points

Number of points in this file

Column<BIGINT>

num_dims

Number of data dimensions for this file

Column<SMALLINT>

point_len

Not currently used

Column<SMALLINT>

has_time

Whether data has time value

COLUMN<BOOLEAN>

has_color

Whether data contains rgb color value

COLUMN<BOOLEAN>

has_wave

Whether data contains wave info

COLUMN<BOOLEAN>

has_infrared

Whether data contains infrared value

COLUMN<BOOLEAN>

has_14_point_format

Data adheres to 14-attribute standard

COLUMN<BOOLEAN>

specified_utm_zone

UTM zone of data

Column<INT>

x_min_source

Minimum x-coordinate in source projection

Column<DOUBLE>

x_max_source

Maximum x-coordinate in source projection

Column<DOUBLE>

y_min_source

Minimum y-coordinate in source projection

Column<DOUBLE>

y_max_source

Maximum y-coordinate in source projection

Column<DOUBLE>

z_min_source

Minimum z-coordinate in source projection

Column<DOUBLE>

z_max_source

Maximum z-coordinate in source projection

Column<DOUBLE>

x_min_4326

Minimum x-coordinate in lon/lat degrees

Column<DOUBLE>

x_max_4326

Maximum x-coordinate in lon/lat degrees

Column<DOUBLE>

y_min_4326

Minimum y-coordinate in lon/lat degrees

Column<DOUBLE>

y_max_4326

Maximum y-coordinate in lon/lat degrees

Column<DOUBLE>

z_min_4326

Minimum z-coordinate in meters above sea level (AMSL)

Column<DOUBLE>

z_max_4326

Maximum z-coordinate in meters above sea level (AMSL)

Column<DOUBLE>

Example

SELECT
  file_name,
  num_points,
  specified_utm_zone,
  x_min_4326,
  x_max_4326,
  y_min_4326,
  y_max_4326
FROM
  TABLE(
    tf_point_cloud_metadata(
      path => '/home/todd/data/lidar/las_files/*2010_00000*.las'
    )
  )
ORDER BY
  file_name;
  
file_name|num_points|specified_utm_zone|x_min_4326|x_max_4326|y_min_4326|y_max_4326
ARRA-CA_GoldenGate_2010_000001.las|2063102|10|-122.9943066785969|-122.9772226614453|37.97913478250298|37.99265200734278
ARRA-CA_GoldenGate_2010_000002.las|4755131|10|-122.9943056338411|-122.9772184796481|37.99265416515848|38.00617135784082
ARRA-CA_GoldenGate_2010_000003.las|4833631|10|-122.9943045883859|-122.9772142950517|38.00617351665583|38.01969067717678
ARRA-CA_GoldenGate_2010_000004.las|6518715|10|-122.9943035422309|-122.9772101076538|38.01969283699149|38.03320996534712
ARRA-CA_GoldenGate_2010_000005.las|7508919|10|-122.9943024953755|-122.9772059174526|38.03321212616189|38.04672922234828
ARRA-CA_GoldenGate_2010_000006.las|7442130|10|-122.9943014478193|-122.977201724446|38.04673138416345|38.06024844817669
ARRA-CA_GoldenGate_2010_000007.las|5610772|10|-122.9943003995618|-122.9771975286321|38.06025061099263|38.07376764282882
ARRA-CA_GoldenGate_2010_000008.las|3515095|10|-122.9942993506024|-122.9771933300088|38.07376980664591|38.08728680630115
ARRA-CA_GoldenGate_2010_000009.las|1689283|10|-122.9942898783015|-122.9771554156435|38.19544116402802|38.20895787388029
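
The optional bounding-box arguments restrict metadata retrieval to files intersecting the given lon/lat extent; for example (path and coordinate values are illustrative only):

SELECT
  file_name,
  num_points
FROM
  TABLE(
    tf_point_cloud_metadata(
      path => '/home/todd/data/lidar/las_files/*.las',
      x_min => -123.0,
      x_max => -122.9,
      y_min => 37.9,
      y_max => 38.1
    )
  )
ORDER BY
  file_name;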

Reserved Words

Following is a list of HEAVY.AI keywords.

ABS
ACCESS
ADD
ALL
ALLOCATE
ALLOW
ALTER
AMMSC
AND
ANY
ARCHIVE
ARE
ARRAY_MAX_CARDINALITY
ARRAY
AS
ASC
ASENSITIVE
ASYMMETRIC
AT
ATOMIC
AUTHORIZATION
AVG
BEGIN
BEGIN_FRAME
BEGIN_PARTITION
BETWEEN
BIGINT
BINARY
BIT
BLOB
BOOLEAN
BOTH
BY
CALL
CALLED
CARDINALITY
CASCADED
CASE
CAST
CEIL
CEILING
CHAR
CHARACTER
CHARACTER_LENGTH
CHAR_LENGTH
CHECK
CLASSIFIER
CLOB
CLOSE
COALESCE
COLLATE
COLLECT
COLUMN
COMMIT
CONDITION
CONNECT
CONSTRAINT
CONTAINS
CONTINUE
CONVERT
COPY
CORR
CORRESPONDING
COUNT
COVAR_POP
COVAR_SAMP
CREATE
CROSS
CUBE
CUME_DIST
CURRENT
CURRENT_CATALOG
CURRENT_DATE
CURRENT_DEFAULT_TRANSFORM_GROUP
CURRENT_PATH
CURRENT_ROLE
CURRENT_ROW
CURRENT_SCHEMA
CURRENT_TIME
CURRENT_TIMESTAMP
CURRENT_TRANSFORM_GROUP_FOR_TYPE
CURRENT_USER
CURSOR
CYCLE
DASHBOARD
DATABASE
DATE
DATE_TRUNC
DATETIME
DAY
DEALLOCATE
DEC
DECIMAL
DECLARE
DEFAULT
DEFINE
DELETE
DENSE_RANK
DEREF
DESC
DESCRIBE
DETERMINISTIC
DISALLOW
DISCONNECT
DISTINCT
DOUBLE
DROP
DUMP
DYNAMIC
EACH
EDIT
EDITOR
ELEMENT
ELSE
EMPTY
END
END-EXEC
END_FRAME
END_PARTITION
EQUALS
ESCAPE
EVERY
EXCEPT
EXEC
EXECUTE
EXISTS
EXP
EXPLAIN
EXTEND
EXTERNAL
EXTRACT
FALSE
FETCH
FILTER
FIRST
FIRST_VALUE
FLOAT
FLOOR
FOR
FOREIGN
FOUND
FRAME_ROW
FREE
FROM
FULL
FUNCTION
FUSION
GEOGRAPHY 
GEOMETRY 
GET
GLOBAL
GRANT
GROUP
GROUPING
GROUPS
HAVING
HOLD
HOUR
IDENTITY
IF
ILIKE
IMPORT
IN
INDICATOR
INITIAL
INNER
INOUT
INSENSITIVE
INSERT
INT
INTEGER
INTERSECT
INTERSECTION
INTERVAL
INTO
IS
JOIN
LAG
LANGUAGE
LARGE
LAST_VALUE
LAST
LATERAL
LEAD
LEADING
LEFT
LENGTH
LIKE
LIKE_REGEX
LIMIT
LINESTRING 
LN
LOCAL
LOCALTIME
LOCALTIMESTAMP
LOWER
MATCH
MATCH_NUMBER
MATCH_RECOGNIZE
MATCHES
MAX
MEASURES
MEMBER
MERGE
METHOD
MIN
MINUS
MINUTE
MOD
MODIFIES
MODULE
MONTH
MULTIPOLYGON 
MULTISET
NATIONAL
NATURAL
NCHAR
NCLOB
NEW
NEXT
NO
NONE
NORMALIZE
NOT
NOW
NTH_VALUE
NTILE
NULL
NULLIF
NULLX
NUMERIC
OCCURRENCES_REGEX
OCTET_LENGTH
OF
OFFSET
OLD
OMIT
ON
ONE
ONLY
OPEN
OPTIMIZE
OPTION
OR
ORDER
OUT
OUTER
OVER
OVERLAPS
OVERLAY
PARAMETER
PARTITION
PATTERN
PER
PERCENT
PERCENT_RANK
PERCENTILE_CONT
PERCENTILE_DISC
PERIOD
PERMUTE
POINT 
POLYGON 
PORTION
POSITION
POSITION_REGEX
POWER
PRECEDES
PRECISION
PREPARE
PREV
PRIMARY
PRIVILEGES
PROCEDURE
PUBLIC
RANGE
RANK
READS
REAL
RECURSIVE
REF
REFERENCES
REFERENCING
REGR_AVGX
REGR_AVGY
REGR_COUNT
REGR_INTERCEPT
REGR_R2
REGR_SLOPE
REGR_SXX
REGR_SXY
REGR_SYY
RELEASE
RENAME
RESET
RESULT
RESTORE
RETURN
RETURNS
REVOKE
RIGHT
ROLE 
ROLLBACK
ROLLUP
ROW
ROW_NUMBER
ROWS
ROWID 
RUNNING
SAVEPOINT
SCHEMA
SCOPE
SCROLL
SEARCH
SECOND
SEEK
SELECT
SENSITIVE
SESSION_USER
SET
SHOW
SIMILAR
SKIP
SMALLINT
SOME
SPECIFIC
SPECIFICTYPE
SQL
SQLEXCEPTION
SQLSTATE
SQLWARNING
SQRT
START
STATIC
STDDEV_POP
STDDEV_SAMP
STREAM
SUBMULTISET
SUBSET
SUBSTRING
SUBSTRING_REGEX
SUCCEEDS
SUM
SYMMETRIC
SYSTEM
SYSTEM_TIME
SYSTEM_USER
TABLE
TABLESAMPLE
TEMPORARY
TEXT
THEN
TIME
TIMESTAMP
TIMEZONE_HOUR
TIMEZONE_MINUTE
TINYINT
TO
TRAILING
TRANSLATE
TRANSLATE_REGEX
TRANSLATION
TREAT
TRIGGER
TRIM
TRIM_ARRAY
TRUE
TRUNCATE
UESCAPE
UNION
UNIQUE
UNKNOWN
UNNEST
UPDATE
UPPER
UPSERT
USER
USING
VALUE
VALUE_OF
VALUES
VARBINARY
VARCHAR
VAR_POP
VAR_SAMP
VARYING
VERSIONING
VIEW
WHEN
WHENEVER
WHERE
WIDTH_BUCKET
WINDOW
WITH
WITHIN
WITHOUT
WORK
YEAR
SRID
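
These keywords generally cannot be used as bare identifiers. If an existing table has a column whose name matches a keyword, double-quoting the identifier in queries is one way to reference it; a minimal sketch (table and column names are hypothetical):

SELECT
  "year",
  COUNT(*) AS n
FROM
  my_imported_table
GROUP BY
  "year";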

tf_load_point_cloud Output Columns

x

Point x-coordinate

Column<DOUBLE>

y

Point y-coordinate

Column<DOUBLE>

z

Point z-coordinate

Column<DOUBLE>

intensity

Point intensity

Column<INT>

return_num

The ordered number of the return for a given LiDAR pulse. The first returns (lowest return numbers) are generally associated with the highest-elevation points for a LiDAR pulse, i.e. the forest canopy will generally have a lower return_num than the ground beneath it.

Column<TINYINT>

num_returns

The total number of returns for a LiDAR pulse. Multiple returns occur when there are multiple objects between the LiDAR source and the lowest ground or water elevation for a location.

Column<TINYINT>

scan_direction_flag

From the ASPRS LiDAR Data Exchange Format Standard: "The scan direction flag denotes the direction at which the scanner mirror was traveling at the time of the output pulse. A bit value of 1 is a positive scan direction, and a bit value of 0 is a negative scan direction."

Column<TINYINT>

edge_of_flight_line_flag

From the ASPRS LiDAR Data Exchange Format Standard: "The edge of flight line data bit has a value of 1 only when the point is at the end of a scan. It is the last point on a given scan line before it changes direction."

Column<TINYINT>

classification

From the ASPRS LiDAR Data Exchange Format Standard: "The classification field is a number to signify a given classification during filter processing. The ASPRS standard has a public list of classifications which shall be used when mixing vendor specific user software."

Column<SMALLINT>

scan_angle_rank

From the ASPRS LiDAR Data Exchange Format Standard: "The angle at which the laser point was output from the laser system, including the roll of the aircraft... The scan angle is an angle based on 0 degrees being NADIR, and –90 degrees to the left side of the aircraft in the direction of flight."

Column<TINYINT>

tf_feature_similarity Input Arguments

primary_key

Column containing keys/entity IDs that can be used to uniquely identify the entities for which the function will compute the similarity to the search vector specified by the comparison_features cursor. Examples include countries, census block groups, user IDs of website visitors, and aircraft call signs.

Column<TEXT ENCODING DICT | INT | BIGINT>

pivot_features

One or more columns constituting a compound feature. For example, two columns of visit hour and census block group would compare entities specified by primary_key based on whether they visited the same census block group in the same hour. If a single census block group feature column is used, the primary_key entities are compared only by the census block groups visited, regardless of time overlap.

Column<TEXT ENCODING DICT | INT | BIGINT>

metric

Column denoting the values used as input for the cosine similarity metric computation. In many cases, this is simply COUNT(*) such that feature overlaps are weighted by the number of co-occurrences.

Column<INT | BIGINT | FLOAT | DOUBLE>

comparison_pivot_features

One or more columns constituting a compound feature for the search vector. These should match the pivot_features columns in number of sub-features, types, and semantics.

Column<TEXT ENCODING DICT | INT | BIGINT>

comparison_metric

Column denoting the values used as input for the cosine similarity metric computation from the search vector. In many cases, this is simply COUNT(*) such that feature overlaps are weighted by the number of co-occurrences.

Column<INT | BIGINT | FLOAT | DOUBLE>

use_tf_idf

Boolean constant denoting whether TF-IDF weighting should be used in the cosine similarity score computation.

BOOLEAN
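
The arguments above belong to tf_feature_similarity, which scores each primary_key entity against a single search vector built from the comparison features. A minimal sketch, reusing the flights_2008 table from the earlier examples and assuming the cursor arguments are named primary_features and comparison_features and that the output includes a similarity_score column analogous to tf_feature_self_similarity (verify against your function signature):

select
  *
from
  table(
    tf_feature_similarity(
      primary_features => cursor(
        select
          carrier_name,
          origin,
          count(*) as num_flights
        from
          flights_2008
        group by
          carrier_name,
          origin
      ),
      comparison_features => cursor(
        select
          origin,
          count(*) as num_flights
        from
          flights_2008
        where
          carrier_name = 'Southwest Airlines'
        group by
          origin
      ),
      use_tf_idf => false
    )
  )
order by
  similarity_score desc
limit
  10;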

tf_feature_self_similarity Input Arguments

primary_key

Column containing keys/entity IDs that can be used to uniquely identify the entities for which the function computes co-similarity. Examples include countries, census block groups, user IDs of website visitors, and aircraft callsigns.

Column<TEXT ENCODING DICT | INT | BIGINT>

pivot_features

One or more columns constituting a compound feature. For example, two columns of visit hour and census block group would compare entities specified by primary_key based on whether they visited the same census block group in the same hour. If a single census block group feature column is used, the primary_key entities would be compared only by the census block groups visited, regardless of time overlap.

Column<TEXT ENCODING DICT | INT | BIGINT>

metric

Column denoting the values used as input for the cosine similarity metric computation. In many cases, this is COUNT(*) such that feature overlaps are weighted by the number of co-occurrences.

Column<INT | BIGINT | FLOAT | DOUBLE>

use_tf_idf

Boolean constant denoting whether TF-IDF weighting should be used in the cosine similarity score computation.

BOOLEAN

Window Functions

Window functions allow you to work with a subset of rows related to the currently selected row. For a given dimension, you can find the most associated dimension by some other measure (for example, number of records or sum of revenue).

Window functions must always contain an OVER clause. The OVER clause splits up the rows of the query for processing by the window function.

The PARTITION BY list divides the rows into groups that share the same values of the PARTITION BY expression(s). For each row, the window function is computed using all rows in the same partition as the current row.

Rows that have the same value in the ORDER BY clause are considered peers. The ranking functions give the same answer for any two peer rows.

Supported Window Functions

Function

Description

BACKWARD_FILL(value)

Replace the null value by using the nearest non-null value of the value column, using backward search.

For example, for column x, with the current row r at the index K having a NULL value, and assuming column x has N rows (where K < N):

BACKWARD_FILL(x) searches for the non-NULL value by searching rows with the index starting from K+1 to N. The NULL value is replaced with the first non-NULL value found.

At least one ordering column must be defined in the window clause.

NULLS FIRST ordering of the input value is added automatically for any user-defined ordering of the input value. For example:

BACKWARD_FILL(x) OVER (PARTITION BY c ORDER BY x) - No ordering is added; ordering already exists on x. BACKWARD_FILL(x) OVER (PARTITION BY c ORDER BY o) - Ordering is added internally for a consistent query result.

CONDITIONAL_CHANGE_EVENT(expr)

For each partition, a zero-initialized counter is incremented every time the result of expr changes as the expression is evaluated over the partition. Requires an ORDER BY clause for the window.

COUNT_IF(condition_expr)

Aggregate function that can be used as a window function for both a nonframed window partition and a window frame. Returns the number of rows satisfying the given condition_expr, which must evaluate to a Boolean value (TRUE/FALSE) like x IS NULL or x > 1.

CUME_DIST()

Cumulative distribution value of the current row: (number of rows preceding or peers of the current row)/(total rows). Window framing is ignored.

DENSE_RANK()

Rank of the current row without gaps. This function counts peer groups. Window framing is ignored.

FIRST_VALUE(value)

Returns the value from the first row of the window frame (the rows from the start of the partition to the last peer of the current row).

FORWARD_FILL(value)

Replace the null value by using the nearest non-null value of the value column, using forward search. For example, for column x, with the current row r at the index K having a NULL value, and assuming column x has N rows (where K < N): FORWARD_FILL(x) searches for the non-NULL value by searching rows with the index starting from K-1 to 1. The NULL value is replaced with the first non-NULL value found. At least one ordering column must be defined in the window clause.

NULLS FIRST ordering of the input value is added automatically for any user-defined ordering of the input value. For example: FORWARD_FILL(x) OVER (PARTITION BY c ORDER BY x) - No ordering is added; ordering already exists on x. FORWARD_FILL(x) OVER (PARTITION BY c ORDER BY o) - Ordering is added internally for a consistent query result.

LAG(value, offset)

Returns the value at the row that is offset rows before the current row within the partition. LAG_IN_FRAME is the window-frame-aware version.

LAST_VALUE(value)

Returns the value from the last row of the window frame.

LEAD(value, offset)

Returns the value at the row that is offset rows after the current row within the partition. LEAD_IN_FRAME is the window-frame-aware version.

NTH_VALUE(expr,N)

Returns a value of expr at row N of the window partition.

NTILE(num_buckets)

Subdivide the partition into buckets. If the total number of rows is divisible by num_buckets, each bucket has an equal number of rows. If the total is not divisible by num_buckets, the buckets have two sizes that differ by one row. Window framing is ignored.

PERCENT_RANK()

Relative rank of the current row: (rank-1)/(total rows-1). Window framing is ignored.

RANK()

Rank of the current row with gaps. Equal to the row_number of its first peer.

ROW_NUMBER()

Number of the current row within the partition, counting from 1. Window framing is ignored.

SUM_IF(condition_expr)

Aggregate function that can be used as a window function for both a nonframed window partition and a window frame. Returns the sum of all expression values satisfying the given condition_expr. Applies to numeric data types.

HeavyDB supports the aggregate functions AVG, MIN, MAX, SUM, and COUNT in window functions.

Updates on window functions are supported, assuming the target table is single-fragment. Updates on multi-fragment target tables are not currently supported.

Example

This query shows the top airline carrier for each state, based on the number of departures.

select origin_state, carrier_name, n 
   from (select origin_state, carrier_name, row_number() over(
      partition by origin_state order by n desc) as rownum, n 
         from (select origin_state, carrier_name, count(*) as n 
            from flights_2008_7M where extract(year 
               from dep_timestamp) = 2008 
   group by origin_state, carrier_name )) where rownum = 1
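
As a further illustration, the following sketch assumes a hypothetical sensor_readings table with device_id, ts, and reading columns; it fills NULL readings in both directions within each device and counts the NULL readings per device with COUNT_IF:

select
  device_id,
  ts,
  reading,
  forward_fill(reading) over (partition by device_id order by ts) as reading_ffill,
  backward_fill(reading) over (partition by device_id order by ts) as reading_bfill,
  count_if(reading is null) over (partition by device_id) as null_readings_per_device
from
  sensor_readings;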

Window Frames

A window function can include a frame clause that specifies a set of neighboring rows of the current row belonging to the same partition. This allows us to compute a window aggregate function over the window frame, instead of computing it against the entire partition. Note that a window frame for the current row is computed based on either 1) the number of rows before or after the current row (called rows mode) or 2) the specified ordering column value in the frame clause (called range mode).

For example:

  • From the starting row of the partition to the current row: Using the sum aggregate function, you can compute the running sum of the partition.

  • You can construct a frame based on the position of the rows (called rows mode): For example, a frame spanning from 3 rows before to 2 rows after the current row:

    • You can compute the aggregate function of the frame having up to six rows (including the current row).

  • You can organize a frame based on the value of the ordering column (called range mode): Assuming C as the current ordering column value, we can compute aggregate value of the window frame which contains rows having ordering column values between (C - 3) and (C + 2).

Window functions that ignore the frame are evaluated on the entire partition.

Note that we can define the window frame clause using rows mode with an ordering column.

You can use the following aggregate functions with the window frame clause.

Supported Functions

Category
Supported Functions

Frame aggregation

MIN(val), MAX(val), COUNT(val), AVG(val), SUM(val)

Frame navigation

LEAD_IN_FRAME(value, offset)

LAG_IN_FRAME(value, offset)

FIRST_VALUE_IN_FRAME

LAST_VALUE_IN_FRAME

NTH_VALUE_IN_FRAME

These are window-frame-aware versions of the LAG, LEAD, FIRST_VALUE, LAST_VALUE, and NTH_VALUE functions.

Syntax

<frame_mode> <frame_bound>

<frame_mode> can be one of the following:

  • rows

  • range

Example

1 | 2 | 3 | 4 | 5.5 | 7.5 | 8 | 9 | 10 → values of each tuple’s ORDER BY expression.

When the current row has a value 5.5:

  • ROWS BETWEEN 3 PRECEDING and 3 FOLLOWING : 3 rows before and 3 rows after → {2, 3, 4, 5.5, 7.5, 8, 9 }

  • RANGE BETWEEN 3 PRECEDING and 3 FOLLOWING: 5.5 - 3 <= x <= 5.5 + 3 → { 3, 4, 5.5, 7.5, 8 }

<frame_bound>:

  • frame_start or

  • frame_between: between frame_start and frame_end

frame_start and frame_end can be one of the following:

  • UNBOUNDED PRECEDING: The start row of the partition that the current row belongs to.

  • UNBOUNDED FOLLOWING: The end row of the partition that the current row belongs to.

  • CURRENT ROW

    • For rows mode: the current row.

    • For range mode: the peers of the current row. A peer is a row having the same value as the ordering column expression of the current row. Note that all null values are peers of each other.

  • expr PRECEDING

    • For rows mode: expr row before the current row.

    • For range mode: rows with the current row’s ordering expression value minus expr.

    • For DATE, TIME, and TIMESTAMP: Use the INTERVAL keyword with a specific time unit, depending on a data type:

      • TIMESTAMP type: NANOSECOND, MICROSECOND, MILLISECOND, SECOND, MINUTE, HOUR, DAY, MONTH, and YEAR

      • TIME type: SECOND, MINUTE, and HOUR

      • DATE type: DAY, MONTH, and YEAR

        For example: RANGE BETWEEN INTERVAL 1 DAY PRECEDING and INTERVAL 3 DAY FOLLOWING

    • Currently, only literal expressions as expr such as 1 PRECEDING and 100 PRECEDING are supported.

  • expr FOLLOWING

    • For rows mode: expr row after the current row.

    • For range mode: rows with the current row’s ordering expression value plus expr.

    • For DATE, TIME, and TIMESTAMP: Use the INTERVAL keyword with a specific time unit, depending on a data type:

      • TIMESTAMP type: NANOSECOND, MICROSECOND, MILLISECOND, SECOND, MINUTE, HOUR, DAY, MONTH, and YEAR

      • TIME type: SECOND, MINUTE, and HOUR

      • DATE type: DAY, MONTH, and YEAR

        For example: RANGE BETWEEN INTERVAL 1 DAY PRECEDING and INTERVAL 3 DAY FOLLOWING

    • Currently, only literal expressions as expr, such as 1 FOLLOWING and 100 FOLLOWING, are supported.

UNBOUNDED PRECEDING and UNBOUNDED FOLLOWING have the same meaning in both rows and range mode.

When the query has no window frame bound, the window aggregate function is computed differently depending on the existence of the ORDER BY clause:

  • Has ORDER BY clause: The window function is computed with the default frame bound, which is RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW.

  • No ORDER BY clause: The window function is computed over the entire partition.
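
Putting the frame bounds above together, the following sketch (assuming a hypothetical daily_sales table with store_id, sale_date, and amount columns) computes a rows-mode running total and a range-mode trailing seven-day sum:

select
  store_id,
  sale_date,
  amount,
  sum(amount) over (
    partition by store_id
    order by sale_date
    rows between unbounded preceding and current row
  ) as running_total,
  sum(amount) over (
    partition by store_id
    order by sale_date
    range between interval 6 day preceding and current row
  ) as trailing_7_day_sum
from
  daily_sales;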

Named Window Function Clause

You can refer to the same window clause in multiple window aggregate functions by defining it with a unique name in the query definition.

For example, you can define the named window clauses W1 and W2 as follows:

select min(x) over w1, max(x) over w2 from test window w1 as (order by y), 
  w2 as (partition by y order by z rows between 2 preceding and 2 following);

Named window function clause w1 refers to a window function clause without a window frame clause, and w2 refers to a named window frame clause.

Notes and Restrictions

  • To use window framing, you may need an ORDER BY clause in the window definition. Depending on the framing mode used, the constraint varies:

    • Rows mode: an ordering column is not required, and multiple ordering columns can be used.

    • Range mode: exactly one ordering column is required (multi-column ordering is not supported).

  • Currently, all window functions, including aggregation over a window frame, are computed in CPU mode.

  • For window frame bound expressions, only non-negative integer literals are supported.

  • GROUPING mode and EXCLUDING are not currently supported.

tf_raster_graph_shortest_slope_weighted_path

Aggregate point data into x/y bins of a given size in meters to form a dense spatial grid, computing the specified aggregate (using agg_type) across all points in each bin as the output value for the bin. A Gaussian average is then taken over the neighboring bins, with the number of bins specified by neighborhood_fill_radius, optionally only filling in null-valued bins if fill_only_nulls is set to true.

The graph shortest path is then computed between an origin point on the grid specified by origin_x and origin_y and a destination point on the grid specified by destination_x and destination_y, where the shortest path is weighted by the nth exponent of the computed slope between a bin and its neighbors, with the nth exponent being specified by slope_weighted_exponent. A max allowed traversable slope can be specified by slope_pct_max, such that no traversal is considered or allowed between bins with absolute computed slopes greater than the percentage specified by slope_pct_max.

SELECT * FROM TABLE(
    tf_raster_graph_shortest_slope_weighted_path(
        raster => CURSOR(
            SELECT x, y, z FROM table
        ),
        agg_type => <'AVG'|'COUNT'|'SUM'|'MIN'|'MAX'>,
        bin_dim => <meters>,
        geographic_coords => <true/false>,
        neighborhood_fill_radius => <num bins>,
        fill_only_nulls => <true/false>,
        origin_x => <origin x coordinate>,
        origin_y => <origin y coordinate>,
        destination_x => <destination x coordinate>,
        destination_y => <destination y coordinate>,
        slope_weighted_exponent => <exponent>,
        slope_pct_max => <max pct slope>
    )
)

Input Arguments

Parameter
Description
Data Types

x

Input x-coordinate column or expression of the data to be rasterized.

Column <FLOAT | DOUBLE>

y

Input y-coordinate column or expression of the data to be rasterized.

Column <FLOAT | DOUBLE> (must be the same type as x)

z

Input z-coordinate column or expression of the data to be rasterized.

Column <FLOAT | DOUBLE>

agg_type

The aggregate to be performed to compute the output z-column. Should be one of 'AVG', 'COUNT', 'SUM', 'MIN', or 'MAX'.

TEXT ENCODING NONE

bin_dim

The width and height of each x/y bin. If geographic_coords is true, the input x/y units are translated to meters according to a local coordinate transform appropriate for the x/y bounds of the data.

DOUBLE

geographic_coords

If true, specifies that the input x/y coordinates are in lon/lat degrees. The function will then compute a mapping of degrees to meters based on the center coordinate between x_min/x_max and y_min/y_max.

BOOLEAN

neighborhood_fill_radius

The radius in bins to compute the gaussian blur/filter over, such that each output bin will be the average value of all bins within neighborhood_fill_radius bins.

BIGINT

fill_only_nulls

Specifies that the gaussian blur should only be used to provide output values for null output bins (i.e. bins that contained no data points or had only data points with null Z-values).

BOOLEAN

origin_x

The x-coordinate for the starting point for the graph traversal, in input (not bin) units.

DOUBLE

origin_y

The y-coordinate for the starting point for the graph traversal, in input (not bin) units.

DOUBLE

destination_x

The x-coordinate for the destination point for the graph traversal, in input (not bin) units.

DOUBLE

destination_y

The y-coordinate for the destination point for the graph traversal, in input (not bin) units.

DOUBLE

slope_weighted_exponent

The slope weight between neighboring raster cells will be weighted by the slope_weighted_exponent power. A value of 1 signifies that the raw slopes between neighboring cells should be used, increasing this value from 1 will more heavily penalize paths that traverse steep slopes.

DOUBLE

slope_pct_max

The max absolute value of slopes (measured in percentages) between neighboring raster cells that will be considered for traversal. A neighboring graph cell with an absolute slope greater than this amount will not be considered in the shortest slope-weighted path graph traversal

DOUBLE

Output Columns

/* Compute the shortest slope-weighted path over a 30m Copernicus
Digital Elevation Model (DEM) input raster comprising the area around Mt. Everest,
from the plains of Nepal to the peak */

create table mt_everest_climb as
select
  path_step,
  st_setsrid(st_point(x, y), 4326) as path_pt
from
  table(
    tf_raster_graph_shortest_slope_weighted_path(
      raster => cursor(
        select
          st_x(raster_point),
          st_y(raster_point),
          z
        from
          copernicus_30m_mt_everest
      ),
      agg_type => 'AVG',
      bin_dim => 30,
      geographic_coords => TRUE,
      neighborhood_fill_radius => 1,
      fill_only_nulls => FALSE,
      origin_x => 86.01,
      origin_y => 27.01,
      destination_x => 86.9250,
      destination_y => 27.9881,
      slope_weighted_exponent => 4,
      slope_pct_max => 50
    )
  );

tf_raster_contour_lines; tf_raster_contour_polygons

Process a raster input to derive contour lines or regions and output as LINESTRING or POLYGON for rendering or further processing. Each has two variants:

  • A rasterizing variant that re-rasterizes the input points.

  • A direct variant that accepts raw raster points directly.

Use the rasterizing variants if the raster table rows are not already sorted in row-major order (for example, if they represent an arbitrary 2D point cloud), or if filtering or binning is required to reduce the input data to a manageable count (to speed up the contour processing) or to smooth the input data before contour processing. If the input rows do not already form a rectilinear region, the output region will be their 2D bounding box. Many of the parameters of the rasterizing variant are directly equivalent to those of tf_geo_rasterize; see that function for details.

The direct variants require that the input rows represent a rectilinear region of pixels in nonsparse row-major order. The dimensions must also be provided, and (raster_width * raster_height) must match the input row count. The contour processing is then performed directly on the raster values with no preprocessing.

The line variants generate LINESTRING geometries that represent the contour lines of the raster space at the given interval with the optional given offset. For example, a raster space representing a height field with a range of 0.0 to 1000.0 will likely result in 10 or 11 lines, each with a corresponding contour_values value, 0.0, 100.0, 200.0 etc. If contour_offset is set to 50.0, then the lines are generated at 50.0, 150.0, 250.0, and so on. The lines can be open or closed and can form rings or terminate at the edges of the raster space.

The polygon variants generate POLYGON geometries that represent regions between contour lines (for example from 0.0 to 100.0), and from 100.0 to 200.0. If the raster space has multiple regions with that value range, then a POLYGON row is output for each of those regions. The corresponding contour_values value for each is the lower bound of the range for that region.

Rasterizing Variant

SELECT
  contour_[lines|polygons],
  contour_values
FROM TABLE(
  tf_raster_contour_[lines|polygons](
    raster => CURSOR(
      <lon>,
      <lat>,
      <value>
    ),
    agg_type => '<agg_type>',
    bin_dim_meters => <bin_dim_meters>,
    neighborhood_fill_radius => <neighborhood_fill_radius>,
    fill_only_nulls => <fill_only_nulls>,
    fill_agg_type => '<fill_agg_type>',
    flip_latitude => <flip_latitude>,
    contour_interval => <contour_interval>,
    contour_offset => <contour_offset>
  )
);

Direct Variant

SELECT
  contour_[lines|polygons],
  contour_values
FROM TABLE(
  tf_raster_contour_[lines|polygons](
    raster => CURSOR(
      <lon>,
      <lat>,
      <value>
    ),
    raster_width => <raster_width>,
    raster_height => <raster_height>,
    flip_latitude => <flip_latitude>,
    contour_interval => <contour_interval>,
    contour_offset => <contour_offset>
  )
);

Input Arguments

Parameter
Description
Data Types

lon

Longitude value of raster point (degrees, SRID 4326).

Column<FLOAT | DOUBLE>

lat

Latitude value of raster point (degrees, SRID 4326).

Column<FLOAT | DOUBLE> (must be the same as <lon>)

value

Raster band value from which to derive contours.

Column<FLOAT | DOUBLE>

agg_type

See tf_geo_rasterize.

bin_dim_meters

See tf_geo_rasterize.

neighborhood_fill_radius

See tf_geo_rasterize.

fill_only_nulls

See tf_geo_rasterize.

fill_agg_type

See tf_geo_rasterize.

flip_latitude

Optionally flip resulting geometries in latitude (default FALSE).

(This parameter may be removed in future releases)

BOOLEAN

contour_interval

Desired contour interval. The function will generate a line at each interval, or a polygon region that covers that interval.

FLOAT/DOUBLE (must be same type as value)

contour_offset

Optional offset for resulting intervals.

FLOAT/DOUBLE (must be same type as value)

raster_width

Pixel width (stride) of the raster data.

INTEGER

raster_height

Pixel height of the raster data.

INTEGER

Output Columns

Name
Description
Data Types

contour_[lines|polygons]

Output geometries.

Column<LINESTRING | POLYGON>

contour_values

Raster values associated with each contour geometry.

Column<FLOAT | DOUBLE> (will be the same type as value)

Examples

SELECT
  contour_lines,
  contour_values
FROM TABLE(
  tf_raster_contour_lines(
    raster => CURSOR(
      SELECT
        lon,
        lat,
        elevation
      FROM
        elevation_table
    ),
    agg_type => 'AVG',
    bin_dim_meters => 10.0,
    neighborhood_fill_radius => 0,
    fill_only_nulls => FALSE,
    fill_agg_type => 'AVG',
    flip_latitude => FALSE,
    contour_interval => 100.0,
    contour_offset => 0.0
  )
);
SELECT
  contour_polygons,
  contour_values
FROM TABLE(
  tf_raster_contour_polygons(
    raster => CURSOR(
      SELECT
        lon,
        lat,
        elevation
      FROM
        elevation_table
    ),
    raster_width => 1024,
    raster_height => 1024,
    flip_latitude => FALSE,
    contour_interval => 100.0,
    contour_offset => 0.0
  )
);

Control Panel

The Control Panel gives super users visibility into roles and users of the current database, as well as feature flags, system table dashboards, and log files for the current HeavyDB instance.

To open the Control Panel, click the Account icon and then click Control Panel.

The Control Panel is considered beta functionality. Currently, you cannot add, delete, or edit roles or users in the Control Panel. Feature flags cannot be modified through the Control Panel.

To access the Control Panel, users must have superuser privileges or be assigned the immerse_control_panel role.
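
For example, a superuser could create and assign the role using standard HeavyDB role DDL (the user name jdoe is hypothetical):

CREATE ROLE immerse_control_panel;
GRANT immerse_control_panel TO jdoe;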

Feature Flags

To see which feature flags are currently set in Immerse, click Feature Flags under Customization.

Currently, feature flags can only be viewed in Immerse; they cannot be set or removed.

System Dashboard and Log Files

Links to the following System Table dashboards are available on the Control Panel:

  • System Resources

  • Request Logs and Monitoring

  • User Roles and Permissions

Links to the following log files are available on the Control Panel:

  • Access Logs (Web Server)

  • All Logs (Web Server)

  • Error Logs (HeavyDB)

  • Info Logs (HeavyDB)

  • Warning Logs (HeavyDB)

Admin Portal

The Admin Portal is a collection of dashboards available in the included information_schema database in Heavy Immerse. The dashboards display point-in-time information of the HEAVY.AI platform resources and users of the system.

Access to system dashboards is controlled using Immerse privileges; only users with Admin privileges or users/roles with access to the information_schema database can access the system dashboards.

The information_schema database and Admin Portal dashboards and system tables are installed when you install or upgrade to HEAVY.AI 6.0. For more detailed information on the tables available in the Admin Portal, see System Tables.

With the Admin Portal, you can see:

  • Database monitoring and database and web server logs.

  • Real-time data reporting for the system.

  • Point-in-time resource metrics and user engagement dashboards.

When you log in to the information_schema database, you see the Request Logs and Monitoring, System Resources, and User Roles and Permissions dashboards.

Request Logs and Monitoring

By default, the Request Logs and Monitoring dashboard does not appear in the Admin portal. To turn on the dashboard, set the enable-logs-system-tables parameter to TRUE in heavy.conf and restart the database.

The Request Logs and Monitoring dashboard includes the following charts on three tabs:

  • Number of Requests

  • Number of Fatals and Errors

  • Number of Unique Users

  • Avg Request Time (ms)

  • Max Request Time (ms)

  • Number of Requests per Dashboard

  • Number of Requests per API

  • Number of Requests per User

  • Database Server Logs - Sortable by log timestamp, severity level, message, file location, process ID, query ID, thread ID, and node.

  • Database Queries - Sortable by log timestamp, query string, execution time, and total time.

  • Web Server Logs - Sortable by log timestamp, severity, and message.

  • Web Server Access Logs - Sortable by log timestamp, endpoint, HTTP status, HTTP method, IP address, and response size.

System Resources Dashboard

The System Resources dashboard includes the following charts on three tabs:

  • Databases - Names of all available databases

  • # of Tables - Total number of tables

  • # of Dashboards - Total number of dashboards

  • # of Tables Per Database

  • # of Dashboards Per Database

  • Tables - Sortable name, column count, and owner information for all tables.

  • Dashboards - Sortable name, last update time, and owner information for all dashboards.

  • CPU Memory Utilization - Free, used, and unallocated

  • GPU Memory Utilization - Free, used, and unallocated

  • Tables with Highest CPU Memory Utilization

  • Tables with Highest GPU Memory Utilization

  • Columns with Highest CPU Memory Utilization

  • Columns with Highest GPU Memory Utilization

  • Tables with Highest Storage Utilization

  • Total Used Storage

User Roles and Permissions Dashboard

The User Roles and Permission Dashboard includes the following charts:

  • # of Users - Total number of users on the system

  • # of Roles - Total number of roles on the system

  • # of Table Owners - Total number of table owners on the system

  • # of Dashboard Owners - Total number of dashboard owners on the system

  • Users - Sortable list of users on the system

  • User-Role Assignments - Mapping of role names to user names, sortable by role or user

  • Roles - Sortable list of roles on the system

  • Databases - Sortable list of databases on the system

  • User Permissions - Mapping of user or role name, permission type, and database, sortable by any column.

Introduction to Heavy Immerse

Heavy Immerse is a browser-based data visualization client that runs on top of the GPU-powered HeavyDB. It provides instantaneous representations of your data, from basic charts to rich and complex visualizations.

Immerse is installed with HEAVY.AI Enterprise Edition.

To create dashboards and data visualizations, click DASHBOARDS. You can search for dashboards, and list them by most recent or alphabetically.

Click DATA to import and manipulate data.

Click SQL EDITOR to perform Data Definition and Data Manipulation tasks on the command line.

When you navigate between the three utilities, you can:

  • Hold the command (ctrl) key as you click a link to open the utility in a new tab/window in the background.

  • Hold shift+command (ctrl) as you click a link to open the utility in a new tab/window in the foreground.

  • Hold no keys as you click a link to replace the contents of the current window.

HELP CENTER provides access to Immerse version information, tutorials, demos, and documentation. It also includes a link for sending email to HEAVY.AI.

Clicking the user icon at the far right opens a drop-down box where you can select a different database, change your UI theme, or log out of Immerse:

HEAVY.AI Installation on Ubuntu

This is an end-to-end recipe for installing HEAVY.AI on an Ubuntu 20.04 machine using CPU and GPU devices.

The order of these instructions is significant. To avoid problems, install each component in the order presented.

Assumptions

These instructions assume the following:

  • You are installing on a “clean” Ubuntu 20.04 host machine with only the operating system installed.

  • Your HEAVY.AI host only runs the daemons and services required to support HEAVY.AI.

  • Your HEAVY.AI host is connected to the Internet.

Preparation

Prepare your Ubuntu machine by updating your system, creating the HEAVY.AI user (named heavyai), installing kernel headers, installing CUDA drivers, and optionally enabling the firewall.

Update and Reboot

  1. Update the entire system:

2. Install the utilities needed to create Heavy.ai repositories and download archives:

3. Install the headless JDK and the utility apt-transport-https:

4. Reboot to activate the latest kernel:

Create the HEAVY.AI User

Create a group called heavyai and a user named heavyai, who will be the owner of the HEAVY.AI software and data on the filesystem.

  1. Create the group, user, and home directory using the useradd command with the --user-group and --create-home switches.

2. Set a password for the user:

3. Log in with the newly created user:

Installation

Install HEAVY.AI using either APT or a tarball.

Installation using the APT package manager is recommended for those who want a more automated install and upgrade procedure.

Install NVIDIA Drivers ᴳᴾᵁ ᴼᴾᵀᴵᴼᴺ

If your system uses NVIDIA GPUs but the drivers are not installed, install them now. See Install NVIDIA Drivers and Vulkan on Ubuntu for details.

Installing with APT

Download and add a GPG key to APT.

Add an APT source depending on the edition (Enterprise, Free, or Open Source) and execution device (GPU or CPU) you are going to use.

Use apt to install the latest version of HEAVY.AI.

If you need to install a specific version of HEAVY.AI, for example because you are upgrading from OmniSci, run the following command:

Installing with a Tarball

First create the installation directory.

Download the archive and install the software. A different archive is downloaded depending on the Edition (Enterprise, Free, or Open Source) and the device used for runtime (GPU or CPU).

Configuration

Follow these steps to prepare your HEAVY.AI environment.

Set Environment Variables

For convenience, you can update .bashrc with these environment variables.

Although this step is optional, you will find references to the HEAVYAI_BASE and HEAVYAI_PATH variables. These variables contain, respectively, the path where configuration, license, and data files are stored and the path where the software is installed. Setting them is strongly recommended.

Initialization

Run the systemd installer to create heavyai services, a minimal config file, and initialize the data storage.

Accept the default values provided or make changes as needed.

The script creates a data directory in $HEAVYAI_BASE/storage (default /var/lib/heavyai/storage) with the directories catalogs, data, export, and log. The import directory is created the first time you insert data. If you are the HEAVY.AI administrator, the log directory is of particular interest.

Activation

Start and use HeavyDB and Heavy Immerse.

Heavy Immerse is not available in the OSS Edition, so if you are running the OSS Edition, the systemctl commands for heavy_web_server have no effect.

Enable the automatic startup of the service at reboot and start the HEAVY.AI services.

Configure Firewall ᴼᴾᵀᴵᴼᴺᴬᴸ

If a firewall is not already installed and you want to harden your system, install the ufw package.

To use Heavy Immerse or other third-party tools, you must prepare your host machine to accept incoming HTTP(S) connections. Configure your firewall for external access; for more information, see https://help.ubuntu.com/lts/serverguide/firewall.html.

Most cloud providers use a different mechanism for firewall configuration. The commands above might not run in cloud deployments.

Licensing HEAVY.AI ᵉᵉ⁻ᶠʳᵉᵉ ᵒⁿˡʸ

If you are using Enterprise or Free Edition, you need to validate your HEAVY.AI instance with your license key. Skip this section if you are on Open Source Edition.

  1. Copy your license key of Enterprise or Free Edition from the registration email message. If you do not have a license and you want to evaluate HEAVY.AI in an unlimited enterprise environment, contact your Sales Representative or register for your 30-day trial of Enterprise Edition. If you need a Free License, you can get one.

  2. Connect to Heavy Immerse using a web browser connected to your host machine on port 6273. For example, http://heavyai.mycompany.com:6273.

  3. When prompted, paste your license key in the text box and click Apply.

  4. Log into Heavy Immerse by entering the default username (admin) and password (HyperInteractive), and then click Connect.


Final Checks

To verify that everything is working, load some sample data, perform a heavysql query, and generate a Pointmap using Heavy Immerse.

Load Sample Data and Run a Simple Query

HEAVY.AI ships with two sample datasets of airline flight information collected in 2008, and a census of New York City trees. To install sample data, run the following command.

Connect to HeavyDB by entering the following command in a terminal on the host machine (default password is HyperInteractive):

Enter a SQL query such as the following:

The results should be similar to the results below.

Create a Dashboard Using Heavy Immerse ᵉᵉ⁻ᶠʳᵉᵉ ᵒⁿˡʸ

After installing Enterprise or Free Edition, check if Heavy Immerse is running as intended.

  1. Connect to Heavy Immerse using a web browser connected to your host machine on port 6273. For example, http://heavyai.mycompany.com:6273.

  2. Log into Heavy Immerse by entering the default username (admin) and password (HyperInteractive), and then click Connect.

Create a new dashboard and a Scatter Plot to verify that backend rendering is working.

  1. Click New Dashboard.

  2. Click Add Chart.

  3. Click SCATTER.

  4. Click Add Data Source.

  5. Choose the flights_2008_10k table as the data source.

  6. Click X Axis +Add Measure.

  7. Choose depdelay.

  8. Click Y Axis +Add Measure.

  9. Choose arrdelay.

  10. Click Size +Add Measure.

  11. Choose airtime.

  12. Click Color +Add Measure.

  13. Choose dest_state.

The resulting chart shows, unsurprisingly, that there is a correlation between departure delay and arrival delay.

Create a new dashboard and a Bubble chart to verify that Heavy Immerse is working.

  1. Click New Dashboard.

  2. Click Add Chart.

  3. Click Bubble.

  4. Click Select Data Source.

  5. Choose the flights_2008_10k table as the data source.

  6. Click Add Dimension.

  7. Choose carrier_name.

  8. Click Add Measure.

  9. Choose depdelay.

  10. Click Add Measure.

  11. Choose arrdelay.

  12. Click Add Measure.

  13. Choose #Records.

The resulting chart shows, unsurprisingly, that the average departure delay is also correlated with the average arrival delay, while there is quite a difference between carriers.

¹ In the OS Edition, Heavy Immerse is unavailable.

² The OS Edition does not require a license key.

From the : "The scan direction flag denotes the direction at which the scanner mirror was traveling at the time of the output pulse. A bit value of 1 is a positive scan direction, and a bit value of 0 is a negative scan direction."

From the : "The edge of flight line data bit has a value of 1 only when the point is at the end of a scan. It is the last point on a given scan line before it changes direction."

From the : "The classification field is a number to signify a given classification during filter processing. The ASPRS standard has a public list of classifications which shall be used when mixing vendor specific user software."

From the : "The angle at which the laser point was output from the laser system, including the roll of the aircraft... The scan angle is an angle based on 0 degrees being NADIR, and –90 degrees to the left side of the aircraft in the direction of flight."

Boolean constant denoting whether weighting should be used in the cosine similarity score computation.

Boolean constant denoting whether weighting should be used in the cosine similarity score computation.

Returns the value at the row that is offset rows before the current row within the partition. is the window-frame-aware version.

Returns the value at the row that is offset rows after the current row within the partition. is the window-frame-aware version.

These are window-frame-aware versions of the , , , , and NTH_VALUE functions.

One that re-rasterizes the input points ()

One which accepts raw raster points directly ()

Use the rasterizing variants if the raster table rows are not already sorted in row-major order (for example, if they represent an arbitrary 2D point cloud), or if filtering or binning is required to reduce the input data to a manageable count (to speed up the contour processing) or to smooth the input data before contour processing. If the input rows do not already form a rectilinear region, the output region will be their 2D bounding box. Many of the parameters of the rasterizing variant are directly equivalent to those of ; see that function for details.

See

See

See

See

See

The information_schema database and Admin Portal dashboards and system tables are installed when you install or upgrade to HEAVY.AI 6.0. For more detailed information on the tables available in the Admin Portal, see .

If your system uses NVIDIA GPUs, but the drivers not installed, install them now. See for details.

Start and use HeavyDB and Heavy Immerse.

For more information, see .

Skip this section if you are on Open Source Edition

enterprise environment, contact your Sales Representative or register for your 30-day trial of Enterprise Edition . If you need a Free License you can get one .

To verify that everything is working, load some sample data, perform a heavysql query, and generate a Pointmap using Heavy Immerse.

Create a Dashboard Using Heavy Immerse (EE and Free Editions only)

sudo apt update
sudo apt upgrade
sudo apt install curl
sudo apt install libncurses5
sudo apt install default-jre-headless apt-transport-https
sudo reboot
sudo useradd --user-group --create-home --group sudo heavyai
sudo passwd heavyai
sudo su - heavyai
curl https://releases.heavy.ai/GPG-KEY-heavyai | sudo apt-key add -
echo "deb https://releases.heavy.ai/ee/apt/ stable cuda" \
| sudo tee /etc/apt/sources.list.d/heavyai.list
echo "deb https://releases.heavy.ai/ee/apt/ stable cpu" \
| sudo tee /etc/apt/sources.list.d/heavyai.list
echo "deb https://releases.heavy.ai/os/apt/ stable cuda" \
| sudo tee /etc/apt/sources.list.d/heavyai.list
echo "deb https://releases.heavy.ai/os/apt/ stable cpu" \
| sudo tee /etc/apt/sources.list.d/heavyai.list
sudo apt update
sudo apt install heavyai
hai_version="6.0.0"
sudo apt install heavyai=$(apt-cache madison heavyai | grep $hai_version | cut -f 2 -d '|' | xargs)
sudo mkdir /opt/heavyai && sudo chown $USER /opt/heavyai
curl \
https://releases.heavy.ai/ee/tar/heavyai-ee-latest-Linux-x86_64-render.tar.gz \
| sudo tar zxf - --strip-components=1 -C /opt/heavyai
curl \
https://releases.heavy.ai/ee/tar/heavyai-ee-latest-Linux-x86_64-cpu.tar.gz \
| sudo tar zxf - --strip-components=1 -C /opt/heavyai
curl \
https://releases.heavy.ai/os/tar/heavyai-os-latest-Linux-x86_64.tar.gz \
| sudo tar zxf - --strip-components=1 -C /opt/heavyai
curl \
https://releases.heavy.ai/os/tar/heavyai-os-latest-Linux-x86_64-cpu.tar.gz \
| sudo tar zxf - --strip-components=1 -C /opt/heavyai
echo "# HEAVY.AI variable and paths
export HEAVYAI_PATH=/opt/heavyai
export HEAVYAI_BASE=/var/lib/heavyai
export HEAVYAI_LOG=$HEAVYAI_BASE/storage/log
export PATH=$HEAVYAI_PATH/bin:$PATH" \
>> ~/.bashrc
source ~/.bashrc
cd $HEAVYAI_PATH/systemd
./install_heavy_systemd.sh
sudo systemctl enable heavydb --now
sudo systemctl enable heavy_web_server --now
sudo systemctl enable heavydb --now
sudo apt install ufw
sudo ufw allow ssh
sudo ufw disable
sudo ufw allow 6273:6278/tcp
sudo ufw enable
cd $HEAVYAI_PATH
sudo ./insert_sample_data --data /var/lib/heavyai/storage
#     Enter dataset number to download, or 'q' to quit:
Dataset           Rows    Table Name          File Name
1)    Flights (2008)    7M      flights_2008_7M     flights_2008_7M.tar.gz
2)    Flights (2008)    10k     flights_2008_10k    flights_2008_10k.tar.gz
3)    NYC Tree Census (2015)    683k    nyc_trees_2015_683k    nyc_trees_2015_683k.tar.gz
$HEAVYAI_PATH/bin/heavysql
password: ••••••••••••••••
SELECT origin_city AS "Origin", 
dest_city AS "Destination", 
AVG(airtime) AS "Average Airtime" 
FROM flights_2008_10k WHERE distance < 175 
GROUP BY origin_city, dest_city;
Origin|Destination|Average Airtime
Austin|Houston|33.055556
Norfolk|Baltimore|36.071429
Ft. Myers|Orlando|28.666667
Orlando|Ft. Myers|32.583333
Houston|Austin|29.611111
Baltimore|Norfolk|31.714286

Using Utilities

initdb

Before using HeavyDB, initialize the data directory using initdb:

initdb [-f | --skip-geo] $HEAVYAI_BASE/storage

This creates the following subdirectories:

  • catalogs: Stores HeavyDB catalogs

  • data: Stores HeavyDB data

  • log: Contains all HeavyDB log files.

  • disk_cache: Stores the data cached by HeavyConnect.

The -f flag forces initdb to overwrite existing data and catalogs in the specified directory.

By default, initdb adds a sample table of geospatial data. Use the --skip-geo flag if you prefer not to load sample geospatial data.

generate_cert

generate_cert [{-ca} <bool>]
              [{-duration} <duration>]
              [{-ecdsa-curve} <string>]
              [{-host} <host1,host2>]
              [{-rsa-bits} <int>]
              [{-start-date} <string>]

This command generates certificates and private keys for an HTTPS server. The options are:

  • [{-ca} <bool>]: Whether this certificate should be its own Certificate Authority. The default is false.

  • [{-duration} <duration>]: Duration that the certificate is valid for. The default is 8760h0m0s.

  • [{-ecdsa-curve} <string>]: ECDSA curve to use to generate a key. Valid values are P224, P256, P384, P521.

  • [{-host} <string>]: Comma-separated hostnames and IPs to generate a certificate for.

  • [{-rsa-bits} <int>]: Size of RSA key to generate. Ignored if -ecdsa-curve is set. The default is 2048.

  • [{-start-date} <string>]: Start date formatted as Jan 1 15:04:05 2011

HeavyDB includes the initdb utility for database initialization and the generate_cert utility for generating certificates and private keys for an HTTPS server.

Uber H3 Hexagonal Modeling

Uber H3 Functions

Overview

Uber H3 is an open-source geospatial system created by Uber Technologies. H3 provides a hierarchical grid system that divides the Earth's surface into hexagons of varying sizes, allowing for easy location-based indexing, search, and analysis.

Hexagons can be created at a single scale, for instance to fill an arbitrary polygon at one resolution (see below). They can also be used to generate a much smaller number of hexagons at multiple scales. In general, operating on H3 hexagons is much faster than on raw arbitrary geometries, at the cost of some precision. Because each hexagon is exactly the same size, this is particularly advantageous for GPU-accelerated workflows.

Advantages

A principal advantage of the system is that for a given scale, hexagons are of approximately equal area. This stands in contrast to other subdivision schemes based on longitudes and latitudes or web Mercator map projections.

A second advantage is that with hexagons, neighbors in all directions are equidistant. This is not true for rectangular subdivisions like pixels, whose 8 neighbors are at different distances.

The exact amount of precision lost can be tightly bounded, with the smallest supported hexagons being about 1 m² in area. That is more accurate than most currently available data sources, short of survey data.

Disadvantages

There are some disadvantages to be aware of. The first is that the world cannot actually be divided up completely cleanly into hexagons. It turns out that a few pentagons are needed, and this introduces discontinuities. However, the system has cleverly placed those pentagons far away from any land masses, so this is only a practical concern for specific maritime operations.

The second issue is that hexagons at adjacent scales do not nest exactly:

This doesn’t much affect practical operations at any single given scale. But if you look carefully at the California multiscale plot above you will discover tiny discontinuities in the form of gaps or overlaps. These don’t amount to a large percentage of the total area, but nonetheless mean this method is not appropriate when exact counts are required.

Supported Methods

geoToH3(longitude, latitude, h3_scale)

Encodes columnar point geometry into a globally-unique h3 cell ID for the specified h3_scale. Scales run from 0 to 15 inclusive, where 0 represents the coarsest resolution and 15 the finest. For details on h3 scales, please see the base library documentation.

This can be applied to literal values:

SELECT geoToH3(2.43817853854884, 48.8427101442789, 5) AS paris_hex05

Or to columnar geographic points:

SELECT geoToH3(raster_lon, raster_lat, 12) AS srtm_terrain_hex12 
FROM srtm_elevation

Note that if you have geographic point data rather than columnar latitude and longitude, you can use the ST_X and ST_Y functions. Also, if you wish to encode the centroids of polygons, such as for building footprints, you can combine this with the ST_CENTROID function.
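
For example, a hedged sketch assuming a hypothetical buildings table with an EPSG:4326 POLYGON column named footprint; it encodes each building's centroid as an H3 cell at resolution 9.

SELECT geoToH3(ST_X(ST_CENTROID(footprint)), ST_Y(ST_CENTROID(footprint)), 9) AS building_hex09
FROM buildings;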

From Hex Codes to Geometry

To retrieve geometric coordinates from an H3 code, two functions are available.

h3ToLat and h3ToLon extract the latitude and longitude respectively, for example:

SELECT h3ToLat(634464641634394600) as latitude
SELECT h3ToLon(634464641634394600) as longitude

Following Parent Relationships

Given an H3 code, the function h3ToParent is available to find cells above that cell at any hierarchical level. This means that once codes are computed at high resolution, they can be compared to codes at other scales.

SELECT h3ToParent(634464641634394600, 3) as level3_parent

H3 Usage Notes

Uber's h3 Python library provides a wider range of functions than those available above (although with significantly slower performance). The library defaults to generating H3 codes as hexadecimal strings, but can be configured to produce BIGINT codes. Please see Uber's documentation for details.

H3 codes can be used in regular joins, including joins in Immerse. They can also be used as aggregators, such as in Immerse custom dimensions. For points that are exactly aligned, such as imports from raster data bands of the same source, aggregating on H3 codes is faster than the exact geographic overlaps function ST_EQUALS.
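
As a sketch of the aggregation pattern, using the srtm_elevation example from above (the z band-value column name is an assumption): each point is assigned an H3 cell at resolution 8, and the band values are averaged per cell.

SELECT
  geoToH3(raster_lon, raster_lat, 8) AS hex08,
  AVG(z) AS avg_elevation,
  COUNT(*) AS n
FROM srtm_elevation
GROUP BY geoToH3(raster_lon, raster_lat, 8);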

SQL Extensions

HEAVY.AI implements a number of custom extension functions to SQL.

Rendering

The following table describes SQL extensions available for the HEAVY.AI implementation of Vega.

SQL SELECT

Function

Arguments and Return

convert_meters_to_merc_pixel_width(meters, lon, lat, min_lon, max_lon, img_width, min_width)

Converts a distance in meters in a longitudinal direction from a latitude/longitude coordinate to a pixel size using mercator projection:

  • meters: Distance in meters in a longitudinal direction to convert to pixel units.

  • lon: Longitude coordinate of the center point to size from.

  • lat: Latitude coordinate of the center point to size from.

  • min_lon: Minimum longitude coordinate of the mercator-projected view.

  • max_lon: Maximum longitude coordinate of the mercator-projected view.

  • img_width: The width in pixels of the view.

  • min_width: Clamps the returned pixel size to be at least this width.

Returns: Floating-point value in pixel units. Can be used for the width of a symbol or a point in Vega.
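
For example, a hedged sketch of a Vega data query, assuming a hypothetical tweets table with lon and lat columns; the view bounds (-124.4 to -66.9) and the 1400-pixel image width would normally come from the Vega specification. It sizes each symbol to roughly 500 meters on the ground, clamped to at least 1 pixel.

SELECT
  lon,
  lat,
  convert_meters_to_merc_pixel_width(500.0, lon, lat, -124.4, -66.9, 1400, 1.0) AS symbol_width
FROM tweets;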

convert_meters_to_merc_pixel_height(meters, lon, lat, min_lat, max_lat, img_height, min_height)

Converts a distance in meters in a latitudinal direction from a latitude/longitude coordinate to a pixel size, using mercator projection:

  • meters: Distance in meters in a latitudinal direction to convert to pixel units.

  • lon: Longitude coordinate of the center point to size from.

  • lat: Latitude coordinate of the center point to size from.

  • min_lat: Minimum latitude coordinate of the mercator-projected view.

  • max_lat: Maximum latitude coordinate of the mercator-projected view.

  • img_height: The height in pixels of the view.

  • min_height: Clamps the returned pixel size to be at least this height.

Returns: Floating-point value in pixel units. Can be used for the height of a symbol or a point in Vega.

convert_meters_to_pixel_width(meters, pt, min_lon, max_lon, img_width, min_width)

Converts a distance in meters in a longitudinal direction from a latitude/longitude POINT to a pixel size. Supports only mercator-projected points.

  • meters: Distance in meters in a longitudinal direction to convert to pixel units.

  • pt: The center POINT to size from. The point must be defined in the EPSG:4326 spatial reference system.

  • min_lon: Minimum longitude coordinate of the mercator-projected view.

  • max_lon: Maximum longitude coordinate of the mercator-projected view.

  • img_width: The width in pixels of the view.

  • min_width: Clamps the returned pixel size to be at least this width.

Returns: Floating-point value in pixel units. Can be used for the width of a symbol or a point in Vega.

convert_meters_to_pixel_height(meters, pt, min_lat, max_lat, img_height, min_height)

Converts a distance in meters in a latitudinal direction from an EPSG:4326 POINT to a pixel size. Currently only supports mercator-projected points:

  • meters: Distance in meters in a latitudinal direction to convert to pixel units.

  • pt: The center POINT to size from. The point must be defined in the EPSG:4326 spatial reference system.

  • min_lat: Minimum latitude coordinate of the mercator-projected view.

  • max_lat: Maximum latitude coordinate of the mercator-projected view.

  • img_height: The height in pixels of the view.

  • min_height: Clamps the returned pixel size to be at least this height.

Returns: Floating-point value in pixel units. Can be used for the height of a symbol or a point in Vega.

is_point_in_merc_view(lon, lat, min_lon, max_lon, min_lat, max_lat)

Returns true if a latitude/longitude coordinate is within a mercator-projected view defined by min_lon/max_lon, min_lat/max_lat.

  • lon: Longitude coordinate of the point.

  • lat: Latitude coordinate of the point.

  • min_lon: Minimum longitude coordinate of the mercator-projected view.

  • max_lon: Maximum longitude coordinate of the mercator-projected view.

  • min_lat: Minimum latitude coordinate of the mercator-projected view.

  • max_lat: Maximum latitude coordinate of the mercator-projected view.

Returns: True if the point is within the view defined by the min_lon/max_lon, min_lat/max_lat; otherwise, false.
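
For example, a hedged sketch using the same hypothetical tweets table as above: the function is used as a WHERE-clause filter to keep only the points that fall inside a continental-US mercator view.

SELECT lon, lat
FROM tweets
WHERE is_point_in_merc_view(lon, lat, -124.4, -66.9, 24.5, 49.4);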

is_point_size_in_merc_view(lon, lat, meters, min_lon, max_lon, min_lat, max_lat)

Returns true if a latitude/longitude coordinate, offset by a distance in meters, is within a mercator-projected view defined by min_lon/max_lon, min_lat/max_lat.

  • lon: Longitude coordinate of the point.

  • lat: Latitude coordinate of the point.

  • meters: Distance in meters to offset the point by, in any direction.

  • min_lon: Minimum longitude coordinate of the mercator-projected view.

  • max_lon: Maximum longitude coordinate of the mercator-projected view.

  • min_lat: Minimum latitude coordinate of the mercator-projected view.

  • max_lat: Maximum latitude coordinate of the mercator-projected view.

Returns: True if the point is within the view defined by the min_lon/max_lon, min_lat/max_lat; otherwise, false.

is_point_in_view(pt, min_lon, max_lon, min_lat, max_lat)

Returns true if a latitude/longitude POINT defined in EPSG:4326 is within a mercator-projected view defined by min_lon/max_lon, min_lat/max_lat.

  • pt: The POINT to check. Must be defined in EPSG:4326 spatial reference system.

  • min_lon: Minimum longitude coordinate of the mercator-projected view.

  • max_lon: Maximum longitude coordinate of the mercator-projected view.

  • min_lat: Minimum latitude coordinate of the mercator-projected view.

  • max_lat: Maximum latitude coordinate of the mercator-projected view.

Returns: True if the point is within the view defined by min_lon/max_lon, min_lat/max_lat; otherwise, false.

is_point_size_in_view(pt, meters, min_lon, max_lon, min_lat, max_lat)

Returns true if a latitude/longitude POINT defined in EPSG:4326, offset by a distance in meters, is within a mercator-projected view defined by min_lon/max_lon, min_lat/max_lat.

  • pt: The POINT to check. Must be defined in EPSG:4326 spatial reference system.

  • meters: Distance in meters to offset the point by, in any direction.

  • min_lon: Minimum longitude coordinate of the mercator-projected view.

  • max_lon: Maximum longitude coordinate of the mercator-projected view.

  • min_lat: Minimum latitude coordinate of the mercator-projected view.

  • max_lat: Maximum latitude coordinate of the mercator-projected view.

Returns: True if a latitude/longitude POINT defined in EPSG:4326, offset by a distance in meters, is within the view defined by min_lon/max_lon, min_lat/max_lat; otherwise, false.

A single-scale tessellation of California into Uber H3 hexagons
A multi-scale tessellation of California into Uber H3 hexagons
Pixel neighbors vary in distance
Hexagon neighbors are equidistant
Hexagons at varying scales do not nest cleanly

SELECT

The SELECT command returns a set of records from one or more tables.

query:
  |   WITH withItem [ , withItem ]* query
  |   {
          select
      }
      [ ORDER BY orderItem [, orderItem ]* ]
      [ LIMIT [ start, ] { count | ALL } ]
      [ OFFSET start { ROW | ROWS } ]

withItem:
      name
      [ '(' column [, column ]* ')' ]
      AS '(' query ')'

orderItem:
      expression [ ASC | DESC ] [ NULLS FIRST | NULLS LAST ]

select:
      SELECT [ DISTINCT ] [/*+ hints */]
          { * | projectItem [, projectItem ]* }    
      FROM tableExpression
      [ WHERE booleanExpression ]
      [ GROUP BY { groupItem [, groupItem ]* } ]
      [ HAVING booleanExpression ]
      [ WINDOW window_name AS ( window_definition ) [, ...] ]

projectItem:
      expression [ [ AS ] columnAlias ]
  |   tableAlias . *

tableExpression:
      tableReference [, tableReference ]*
  |   tableExpression [ ( LEFT ) [ OUTER ] ] JOIN tableExpression [ joinCondition ]

joinCondition:
      ON booleanExpression
  |   USING '(' column [, column ]* ')'

tableReference:
      tablePrimary
      [ [ AS ] alias ]

tablePrimary:
      [ catalogName . ] tableName
  |   '(' query ')'

groupItem:
      expression
  |   '(' expression [, expression ]* ')'
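
The following example, using the flights_2008_10k sample table loaded earlier, exercises several parts of this grammar: a withItem, a filtered projection, grouping, a positional ORDER BY, and a LIMIT.

WITH short_hops AS (
  SELECT origin_city, dest_city, airtime
  FROM flights_2008_10k
  WHERE distance < 175
)
SELECT origin_city AS "Origin",
       dest_city AS "Destination",
       AVG(airtime) AS "Average Airtime"
FROM short_hops
GROUP BY origin_city, dest_city
ORDER BY 3 DESC
LIMIT 10;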

ORDER BY

  • Sort order defaults to ascending (ASC).

  • Sorts null values after non-null values by default in an ascending sort, before non-null values in a descending sort. For any query, you can use NULLS FIRST to sort null values to the top of the results or NULLS LAST to sort null values to the bottom of the results.

  • Allows you to use a positional reference to choose the sort column. For example, the command SELECT colA,colB FROM table1 ORDER BY 2 sorts the results on colB because it is in position 2.
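
For example, the following query against the flights_2008_10k sample table combines the behaviors described above: it sorts by the second projected column (the average arrival delay) in descending order and pushes any null averages to the bottom of the results.

SELECT carrier_name, AVG(arrdelay) AS avg_arrival_delay
FROM flights_2008_10k
GROUP BY carrier_name
ORDER BY 2 DESC NULLS LAST;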

Query Hints

HEAVY.AI provides various query hints for controlling the behavior of the query execution engine.

Syntax

SELECT /*+ hint */ FROM ...;

SELECT hints must appear first, immediately after the SELECT keyword; otherwise, the query fails.

By default, a hint is applied to the query step in which it is defined. If you have multiple SELECT clauses and define a query hint in one of those clauses, the hint is applied only to the specific query step; the rest of the query steps are unaffected. For example, applying the /*+ cpu_mode */ hint affects only the SELECT clause in which it exists.

You can define a hint to apply to all query steps by prepending g_ to the query hint. For example, if you define /*+ g_cpu_mode */, CPU execution is applied to all query steps.
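
For example, in the following sketch using the flights_2008_10k sample table, the first query applies CPU mode only to the outer query step in which the hint appears, while the second applies it to every step, including the inner GROUP BY step.

SELECT /*+ cpu_mode */ COUNT(*)
FROM (SELECT carrier_name FROM flights_2008_10k GROUP BY carrier_name);

SELECT /*+ g_cpu_mode */ COUNT(*)
FROM (SELECT carrier_name FROM flights_2008_10k GROUP BY carrier_name);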

HEAVY.AI supports the following query hints.

The marker hint type represents a Boolean flag.

Hint
Details
Example

allow_loop_join

Enable loop joins.

SELECT /*+ allow_loop_join */ ...

cpu_mode

Force CPU execution mode.

SELECT /*+ cpu_mode */ ...

columnar_output

Enable columnar output for the input query.

SELECT /*+ columnar_output */ ...

disable_loop_join

Disable loop joins.

SELECT /*+ disable_loop_join */ ...

dynamic_watchdog

Enable dynamic watchdog.

SELECT /*+ dynamic_watchdog */ ...

dynamic_watchdog_off

Disable dynamic watchdog.

SELECT /*+ dynamic_watchdog_off */ ...

force_baseline_hash_join

Use the baseline hash join scheme by skipping the perfect hash join scheme, which is used by default.

SELECT /*+ force_baseline_hash_join */ ...

force_one_to_many_hash_join

Deploy a one-to-many hash join by skipping one-to-one hash join, which is used by default.

SELECT /*+ force_one_to_many_hash_join */ ...

keep_result

Add result set of the input query to the result set cache.

SELECT /*+ keep_result */ ...

keep_table_function_result

Add result set of the table function query to the result set cache.

SELECT /*+ keep_table_function_result */ ...

overlaps_allow_gpu_build

Use GPU (if available) to build an overlaps join hash table. (CPU is used by default.)

SELECT /*+ overlaps_allow_gpu_build */ ...

overlaps_no_cache

Skip adding an overlaps join hash table to the hash table cache.

SELECT /*+ overlaps_no_cache */ ...

rowwise_output

Enable row-wise output for the input query.

SELECT /*+ rowwise_output */ ...

watchdog

Enable watchdog.

SELECT /*+ watchdog */ ...

watchdog_off

Disable watchdog.

SELECT /*+ watchdog_off */ ...

The key-value pair type is a hint name and its value.

Hint
Details
Example

aggregate_tree_fanout

Defines the fan-out of the tree used to compute window aggregations over a frame. Depending on the frame size, the tree fan-out affects the performance of both aggregation and tree construction for each window function with a frame clause.

  • Value type: INT

  • Range: 0-1024

SELECT /*+ aggregate_tree_fanout(32) */ SUM(y) OVER (ORDER BY x ROWS BETWEEN ...) ...

loop_join_inner_table_max_num_rows

Set the maximum number of rows available for a loop join.

  • Value type: INT

  • Range: 0 < x

Set the maximum number of rows to 100: SELECT /*+ loop_join_inner_table_max_num_rows(100) */ ...

max_join_hash_table_size

Set the maximum size of the hash table.

  • Value type: INT

  • Range: 0 < x

Set the maximum size of the join hash table to 100:

SELECT /*+ max_join_hash_table_size(100) */ ...

overlaps_bucket_threshold

Set the overlaps bucket threshold.

  • Value type: DOUBLE

  • Range: 0-90

Set the overlaps threshold to 10:

SELECT /*+ overlaps_bucket_threshold(10.0) */ ...

overlaps_max_size

Set the maximum overlaps size.

  • Value type: INTEGER

  • Range: >=0

Set the maximum overlap to 10: SELECT /*+ overlaps_max_size(10.0) */ ...

overlaps_keys_per_bin

Set the number of overlaps keys per bin.

  • Value type: DOUBLE

  • Range: 0.0 < x < double::max

SELECT /*+ overlaps_keys_per_bin(0.1) */ ...

query_time_limit

Set the maximum time for the query to run.

  • Value type: INTEGER

  • Range: >=0

SELECT /*+ query_time_limit(1000) */ ...

Cross-Database Queries

In Release 6.4 and higher, you can run SELECT queries across tables in different databases on the same HEAVY.AI cluster without having to first connect to those databases. This enables more efficient storage and memory utilization by eliminating the need for table duplication across databases, and simplifies access to shared data and tables.

To execute queries against another database, you must have ACCESS privilege on that database, as well as SELECT privilege.

Example

Execute a join query involving a table in the current database and another table in the my_other_db database:

SELECT name, saleamt, saledate FROM my_other_db.customers AS c, sales AS s 
  WHERE c.id = s.customerid;

For more information, see .
