Create, save, search, modify, and manipulate dashboards.
Operating Systems
CentOS/RHEL 7.0 or later
Ubuntu 18.04 or later
Ubuntu 22.04 is not currently supported.
Additional Components
OpenJDK version 8 or higher
EPEL
wget or curl
Kernel headers
Kernel development packages
log4j 2.15.0 or higher
NVIDIA hardware and software (for GPU installs only)
Hardware: Ampere, Turing, Volta, or Pascal series GPU cards. HEAVY.AI recommends that each GPU card in a server or distributed environment be of the same series.
Software:
NVIDIA CUDA drivers, version 470 or higher. Run nvidia-smi to determine the currently running driver.
Up-to-date Vulkan drivers.
Supported web browsers (Enterprise Edition, Immerse). Latest stable release of:
Chrome
Firefox
Safari version 15.x or higher
Some features in Heavy Immerse are not supported in the Internet Explorer browser due to performance issues in IE. HEAVY.AI recommends that you use a different browser to experience the latest Immerse features.
Use of HEAVY.AI is subject to the terms of the HEAVY.AI End User License Agreement (EULA).
Learn how to use Immerse to gain new insights into your data with fast, responsive graphics and SQL queries.
Learn how to install and configure your HEAVY.AI instance, then load data for analysis.
Learn how to extend HEAVY.AI with an integrated data science foundation and custom charts and interfaces. Contribute to the HEAVY.AI Core Open Source project.
For more complete release information, see the Release Notes.
HEAVY.AI continues to refine and extend the data connectors ecosystem. This release features general availability of data connectors for PostgreSQL, beta Immerse connectors for Snowflake and Redshift, and SQL support for Google BigQuery and Hive (beta). These managed data connections let you use HEAVY.AI as an acceleration platform, wherever your source data lives. Scheduling and automated caching ensure that from an end-user perspective, fast analytics are always running on the latest available data.
Immerse features four new chart types: Contour, Cross-section, Wind barb and Skew-t. While these are especially useful for atmospheric and geotechnical data visualization, Contour and Cross-section also have more general application.
Major improvements for time series analysis have been added. This includes time series comparison via window functions, and a large number of SQL window function additions and performance enhancements.
This release also includes two major architectural improvements:
The ability to perform cross-database queries in SQL, increasing flexibility across the board.
Render queries no longer block other GPU queries. In many use cases, renders can be significantly slower than other common queries. This should result in significant performance gains, particularly in map-heavy dashboards.
Chart animation through cross filter replay, allowing controlled playback of time-based data such as weather maps or GPS tracks.
You can now directly export your charts and dashboards as image files.
New control panel enables administrators to view the configuration of the system and easily access logs and system tables.
HeavyConnect now provides graphical Heavy Immerse support for Redshift, Snowflake, and PostGIS connections.
For CPU-only systems, mapping capabilities are improved with the introduction of multilayer CPU-rendered geo.
Numerous improvements to core SQL and geoSQL capabilities.
Support for string to numeric, timestamp, date, and time types with the new TRY_CAST operator.
Explicit and implicit cast support for numeric, timestamp, date, and time types.
Advanced string functions facilitate extraction of data from JSON and externally encoded string formats.
Improvements to COUNT DISTINCT reduce memory requirements considerably in cases with very large cardinalities or highly skewed data distributions.
Added MULTIPOINT and MULTILINESTRING geo types.
Convex and concave hull operators, allowing generation of polygons from points and multipoints. For example, you could generate polygons from clusters of GPS points.
Syntax and performance optimizations across all geometry types, table orderings, and commonly nested functions.
Significant functionality extension of window functions; define windows directly in temporal terms, which is particularly important in time series with missing observations. Window frame support allows improved control at the edges of windows.
Two new functions now support direct loading of LiDAR data: tf_point_cloud_metadata quickly searches tile metadata and helps you find data to import, and tf_load_point_cloud performs the actual import.
Network graph analytics functions have been added. These can work on networks alone, including non-geographic networks, or can find the least-cost path along a geographic network.
New spatial aggregation and smoothing functions. Aggregations work particularly well with LiDAR data; for example, you can pass through only the highest point within an area to create building or canopy height maps. Smoothing helps with noisy datasets and can reveal larger-scale patterns while minimizing visual distractions.
Release 6.1.0 features more granular administrative monitoring dashboards based on logs. These have been accessible in an open format on the server side, and now they are available in Immerse, by specific dashboards, users, or queries. Intermediate and advanced SQL support continues to mature, with INSERT, window functions, and UNION ALL.
This release contains a number of user interface polish items requested by customers. Cartography now supports polygons with colorful borders and transparent fills. Table presentation has been enhanced in various ways, from alignment to zebra striping. And dashboard saving reminders have been scaled back, based on customer feedback.
The extension framework now features an enhanced “custom source” dialog, as well as new SQL commands to see installed extensions and their parameters. We introduce three new extensions. The first, tf_compute_dwell_times, reduces GPS event stream data volumes considerably while keeping relevant information. The others compute feature similarity scores and are very general.
This release also includes initial public betas of our PostgreSQL Immerse connector, and SQL support for COPY FROM ODBC database connections, making it easier to connect to your enterprise data.
This release features large advances in data access, including intelligent linking to enterprise data (HeavyConnect) and support for raster geodata. SQL support includes high-performance string functions, as well as enhancements to window functions and table unions. Performance improvements are noticeable across the product, including fundamental advances in rendering, query compilation, and data transport. Our system administration tools have been expanded with a new Admin Portal, as well as additional system tables supporting detailed diagnostics. Major strides in extensibility include new charting options and a new extensions framework (beta).
Rebranded platform from OmniSci to HEAVY.AI, with OmniSciDB now HeavyDB, OmniSci Render now HeavyRender, and OmniSci Immerse now Heavy Immerse.
HeavyConnect allows the HEAVY.AI platform to work seamlessly as an accelerator for data in other data lakes and data warehouses. For Release 6.0, CSV and Parquet files on local file systems and in S3 buckets can be linked or imported. Other SQL databases are also supported via ODBC (beta).
HeavyConnect enables users to specify a data refresh schedule, which ensures access to up-to-date data.
Heavy Immerse now supports import of dozens of raster data formats, including GeoTIFF, GeoJPEG, and PNG. HeavySQL now supports nearly any vector GIS file format.
Support is included for multidimensional arrays common in the sciences, including GRIB2, NetCDF, and HDF5.
Immerse now supports linking or import of files on the server filesystem (local or mounted). This helps prevent slow data transfers when client bandwidth is limited.
File globbing and filtering allow import of thousands of files at once.
New Gauge chart for easy visualization of key metrics relative to target thresholds.
New landing page and Help Center.
Enhanced mapping workflows with automated column picking.
Support for a wide range of performant string operations using a new string dictionary translation framework, as well as the ability to on-the-fly dictionary encode none-encoded strings with a new ENCODE_TEXT operator.
Support for UNION ALL is now enabled by default, with significant performance improvements from the previous release (where it was beta flagged).
Significant functionality and performance improvements for window functions, including the ability to support expressions in PARTITION and ORDER clauses.
Parallel compilation of queries and a new multi-executor shared code cache provide up to 20% throughput/concurrency gains for interactive usage scenarios.
10X+ performance improvements in many cases for initial join queries via optimized Join Hash Table framework.
New result set recycler allows for expensive query sub-steps to be cached via the SQL hint /*+ keep_result */, which can significantly increase performance when a subquery is used across multiple queries.
Arrow execution endpoints now leverage the parallel execution framework, and Arrow performance has been significantly improved when high-cardinality dictionary-encoded text columns are returned.
Introduces a novel polygon rendering algorithm that does not require pre-triangulated or pre-grouped polygons and can render dynamically generated geometry on the fly (via ST_Buffer). The new algorithm is comparable to its predecessor in terms of both performance and memory and enables optimizations and enhancements in future releases.
New binary transport protocol to Heavy Immerse that significantly increases performance and interactivity for large result sets.
A new Admin Portal provides information on system resources usage and users.
System table support under a new information_schema database, containing 10 new system tables providing system statistics and memory and storage utilization.
New system and user-defined UDF framework (beta), comprising both row (scalar) and table (UDTF) functions, including the ability to define fast UDFs via Numba Python using the RBC framework, which are then inlined into the HeavyDB compiled query code for performant CPU and GPU execution.
System-provided table functions include generate_series for easy numeric series generation, tf_geo_rasterize_slope for fast geospatial binning and slope/aspect computation over elevation data, and others, with more capabilities planned for future releases.
Leveraging the new table function framework, a new HeavyRF module (licensed separately) includes tf_rf_prop and tf_rf_prop_max_signal table functions for fast radio frequency signal propagation analysis and visualization.
New Iframe chart type in Heavy Immerse to allow easier addition of custom chart types. (BETA)
Row-level security (RLS) can be used by an administrator to apply security filtering to queries run as a user or with a role.
Support for import from dozens of image and raster file types, such as jpeg, png, geotiff, and ESRI grid, including remote files.
Significantly more performant, parallelized window functions, executing up to 10X faster than in Release 5.9.
Automatic use of columnar output (instead of the default row-wise output) for large projections, reducing query times by 5-10X in some cases.
Support for the full set of ST_TRANSFORM SRIDs supported by the geos/proj4 library.
Support for numerous vector GIS files (100+ formats supported by current GDAL release).
Support for multidimensional array import from formats common in science and meteorology.
Improved Table chart export to access all data represented by a Table chart.
Introduced dashboard-level named custom SQL.
Significant speedup for POINT and fixed-length array imports and CTAS/ITAS, generally 5-20X faster.
The PNG encoding step of a render request is no longer a blocking step, providing improvement to render concurrency.
Adds support to hide legacy chart types from add/edit chart menu in preparation for future deprecation (defaults to off).
BETA - Adds custom expressions to table columns, allowing for reusable custom dimensions and measures within a single dashboard (defaults to off).
BETA - Adds Crosslink feature with Crosslink Panel UI, allowing crossfilters to fire across different data sources within the same dashboard (defaults to off).
BETA - Adds Custom SQL Source support and Custom SQL Source Manager, allowing the creation of a data source as a SQL statement (defaults to off).
Parallel execution framework is on by default. Running with multiple executors allows parts of query evaluation, such as code generation and intermediate reductions, to be executed concurrently. Currently available for single-node deployments.
Spatial joins between geospatial point types using the ST_Distance operator are accelerated using the overlaps hash join framework, with speedups up to 100x compared to Release 5.7.1.
Significant performance gains for many query patterns through optimization of query code generation, particularly benefitting CPU queries.
Window functions can now be executed without a partition clause being specified (to signify a partition encompassing all rows in the table).
Window functions can now execute over tables with multiple fragments and/or shards.
Native support for ST_Transform between all UTM Zones and EPSG:4326 (Lon/Lat) and EPSG:900913 (Web Mercator).
ST_Equals support for geospatial columns.
Support for the ANSI SQL WIDTH_BUCKET operator for easier and more performant numeric binning, now also used in Immerse for all numeric histogram visualizations.
The Vulkan backend renderer is now enabled by default. The legacy OpenGL renderer is still available as a fallback if there are blocking issues with Vulkan. You can disable the Vulkan renderer using the renderer-use-vulkan-driver=false configuration flag.
Vulkan provides improved performance, memory efficiency, and concurrency.
You are likely to see some performance and memory footprint improvements with Vulkan in Release 5.8, most significantly in multi-GPU systems.
Support for file path regex filter and sort order when executing the COPY FROM command.
New ALTER SYSTEM CLEAR commands that enable clearing CPU or GPU memory from Immerse SQL Editor or any other SQL client.
Extensive enhancements to Immerse support for parameters. Parameters can now be used in chart column selectors, chart filters, chart titles, global filters, and dashboard titles. Dashboards can have parameter widgets embedded on them, side-by-side with charts. Parameter values are visible in chart axes/labels, legends, and tooltips, and you can toggle parameter visibility.
In Immerse Pointmap charts, you can specify which color-by attribute always renders on top, which is useful for highlighting anomalies in data.
Significantly faster and more accurate "lasso" tool filters geospatial data on Immerse Pointmap charts, leveraging native geospatial intersection operations.
Immerse 3D Pointmap chart and HTML support in text charts are available as a beta feature.
Airplane symbol shape has been added as a built-in mark type for the Vega rendering API.
Vega symbol and multi-GPU polygon renders have been made significantly faster.
User-interrupt of query kernels is now on by default. Queries can be interrupted using Ctrl + C in omnisql, or by calling the interrupt API.
Parallel executors are in public beta (set with the --num-executors flag).
Support for APPROX_QUANTILE aggregate.
Support for default column values when creating a table and across all append endpoints, including COPY FROM, INSERT INTO TABLE SELECT, INSERT, and binary load APIs.
Faster and more robust ability to return result sets in Apache Arrow format when queried from a remote client (i.e. non-IPC).
More performant and robust high-cardinality group-by queries.
ODBC driver now supports Geospatial data types.
Custom SQL dimensions, measures, and filters can now be parameterized in Immerse, enabling more flexible and powerful scenario analysis, projections, and comparison use cases.
New angle measure added to Pointmap and Scatter charts, allowing orientation data to be visualized with wedge and arrow icons.
Custom SQL modal with validation and column name display now enabled across all charts in Immerse.
Significantly faster point-in-polygon joins through a new range join hash framework.
Approximate Median function support.
INSERT and INSERT FROM SELECT now support specification of a subset of columns.
Automatic metadata updates and vacuuming for optimizing space usage.
Significantly improved OmniSciDB startup time, as well as a number of significant load and performance improvements.
Improvements to line and polygon stroke rendering and point/symbol rendering.
Ability to set annotations on New Combo charts for different dimension/measure combinations.
New ‘Arrow-over-the-wire’ capability to deliver result sets in Apache Arrow format, with ~3x performance improvement over Thrift-based result set serialization.
Support for concurrent SELECT and UPDATE/DELETE queries for single-node installations.
Initial OmniSci Render support for CPU-only query execution ("Query on CPU, render on GPU"), allowing for a wider set of deployment infrastructure choices.
Cap metadata stored on previous states of a table by using MAX_ROLLBACK_EPOCHS, improving performance for streaming and small batch load use cases and modulating table size on disk.
Added initial compilation support for NVIDIA Ampere GPUs.
Improved performance for UPDATE and DELETE queries.
Improved the performance of filtered group-by queries on large-cardinality string columns.
Added SQL function SAMPLE_RATIO, which takes a proportion between 0 and 1 as an input argument and filters rows to obtain a sampling of a dataset.
Added support for exporting geo data in GeoJSON format.
Dashboard filter functionality is expanded, and filters can be saved as views.
You can perform bulk actions on the dashboard list.
New UI Setting panel in Immerse for customizing charts.
Tabbed dashboards.
SQL Editor now handles Vega JSON requests.
New Combo chart type in Immerse provides increased configurability and flexibility.
Immerse chart-specific filters and quick filters add increased flexibility and speed.
Updated Immerse Filter panel provides a Simple mode and Advanced mode for viewing and creating filters.
On multilayer charts, layer visibility can be set by zoom level.
Different map charts can be synced together for pan and zoom actions, regardless of data source.
Support for the Array type over JDBC.
SELECT DISTINCT in UNION ALL is supported. (UNION ALL is prerelease and must be explicitly enabled.)
Support for joins on DECIMAL types.
Performance improvements on CUDA GPUs, particularly Volta and Turing.
NULL support for geospatial types, including in ALTER TABLE ADD COLUMN.
SQL SHOW commands: SHOW TABLES, SHOW DATABASES, SHOW CREATE TABLE, and SHOW USER SESSIONS.
Ability to perform updates and deletes on temporary tables.
Updates to JDBC driver, including escape syntax handling for the fn keyword and added support to get table metadata.
Notable performance improvements, particularly for join queries, projection queries with order by and/or limit, queries with scalar subqueries, and multicolumn group-by queries.
Query interrupt capability improved to allow canceling long-running queries; JDBC is now also supported.
Completely overhauled SQL Editor, including query formatting, snippets, history, and more.
Database switching from within Immerse, as well as dashboard URLs that contain the database name.
Over 50% reduction in load times for the dashboards list initial load and search.
Cohort builder now supports count (# records) in aggregate filter.
Improved error handling and more meaningful error messages.
Custom logos can now be configured separately for light and dark themes.
Logos can be configured to deep-link to a specific URL.
Added support for UPDATE via JOIN with a subquery in the WHERE clause.
Initial support for TEMPORARY (that is, non-persistent) tables.
Improved performance for multi-column GROUP BY queries, as well as single column GROUP BY queries with high cardinality. Performance improvement varies depending on data volume and available hardware, but most use cases can expect a 1.5 to 2x performance increase over OmniSciDB 5.0.
Improved support for EXISTS and NOT EXISTS subqueries.
Added support for LINESTRING, POLYGON, and MULTIPOLYGON in user defined functions.
Immerse log-ins are fully sessionized and persist across page refreshes.
Pie chart now supports "All Others" and percentage labels.
Cohorts can now be built with aggregation-based filters.
New filter sets can be created through duplicating existing filter sets.
Dashboard URLs now link to individual filter sets.
The new filter panel in Immerse enables the ability to toggle filters on and off, and introduces Filter Sets to provide quick access to different sets of filters in one dashboard.
Immerse now supports using global and cross-filters to interactively build cohorts of interest, and the ability to apply a cohort as a dashboard filter, either within the existing filter set or in a new filter set.
Data Catalog, located within Data Import, is a repository of datasets that users can use to enhance existing analyses.
To see these new features in action, please watch this video from Converge 2019, where Rachel Wang demonstrates how you can use them.
Added support for binary dump and restore of database tables.
Added support for compile-time registered user-defined functions in C++, and experimental support for runtime user-defined SQL functions and table functions in Python via the Remote Backend Compiler.
Support for some forms of correlated subqueries.
Support for update via subquery, to allow for updating a table based on calculations performed on another table.
Multistep queries that generate large, intermediate result sets now execute up to 2.5x faster by leveraging new JIT code generator for reductions and optimized columnarization of intermediate query results.
Frontend-rendered choropleths now support the selection of base map layers.
In this section, you will find recipes to install the HEAVY.AI platform and NVIDIA drivers using a package manager such as apt or dnf, or from a tarball.
This is an end-to-end recipe for installing HEAVY.AI on a Red Hat Enterprise Linux (RHEL) or Rocky Linux 8.x machine using CPU and GPU devices.
The order of these instructions is significant. To avoid problems, install each component in the order presented.
The same instructions can be used to install on Rocky Linux / RHEL 9, with some minor modifications.
These instructions assume the following:
You are installing on a "clean" Rocky Linux / RHEL 8 host machine with only the operating system installed.
Your HEAVY.AI host only runs the daemons and services required to support HEAVY.AI.
Your HEAVY.AI host is connected to the Internet.
Prepare your machine by updating your system and optionally enabling or configuring a firewall.
Update the entire system and reboot the system if needed.
Install the utilities needed to create HEAVY.AI repositories and download installation binaries.
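A minimal sketch of these two preparation steps; the package names are from the standard Rocky Linux / RHEL repositories:
sudo dnf update -y
# Reboot if a new kernel was installed
sudo dnf install -y curl wget dnf-plugins-core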
Follow these instructions to install a headless JDK and configure an environment variable with a path to the library. The “headless” Java Development Kit does not provide support for keyboard, mouse, or display systems. It has fewer dependencies and is best suited for a server host. For more information, see https://openjdk.java.net.
Open a terminal on the host machine.
Install the headless JDK using the following command:
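For example, assuming the OpenJDK 8 headless package from the distribution repositories:
sudo dnf install -y java-1.8.0-openjdk-headless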
Create a group called heavyai and a user named heavyai, who will own HEAVY.AI software and data on the file system. You can create the group, user, and home directory using the useradd command with the --user-group and --create-home switches:
Set a password for the user using the passwd command.
Log in with the newly created user.
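A combined sketch of the user-creation steps above:
sudo useradd --user-group --create-home heavyai
sudo passwd heavyai
su - heavyai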
There are two ways to install the HEAVY.AI software:
DNF Installation: Use DNF's package management capabilities to search for and install the software. This method provides a convenient and efficient way to manage software installations and dependencies on your system.
Tarball Installation: Obtain a compressed archive file (tarball) from the software's official source or repository, extract its contents, and follow the installation instructions provided. This method allows for manual installation and customization of the software.
Using the DNF package manager for installation is highly recommended due to its ability to handle dependencies and streamline the installation process, making it a preferred choice for many users.
If your system includes NVIDIA GPUs but the drivers are not installed, it is advisable to install them before proceeding with the suite installation.
See Install NVIDIA Drivers and Vulkan on Rocky Linux and RHEL for details.
Create a DNF repository depending on the edition (Enterprise, Free, or Open Source) and execution device (GPU or CPU) you will use.
Add the GPG-key to the newly added repository.
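An illustrative sketch only; the repository base URL and GPG key URL are edition- and device-specific values supplied by HEAVY.AI and are shown here as placeholders:
sudo tee /etc/yum.repos.d/heavyai.repo <<'EOF'
[heavyai]
name=HEAVY.AI
baseurl=<HEAVY.AI repository URL for your edition and device>
enabled=1
gpgcheck=1
EOF
sudo rpm --import <HEAVY.AI GPG key URL>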
Use DNF to install the latest version of HEAVY.AI.
You can use the DNF package manager to list the available packages when installing a specific version of HEAVY.AI, such as when a multistep upgrade is necessary, or a specific version is needed for any other reason.
sudo dnf --showduplicates list heavyai
Select the version needed from the list (for example, 7.0.0) and install it using the following command:
sudo dnf install heavyai-7.0.0_20230501_be4f51b048-1.x86_64
Let's begin by creating the installation directory.
Download the archive and install the latest version of the software. The appropriate archive is downloaded based on the edition (Enterprise, Free, or Open Source) and the device used for runtime.
Follow these steps to configure your HEAVY.AI environment.
For your convenience, you can update .bashrc with these environment variables.
Although this step is optional, you will find references to the HEAVYAI_BASE and HEAVYAI_PATH variables. These variables contain the paths where configuration, license, and data files are stored and the location of the software installation. It is strongly recommended that you set them up.
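A sketch of the variables in ~/.bashrc, assuming the software is installed in /opt/heavyai (adjust the paths to your installation):
echo 'export HEAVYAI_PATH=/opt/heavyai' >> ~/.bashrc
echo 'export HEAVYAI_BASE=/var/lib/heavyai' >> ~/.bashrc
echo 'export PATH=$HEAVYAI_PATH/bin:$PATH' >> ~/.bashrc
source ~/.bashrc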
Run the initialization script, located in the systemd folder, to create the HEAVY.AI services and initialize the database storage.
Accept the default values provided or make changes as needed.
This step will take a few minutes if you are installing a CUDA-enabled version of the software because the shaders must be compiled.
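A sketch of this step, assuming the installer script shipped in the systemd folder is named install_heavy_systemd.sh (the name may vary by release):
cd $HEAVYAI_PATH/systemd
# Run as the heavyai user; the script prompts for the values described above
./install_heavy_systemd.sh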
The script creates a data directory in $HEAVYAI_BASE/storage (typically /var/lib/heavyai) with the directories catalogs, data, and log, which will contain the metadata, the data of the database tables, and the log files from Immerse's web server and the database.
The log folder is particularly important for database administrators. It contains data about the system's health, performance, and user activities.
The first step to activate the system is starting HeavyDB and the Web Server service that Heavy Immerse needs. ¹
Heavy Immerse is not available in the OS Edition.
Start the HEAVY.AI services and enable them to start automatically at reboot.
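A sketch, assuming the heavydb and heavy_web_server services created by the installer:
sudo systemctl enable --now heavydb
sudo systemctl enable --now heavy_web_server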
If a firewall is not already installed and you want to harden your system, install and start firewalld.
To use Heavy Immerse or other third-party tools, you must prepare your host machine to accept incoming HTTP(S) connections. Configure your firewall for external access:
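For example, assuming the default Heavy Immerse web port of 6273:
sudo dnf install -y firewalld
sudo systemctl enable --now firewalld
sudo firewall-cmd --permanent --add-port=6273/tcp
sudo firewall-cmd --reload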
Most cloud providers use a different mechanism for firewall configuration. The commands above might not run in cloud deployments.
For more information, see https://fedoraproject.org/wiki/Firewalld?rd=FirewallD.
If you are on Enterprise or Free Edition, you need to validate your HEAVY.AI instance with your license key. You can skip this section if you are using Open Source Edition.
Copy your license key from the registration email message. If you have not received your license key, contact your Sales Representative or register for your 30-day trial here.
Connect to Heavy Immerse using a web browser connected to your host machine on port 6273. For example, http://heavyai.mycompany.com:6273.
When prompted, paste your license key in the text box and click Apply.
Log into Heavy Immerse by entering the default username (admin) and password (HyperInteractive), and then click Connect.
The $HEAVYAI_BASE directory must be dedicated to HEAVY.AI; do not set it to a directory shared by other packages.
To verify that everything is working, load some sample data, perform a heavysql query, and generate a Pointmap using Heavy Immerse.
HEAVY.AI ships with two sample datasets of airline flight information collected in 2008, and a census of New York City trees. To install sample data, run the following command.
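A sketch, assuming the sample-data loader script shipped in $HEAVYAI_PATH is named insert_sample_data:
cd $HEAVYAI_PATH
sudo ./insert_sample_data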
Connect to HeavyDB by entering the following command in a terminal on the host machine (default password is HyperInteractive):
Enter a SQL query such as the following:
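A sketch that connects with heavysql and runs a sample aggregation; the column names assume the bundled flights_2008_10k sample dataset:
$HEAVYAI_PATH/bin/heavysql -p HyperInteractive <<'SQL'
SELECT origin_city AS "Origin", dest_city AS "Destination", AVG(airtime) AS "Average Airtime"
FROM flights_2008_10k
WHERE distance < 175
GROUP BY origin_city, dest_city;
SQL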
The results should be similar to the results below.
After installing Enterprise or Free Edition, check if Heavy Immerse is running as intended.
Connect to Heavy Immerse using a web browser connected to your host machine on port 6273. For example, http://heavyai.mycompany.com:6273.
Log into Heavy Immerse by entering the default username (admin) and password (HyperInteractive), and then click Connect.
Create a new dashboard and a Scatter Plot to verify that backend rendering is working.
Click New Dashboard.
Click Add Chart.
Click SCATTER.
Click Add Data Source.
Choose the flights_2008_10k table as the data source.
Click X Axis +Add Measure.
Choose depdelay.
Click Y Axis +Add Measure.
Choose arrdelay.
Click Size +Add Measure.
Choose airtime.
Click Color +Add Measure.
Choose dest_state.
The resulting chart clearly shows a correlation between departure delay and arrival delay. This insight can help identify areas for improvement and strategies to minimize delays and enhance overall efficiency.
Create a new dashboard and a Bubble chart to verify that Heavy Immerse is working.
Click New Dashboard.
Click Add Chart.
Click Bubble.
Click Select Data Source.
Choose the flights_2008_10k table as the data source.
Click Add Dimension.
Choose carrier_name.
Click Add Measure.
Choose depdelay.
Click Add Measure.
Choose arrdelay.
Click Add Measure.
Choose #Records.
The resulting chart shows, unsurprisingly, that average departure delay is also correlated with average arrival delay, although there are notable differences between carriers.
Install the Extra Packages for Enterprise Linux (EPEL) repository and other packages before installing NVIDIA drivers.
RHEL-based distributions require Dynamic Kernel Module Support (DKMS) to build the GPU driver kernel modules. For more information, see https://fedoraproject.org/wiki/EPEL. Upgrade the kernel and restart the machine.
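A sketch of these steps for RHEL/Rocky 8:
sudo dnf install -y https://dl.fedoraproject.org/pub/epel/epel-release-latest-8.noarch.rpm
sudo dnf install -y dkms
sudo dnf upgrade -y kernel
sudo reboot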
Install kernel headers and development packages:
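For example:
sudo dnf install -y kernel-headers kernel-devel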
If installing kernel headers does not work correctly, follow these steps instead:
Identify the Linux kernel you are using by issuing the uname -r command.
Use the name of the kernel (4.18.0-553.el8_10.x86_64 in the following code example) to install kernel headers and development packages:
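A sketch using the example kernel release above; substitute the value reported by uname -r:
sudo dnf install -y kernel-headers-4.18.0-553.el8_10.x86_64 kernel-devel-4.18.0-553.el8_10.x86_64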
Install the dependencies and extra packages:
CUDA is a parallel computing platform and application programming interface (API) model. It uses a CUDA-enabled graphics processing unit (GPU) for general-purpose processing. The CUDA platform provides direct access to the GPU virtual instruction set and parallel computation elements. For more information on CUDA unrelated to installing HEAVY.AI, see https://developer.nvidia.com/cuda-zone. You can install drivers in multiple ways. This section provides installation information using the NVIDIA website or using dnf.
Although using the NVIDIA website is more time-consuming and less automated, you are assured that the driver is certified for your GPU. Use this method if you are not sure which driver to install. If you prefer a more automated method and are confident that the driver is certified, you can use the DNF package manager method.
Install the CUDA package for your platform and operating system according to the instructions on the NVIDIA website (https://developer.nvidia.com/cuda-downloads).
If you do not know the GPU model installed on your system, run this command:
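One way to list the NVIDIA devices before a driver is installed is with lspci:
lspci | grep -i nvidia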
The output shows the product type, series, and model. In this example, the product type is Tesla, the series is T (as Turing), and the model is T4.
Select the product type shown after running the command above.
Select the correct product series and model for your installation.
In the Operating System dropdown list, select Linux 64-bit.
In the CUDA Toolkit dropdown list, click a supported version (11.4 or higher).
Click Search.
On the resulting page, verify the download information and click Download.
Check that the driver version you download meets the HEAVY.AI minimum requirements.
Move the downloaded file to the server, change the permissions, and run the installation.
You might receive the following error during installation:
ERROR: The Nouveau kernel driver is currently in use by your system. This driver is incompatible with the NVIDIA driver, and must be disabled before proceeding. Please consult the NVIDIA driver README and your Linux distribution's documentation for details on how to correctly disable the Nouveau kernel driver.
If you receive this error, blacklist the Nouveau driver by editing the /etc/modprobe.d/blacklist-nouveau.conf
file, adding the following lines at the end:
blacklist nouveau
blacklist lbm-nouveau
options nouveau modeset=0
alias nouveau off
alias lbm-nouveau off
Install a specific version of the driver for your GPU by installing the NVIDIA repository and using the DNF package manager.
When installing the driver, ensure your GPU model is supported and meets the HEAVY.AI minimum requirements.
Add the NVIDIA network repository to your system.
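A sketch for RHEL/Rocky 8 using NVIDIA's CUDA network repository:
sudo dnf config-manager --add-repo https://developer.download.nvidia.com/compute/cuda/repos/rhel8/x86_64/cuda-rhel8.repo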
Install the driver version needed with dnf. For HEAVY.AI 8.0, the minimum version is 535.
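A sketch, assuming the DKMS module stream for the 535 series from the NVIDIA repository:
sudo dnf module install -y nvidia-driver:535-dkms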
To load the installed driver, run the sudo modprobe nvidia or nvidia-smi command. In the case of a driver upgrade, you can instead reboot your system with sudo reboot to ensure that the new version of the driver is loaded.
Run the specified command to verify that your drivers are installed correctly and recognize the GPUs in your environment. Depending on your environment, you should see output confirming the presence of your NVIDIA GPUs and drivers. This verification step ensures that your system can identify and utilize the GPUs as intended.
If you encounter an error similar to the following, the NVIDIA drivers are likely installed incorrectly:
NVIDIA-SMI has failed because it couldn't communicate with the NVIDIA driver. Please ensure that the latest NVIDIA driver is installed and running.
Please review the Install NVIDIA Drivers section and correct any errors.
The back-end renderer requires a Vulkan-enabled driver and the Vulkan library to work correctly. Without these components, the database cannot start unless the back-end renderer is disabled.
To ensure the Vulkan library and its dependencies are installed, use DNF.
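A sketch; package names can vary slightly between releases:
sudo dnf install -y vulkan-loader vulkan-tools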
For more information about troubleshooting Vulkan, see the Vulkan Renderer section.
You must install the CUDA Toolkit if you use advanced features like C++ User-Defined Functions or User-Defined Table Functions to extend the database capabilities.
1. Add the NVIDIA network repository to your system:
2. List the available CUDA Toolkit versions using the DNF list command.
3. Install the CUDA Toolkit version using DNF.
4. Check that everything is working correctly:
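A sketch of steps 2 through 4; the toolkit version shown is illustrative:
sudo dnf list "cuda-toolkit-*"
sudo dnf install -y cuda-toolkit-12-2
/usr/local/cuda/bin/nvcc --version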
HEAVY.AI is an analytics platform designed to handle very large datasets. It leverages the processing power of GPUs alongside traditional CPUs to achieve very high performance. HEAVY.AI combines an open-source SQL engine (HeavyDB), server-side rendering (HeavyRender), and web-based data visualization (Heavy Immerse) to provide a comprehensive platform for data analysis.
The foundation of the platform is HeavyDB, an open-source, GPU-accelerated database. HeavyDB harnesses GPU processing power and returns SQL query results in milliseconds, even on tables with billions of rows. HeavyDB delivers high performance with rapid query compilation, query vectorization, and advanced memory management.
With native SQL support, HeavyDB returns query results hundreds of times faster than CPU-only analytical database platforms. Use your existing SQL knowledge to query data. You can use the standalone SQL engine with the command line, or the SQL editor that is part of the Heavy Immerse visual analytics interface. Your SQL query results can output to Heavy Immerse or to third-party software such as Birst, Power BI, Qlik, or Tableau.
HeavyDB can store and query data using native Open Geospatial Consortium (OGC) types, including POINT, LINESTRING, POLYGON, and MULTIPOLYGON. With geo type support, you can query geo data at scale using special geospatial functions. Using the power of GPU processing, you can quickly and interactively calculate distances between two points and intersections between objects.
HeavyDB is open source and encourages contribution and innovation from a global community of users. It is available on Github under the Apache 2.0 license, along with components like a Python interface (heavyai) and JavaScript infrastructure (mapd-connector, mapd-charting), making HEAVY.AI the leader in open-source analytics.
HeavyRender works on the server side, using GPU buffer caching, graphics APIs, and a Vega-based interface to generate custom pointmaps, heatmaps, choropleths, scatterplots, and other visualizations. HEAVY.AI enables data exploration by creating and sending lightweight PNG images to the web browser, avoiding high-volume data transfers. Fast SQL queries make metadata in the visualizations appear as if the data exists on the browser side.
Network bandwidth is a bottleneck for complex chart data, so HEAVY.AI uses in-situ rendering of on-GPU query results to accelerate visual rendering. This differentiates HEAVY.AI from systems that execute queries quickly but then transfer the results to the client for rendering, which slows performance.
Efficient geospatial analysis requires fast data-rendering of complex shapes on a map. HEAVY.AI can import and display millions of lines or polygons on a geo chart with minimal lag time. Server-side rendering technology prevents slowdowns associated with transferring data over the network to the client. You can select location shapes down to a local level, like census tracts or building footprints, and cross-filter interactively.
Complex server-side visualizations are specified using an adaptation of the Vega Visualization Grammar. Heavy Immerse generates Vega rendering specifications behind the scenes; however, you can also generate custom visualizations using the same API. This customizable visualization system combines the agility of a lightweight frontend with the power of a GPU engine.
Heavy Immerse is a web-based data visualization interface that uses HeavyDB and HeavyRender for visual interaction. Intuitive and easy to use, Heavy Immerse provides standard visualizations, such as line, bar, and pie charts, as well as complex data visualizations, such as geo point maps, geo heat maps, choropleths, and scatter plots. Heavy Immerse provides quick insights and makes them easy to recognize.
Use dashboards to create and organize your charts. Dashboards automatically cross-filter when interacting with data, and refresh with zero latency. You can create dashboards and interact with conventional charts and data tables, as well as scatterplots and geo charts created by HeavyRender. You can also create your own queries in the SQL editor.
Heavy Immerse lets you create a variety of different chart types. You can display pointmaps, heatmaps, and choropleths alongside non-geographic charts, graphs, and tables. When you zoom into any map, visualizations refresh immediately to show data filtered by that geographic context. Multiple sources of geographic data can be rendered as different layers on the same map, making it easy to find the spatial relationships between them.
Create geo charts with multiple layers of data to visualize the relationship between factors within a geographic area. Each layer represents a distinct metric overlaid on the same map. Those different metrics can come from the same or a different underlying dataset. You can manipulate the layers in various ways, including reorder, show or hide, adjust opacity, or add or remove legends.
Heavy Immerse can visually display dozens of datasets in the same dashboard, allowing you to find multi-factor relationships that you might not otherwise consider. Each chart (or groups of charts) in a dashboard can point to a different table, and filters are applied at the dataset level. Multisource dashboards make it easier to quickly compare across datasets, without merging the underlying tables.
Heavy Immerse is ideal for high-velocity data that is constantly streaming; for example, sensor, clickstream, telematics, or network data. You can see the latest data to spot anomalies and trend variances rapidly. Immerse auto-refresh automatically updates dashboards at flexible intervals that you can tailor to your use case.
I want to...
See...
Install HEAVY.AI
Upgrade to the latest version
Configure HEAVY.AI
See some tutorials and demos to help get up and running
Learn more about charts in Heavy Immerse
Use HEAVY.AI in the cloud
See what APIs work with HEAVY.AI
Learn about features and resolved issues for each release
Know what issues and limitations to look out for
See answers to frequently asked questions
This is an end-to-end recipe for installing HEAVY.AI on an Ubuntu 18.04/20.04 machine using CPU and GPU devices.
The order of these instructions is significant. To avoid problems, install each component in the order presented.
These instructions assume the following:
You are installing on a “clean” Ubuntu 18.04/20.04 host machine with only the operating system installed.
Your HEAVY.AI host only runs the daemons and services required to support HEAVY.AI.
Your HEAVY.AI host is connected to the Internet.
Prepare your Ubuntu machine by updating your system, creating the HEAVY.AI user (named heavyai), installing kernel headers, installing CUDA drivers, and optionally enabling the firewall.
1. Update the entire system:
2. Install the utilities needed to create HEAVY.AI repositories and download archives:
3. Install the headless JDK and the apt-transport-https utility:
4. Reboot to activate the latest kernel:
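A combined sketch of the four preparation steps above; the JDK package shown is one OpenJDK option that satisfies the version requirement:
sudo apt update && sudo apt upgrade -y
sudo apt install -y curl wget gnupg
sudo apt install -y default-jre-headless apt-transport-https
sudo reboot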
Create a group called heavyai and a user named heavyai, who will be the owner of the HEAVY.AI software and data on the filesystem.
1. Create the group, user, and home directory using the useradd command with the --user-group and --create-home switches.
2. Set a password for the user:
3. Log in with the newly created user:
Install HEAVY.AI using APT or a tarball.
Installation using the APT package manager is recommended for those who want a more automated install and upgrade procedure.
Download and add a GPG key to APT.
Add an APT source depending on the edition (Enterprise, Free, or Open Source) and execution device (GPU or CPU) you are going to use.
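An illustrative sketch only; the repository URL, suite, component, and GPG key URL are edition- and device-specific values supplied by HEAVY.AI and are shown here as placeholders:
curl -fsSL <HEAVY.AI APT GPG key URL> | sudo gpg --dearmor -o /usr/share/keyrings/heavyai.gpg
echo "deb [signed-by=/usr/share/keyrings/heavyai.gpg] <HEAVY.AI APT repository URL> <suite> <component>" | sudo tee /etc/apt/sources.list.d/heavyai.list
sudo apt update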
Use apt to install the latest version of HEAVY.AI.
If you need to install a specific version of HEAVY.AI, because you are upgrading from OmniSci or for other reasons, run the following command:
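A sketch; the exact version string comes from the repository listing:
sudo apt list -a heavyai
sudo apt install heavyai=<version string from the listing>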
First create the installation directory.
Download the archive and install the software. A different archive is downloaded depending on the Edition (Enterprise, Free, or Open Source) and the device used for runtime (GPU or CPU).
Follow these steps to prepare your HEAVY.AI environment.
For convenience, you can update .bashrc with these environment variables.
Although this step is optional, you will find references to the HEAVYAI_BASE and HEAVYAI_PATH variables. These variables contain, respectively, the path where configuration, license, and data files are stored and the location of the software installation. Setting them is strongly recommended.
Run the systemd installer to create the heavyai services and a minimal config file, and to initialize the data storage.
Accept the default values provided or make changes as needed.
The script creates a data directory in $HEAVYAI_BASE/storage (default /var/lib/heavyai/storage) with the directories catalogs, data, export, and log. The import directory is created when you insert data for the first time. If you are a HEAVY.AI administrator, the log directory is of particular interest.
Heavy Immerse is not available in the OS Edition, so the systemctl command using heavy_web_server has no effect.
Enable the automatic startup of the service at reboot and start the HEAVY.AI services.
If a firewall is not already installed and you want to harden your system, install ufw.
To use Heavy Immerse or other third-party tools, you must prepare your host machine to accept incoming HTTP(S) connections. Configure your firewall for external access.
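A minimal sketch that allows SSH and the Heavy Immerse web port (6273 by default):
sudo apt install -y ufw
sudo ufw allow ssh
sudo ufw allow 6273/tcp
sudo ufw enable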
Most cloud providers use a different mechanism for firewall configuration. The commands above might not run in cloud deployments.
If you are using Enterprise or Free Edition, you need to validate your HEAVY.AI instance with your license key.
Copy your Enterprise or Free Edition license key from the registration email message. If you do not have a license and want to evaluate HEAVY.AI, you can register for a 30-day trial.
Connect to Heavy Immerse using a web browser connected to your host machine on port 6273. For example, http://heavyai.mycompany.com:6273.
When prompted, paste your license key in the text box and click Apply.
Log into Heavy Immerse by entering the default username (admin) and password (HyperInteractive), and then click Connect.
HEAVY.AI ships with two sample datasets of airline flight information collected in 2008, and a census of New York City trees. To install sample data, run the following command.
Connect to HeavyDB by entering the following command in a terminal on the host machine (default password is HyperInteractive):
Enter a SQL query such as the following:
The results should be similar to the results below.
After installing Enterprise or Free Edition, check if Heavy Immerse is running as intended.
Connect to Heavy Immerse using a web browser connected to your host machine on port 6273. For example, http://heavyai.mycompany.com:6273.
Log into Heavy Immerse by entering the default username (admin) and password (HyperInteractive), and then click Connect.
Create a new dashboard and a Scatter Plot to verify that backend rendering is working.
Click New Dashboard.
Click Add Chart.
Click SCATTER.
Click Add Data Source.
Choose the flights_2008_10k table as the data source.
Click X Axis +Add Measure.
Choose depdelay.
Click Y Axis +Add Measure.
Choose arrdelay.
Click Size +Add Measure.
Choose airtime.
Click Color +Add Measure.
Choose dest_state.
The resulting chart shows, unsurprisingly, that there is a correlation between departure delay and arrival delay.
Create a new dashboard and a Bubble chart to verify that Heavy Immerse is working.
Click New Dashboard.
Click Add Chart.
Click Bubble.
Click Select Data Source.
Choose the flights_2008_10k table as the data source.
Click Add Dimension.
Choose carrier_name.
Click Add Measure.
Choose depdelay.
Click Add Measure.
Choose arrdelay.
Click Add Measure.
Choose #Records.
The resulting chart shows, unsurprisingly, that average departure delay is also correlated with average arrival delay, although there are notable differences between carriers.
The amount of data you can process with the HEAVY.AI database depends primarily on the amount of GPU RAM and CPU RAM available across HEAVY.AI cluster servers. For zero-latency queries, the system caches compressed versions of the queried rows and columns into GPU RAM. This is called hot data. Semi-hot data utilizes CPU RAM for certain parts of the data.
The example configurations below can help you configure your system.
Optimal GPUs on which to run the HEAVY.AI platform include:
NVIDIA Tesla A100
NVIDIA Tesla V100 v2
NVIDIA Tesla V100 v1
NVIDIA Tesla P100
NVIDIA Tesla P40
NVIDIA Tesla T4
The following configurations are valid for systems using any of these GPUs as the building blocks of your system. For production systems, use Tesla enterprise-grade cards. Avoid mixing card types in the same system; use a consistent card model across your environment.
Primary factors to consider when choosing GPU cards are:
The amount of GPU RAM available on each card
The number of GPU cores
Memory bandwidth
Newer cards like the Tesla V100 have higher double-precision compute performance, which is important in geospatial analytics. The Tesla V100 models support the NVLink interconnect, which can provide a significant speed increase for some query workloads.
For advice on optimal GPU hardware for your particular use case, ask your HEAVY.AI sales representative.
Before considering hardware details, this topic describes the HeavyDB architecture.
HeavyDB is a hybrid compute architecture that utilizes GPU, CPU, and storage. GPU and CPU are the Compute Layer, and SSD storage is the Storage Layer.
When determining the optimal hardware, make sure to consider the storage and compute layers separately.
Loading raw data into HeavyDB ingests data onto disk, so you can load as much data as you have disk space available, allowing some overhead.
When queries are executed, HeavyDB optimizer utilizes GPU RAM first if it is available. You can view GPU RAM as an L1 cache conceptually similar to modern CPU architectures. HeavyDB attempts to cache the hot data. If GPU RAM is unavailable or filled, HeavyDB optimizer utilizes CPU RAM (L2). If both L1 and L2 are filled, query records overflow to disk (L3). To minimize latency, use SSDs for the Storage Layer.
You can run a query on a record set that spans both GPU RAM and CPU RAM as shown in the diagram above, which also shows the relative performance improvement you can expect based on whether the records all fit into L1, a mix of L1 and L2, only L2, or some combination of L1, L2, and L3.
The server is not limited to any number of hot records. You can store as much data on disk as you want. The system can also store and query records in CPU RAM, but with higher latency. The hot records represent the number of records on which you can perform zero-latency queries.
The amount of CPU RAM should equal four to eight times the amount of total available GPU memory. Each NVIDIA Tesla P40 has 24 GB of onboard RAM available, so if you determine that your application requires four NVIDIA P40 cards, you need between 4 x 24 GB x 4 (384 GB) and 4 x 24 GB x 8 (768 GB) of CPU RAM. This correlation between GPU RAM and CPU RAM exists because HeavyDB uses CPU RAM in certain operations for columns that are not filtered or aggregated.
A HEAVY.AI deployment should be provisioned with enough SSD storage to reliably store the required data on disk, both in compressed format and in HEAVY.AI itself. HEAVY.AI requires 30% overhead beyond compressed data volumes. HEAVY.AI recommends drives such as the Intel® SSD DC S3610 Series, or similar, in any size that meets your requirements.
For maximum ingestion speed, HEAVY.AI recommends ingesting data from files stored on the HEAVY.AI instance.
Most public cloud environments’ default storage is too small for the data volume HEAVY.AI ingests. Estimate your storage requirements and provision accordingly.
If you already have your data in a database, you can look at the largest fact table, get a count of those records, and compare that with this schedule.
If you have a .csv file, you need to get a count of the number of lines and compare it with this schedule.
HEAVY.AI uses the CPU in addition to the GPU for some database operations. GPUs are the primary performance driver; CPUs are utilized secondarily. More cores provide better performance but increase the cost. Intel CPUs with 10 cores offer good performance for the price. For example, you could configure your system with a single NVIDIA P40 GPU and two 10-core CPUs. Similarly, you can configure a server with eight P40s and two 10-core CPUs.
Suggested CPUs:
Intel® Xeon® E5-2650 v3 2.3GHz, 10 cores
Intel® Xeon® E5-2660 v3 2.6GHz, 10 cores
Intel® Xeon® E5-2687 v3 3.1GHz, 10 cores
Intel® Xeon® E5-2667 v3 3.2GHz, 8 cores
GPUs are typically connected to the motherboard using PCIe slots. The PCIe connection is based on the concept of a lane, which is a single-bit, full-duplex, high-speed serial communication channel. The most common numbers of lanes are x4, x8, and x16. The current PCIe 3.0 version with an x16 connection has a bandwidth of 16 GB/s. PCIe 2.0 bandwidth is half the PCIe 3.0 bandwidth, and PCIe 1.0 is half the PCIe 2.0 bandwidth. Use a motherboard that supports the highest bandwidth, preferably, PCIe 3.0. To achieve maximum performance, the GPU and the PCIe controller should have the same version number.
The PCIe specification permits slots with different physical sizes, depending on the number of lanes connected to the slot. For example, a slot with an x1 connection uses a smaller slot, saving space on the motherboard. However, bigger slots can actually have fewer lanes than their physical designation. For example, motherboards can have x16 slots connected to x8, x4, or even x1 lanes. With bigger slots, check to see if their physical sizes correspond to the number of lanes. Additionally, some slots downgrade speeds when lanes are shared. This occurs most commonly on motherboards with two or more x16 slots. Some motherboards have only 16 lanes connecting the first two x16 slots to the PCIe controller. This means that when you install a single GPU, it has the full x16 bandwidth available, but two installed GPUs each have x8 bandwidth.
HEAVY.AI does not recommend adding GPUs to a system that is not certified to support the cards. For example, to run eight GPU cards in a machine, the BIOS must register the additional address space required for the number of cards. Other considerations include power routing, power supply rating, and air movement through the chassis and cards for temperature control.
NVLink is a bus technology developed by NVIDIA. Compared to PCIe, NVLink offers higher bandwidth between host CPU and GPU and between the GPU processors. NVLink-enabled servers, such as the IBM S822LC Minsky server, can provide up to 160 GB/sec bidirectional bandwidth to the GPUs, a significant increase over PCIe. Because Intel does not currently support NVLink, the technology is available only on IBM Power servers. Servers like the NVIDIA-manufactured DGX-1 offer NVLink between the GPUs but not between the host and the GPUs.
A variety of hardware manufacturers make suitable GPU systems. For more information, follow these links to their product specifications.
Upgrade the system and the kernel, then reboot the machine if needed.
Install kernel headers and development packages.
Install the extra packages.
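A combined sketch of these steps; build-essential is shown as an example of commonly required extra build packages:
sudo apt update && sudo apt full-upgrade -y    # reboot afterwards if a new kernel was installed
sudo apt install -y "linux-headers-$(uname -r)"
sudo apt install -y build-essential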
The rendering engine of HEAVY.AI (present in Enterprise Editions) requires a Vulkan-enabled driver and the Vulkan library. Without these components, the database itself may not be able to start.
Install the Vulkan library and its dependencies using apt.
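For example:
sudo apt install -y libvulkan1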
For more information about troubleshooting Vulkan, see the Vulkan Renderer section.
Installing NVIDIA drivers with support for the CUDA platform is required to run GPU-enabled versions of HEAVY.AI.
You can install NVIDIA drivers in multiple ways; three available options are outlined below. If you prefer not to decide, we recommend Option 1.
Keep a record of the installation method you used; upgrading NVIDIA drivers later requires using the same method.
The CUDA Toolkit from NVIDIA provides everything you need to develop GPU-accelerated applications, including GPU-accelerated libraries, a compiler, development tools, and the CUDA runtime. The CUDA Toolkit is not required to run HEAVY.AI, but you must install it if you use advanced features like C++ user-defined functions or user-defined table functions to extend the database capabilities.
The minimum CUDA version supported by HEAVY.AI is 11.4. We recommend using a release that has been available for at least two months.
In the "Target Platform" section, follow these steps:
For "Operating System" select Linux
For "Architecture" select x86_64
For "Distribution" select Ubuntu
For "Version" select the version of your operating system (18.04 or 20.04)
For "Installer Type" choose deb (network) **
One by one, run the presented commands in the Installer Instructions section on your server.
** You may optionally use any of the "Installer Type" options available.
If you choose to use the .run file option, prior to running the installer you need to manually install build-essential
using apt
and change permissions of the downloaded .run file to allow execution.
If you don't know the exact GPU model in your system, run this command:
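The command itself was not preserved in this extraction; one common way to identify the card, with or without an NVIDIA driver already installed, is:
# If an NVIDIA driver is already loaded, list the GPUs it sees
nvidia-smi -L
# Otherwise, query the PCI bus for NVIDIA devices
lspci | grep -i nvidia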
The output shows the Product Type, Series, and Model.
In this example, the Product Type is Tesla, the Series is T (Turing), and the Model is T4.
Select the Product Type as the one you got with the command.
Select the correct Product Series and Product for your installation.
In the Operating System dropdown list, select Linux 64-bit.
In the CUDA Toolkit dropdown list, click a supported version (11.4 or higher).
Click Search.
On the resulting page, verify the download information and click Download
On the subsequent page, if you agree to the terms, right-click "Agree and Download" and select "Copy Link Address". You can also manually download the file and transfer it to your server, skipping the next step.
On your server, type wget
and paste the URL you copied in the previous step. Press enter to download.
Install the tools needed for installation.
Change the permissions of the downloaded .run file to allow execution, and run the installation.
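A sketch of these steps (the installer file name is a placeholder for the .run file you downloaded):
# Install the build tools the .run installer needs
sudo apt install -y build-essential
# Make the installer executable and run it
chmod +x NVIDIA-Linux-x86_64-<version>.run
sudo ./NVIDIA-Linux-x86_64-<version>.run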
Install a specific version of the driver for your GPU by installing the NVIDIA repository and using the apt
package manager.
Run this command to get a list of the available driver versions (see the sketch after these steps):
Install the driver version needed with apt
Reboot your system to ensure the new version of the driver is loaded
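The commands for these three steps were not preserved here; the following sketch assumes the NVIDIA repository has already been added and uses an example driver version:
# List the driver versions available from the configured repositories
apt list 'nvidia-driver-*' 2>/dev/null
# Install the chosen version (525 is an example; pick one that meets HEAVY.AI's minimum requirements)
sudo apt install -y nvidia-driver-525
# Reboot so the new driver is loaded
sudo reboot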
Run nvidia-smi
to verify that your drivers are installed correctly and recognize the GPUs in your environment. Depending on your environment, you should see something like this to confirm that your NVIDIA GPUs and drivers are present.
If you see an error like the following, the NVIDIA drivers are probably installed incorrectly:
Review the installation instructions, specifically checking for completion of install prerequisites, and correct any errors.
The rendering engine of HEAVY.AI requires a Vulkan-enabled driver and the Vulkan library. Without these components, the database cannot start unless the back-end renderer is disabled.
Install the Vulkan library and its dependencies using apt
.
You must install the CUDA toolkit and Clang if you use advanced features like C++ user-defined functions or user-defined table functions to extend the database capabilities.
Install the NVIDIA public repository GPG key.
Add the repository.
List the available Cuda toolkit versions.
Install the CUDA toolkit using apt
.
Check that everything is working and the toolkit has been installed.
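The individual commands were not preserved here; the following sketch uses NVIDIA's cuda-keyring package, which installs the repository GPG key and the repository definition in one step (the URL and version numbers are examples for Ubuntu 20.04):
# Install the NVIDIA repository GPG key and repository
wget https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2004/x86_64/cuda-keyring_1.0-1_all.deb
sudo dpkg -i cuda-keyring_1.0-1_all.deb
sudo apt update
# List the available CUDA toolkit versions
apt list 'cuda-toolkit-*' 2>/dev/null
# Install a supported toolkit version (11.4 or higher)
sudo apt install -y cuda-toolkit-11-4
# Check that the toolkit is installed
/usr/local/cuda/bin/nvcc --version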
You must install Clang if you use advanced features like C++ user-defined functions or user-defined table functions to extend the database capabilities. Install Clang and LLVM dependencies using apt
.
Check that the software is installed and in the execution path.
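For example, on Ubuntu:
# Install Clang and the LLVM dependencies
sudo apt install -y clang llvm
# Check that the compiler is installed and in the execution path
clang --version
which clang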
Follow these steps to install HEAVY.AI as a Docker container on a machine running on CPU only or with supported NVIDIA GPU cards, using Ubuntu as the host OS.
Prepare your host by installing Docker and, if needed for your configuration, NVIDIA drivers and the NVIDIA container runtime.
Remove any existing Docker installs and, if on GPU, the legacy NVIDIA Docker runtime.
Use curl
to add Docker's GPG key.
Add Docker to your Apt repository.
Update your repository.
Install Docker, the command line interface, and the container runtime.
Run the following usermod
command so that docker command execution does not require sudo privilege (recommended). Log out and log back in for the changes to take effect.
Verify your Docker installation.
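The individual commands were not preserved in this extraction; the following condensed sketch follows Docker's standard Ubuntu installation procedure for the steps above:
# Remove legacy Docker packages (and, on GPU hosts, any legacy nvidia-docker packages)
sudo apt remove -y docker docker-engine docker.io containerd runc
# Add Docker's GPG key
curl -fsSL https://download.docker.com/linux/ubuntu/gpg | \
  sudo gpg --dearmor -o /usr/share/keyrings/docker-archive-keyring.gpg
# Add the Docker APT repository and refresh the package index
echo "deb [arch=amd64 signed-by=/usr/share/keyrings/docker-archive-keyring.gpg] \
  https://download.docker.com/linux/ubuntu $(lsb_release -cs) stable" | \
  sudo tee /etc/apt/sources.list.d/docker.list > /dev/null
sudo apt update
# Install Docker, the command-line interface, and the container runtime
sudo apt install -y docker-ce docker-ce-cli containerd.io
# Allow the current user to run docker without sudo (log out and back in afterward)
sudo usermod -aG docker $USER
# Verify the installation
docker --version
sudo docker run hello-world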
Use curl
to add NVIDIA's GPG key:
Update your sources list:
Update apt-get and install nvidia-container-runtime:
Edit /etc/docker/daemon.json to add the following, and save the changes:
Restart the Docker daemon:
Verify that docker and NVIDIA runtime work together.
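A condensed sketch of these steps (repository URLs follow NVIDIA's published instructions for nvidia-container-runtime; the CUDA image tag in the final check is an example):
# Add NVIDIA's GPG key and the nvidia-container-runtime repository
curl -s -L https://nvidia.github.io/nvidia-container-runtime/gpgkey | sudo apt-key add -
curl -s -L https://nvidia.github.io/nvidia-container-runtime/$(. /etc/os-release; echo $ID$VERSION_ID)/nvidia-container-runtime.list | \
  sudo tee /etc/apt/sources.list.d/nvidia-container-runtime.list
# Install the runtime
sudo apt-get update
sudo apt-get install -y nvidia-container-runtime
# Register the runtime in /etc/docker/daemon.json
sudo tee /etc/docker/daemon.json > /dev/null <<'EOF'
{
  "runtimes": {
    "nvidia": {
      "path": "/usr/bin/nvidia-container-runtime",
      "runtimeArgs": []
    }
  },
  "default-runtime": "nvidia"
}
EOF
# Restart the Docker daemon and verify that containers can see the GPUs
sudo systemctl restart docker
docker run --rm --gpus all nvidia/cuda:11.4.3-base-ubuntu20.04 nvidia-smi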
If everything is working, you should see the output of the nvidia-smi command showing the GPUs installed in the system.
Create a directory to store data and configuration files.
Then create a minimal configuration file for the Docker installation.
Ensure that you have sufficient storage on the drive you choose for your storage directory by running this command:
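For example, the three steps together (paths and settings shown are the conventional defaults, not requirements):
# Create the storage directory
sudo mkdir -p /var/lib/heavyai
# Write a minimal heavy.conf for the Docker installation
sudo tee /var/lib/heavyai/heavy.conf > /dev/null <<'EOF'
port = 6274
http-port = 6278
data = "/var/lib/heavyai/storage"
null-div-by-zero = true

[web]
port = 6273
frontend = "/opt/heavyai/frontend"
EOF
# Check free space on the drive that hosts the storage directory
df -h /var/lib/heavyai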
Download HEAVY.AI from DockerHub and Start HEAVY.AI in Docker. Select the tab depending on the Edition (Enterprise, Free, or Open Source) and execution Device (GPU or CPU) you are going to use.
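As an illustration, a typical run command for the Enterprise Edition GPU image looks like this (the image tag and port range are examples; CPU and Open Source editions use different image names):
docker run -d --gpus all \
  -v /var/lib/heavyai:/var/lib/heavyai \
  -p 6273-6278:6273-6278 \
  --name heavyai \
  heavyai/heavyai-ee-cuda:latest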
Check that the container is up and running by using a docker ps command:
You should see an output similar to the following.
If a firewall is not already installed and you want to harden your system, install the ufw package.
To use Heavy Immerse or other third-party tools, you must prepare your host machine to accept incoming HTTP(S) connections. Configure your firewall for external access.
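One possible configuration on Ubuntu, opening the ports listed later in this guide (adjust the rules to your environment):
sudo apt install -y ufw
sudo ufw allow ssh
sudo ufw allow 6273/tcp   # Heavy Immerse
sudo ufw allow 6274/tcp   # HeavyDB Thrift API
sudo ufw enable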
Most cloud providers use a different mechanism for firewall configuration. The commands above might not run in cloud deployments.
Connect to Heavy Immerse using a web browser to your host on port 6273. For example, http://heavyai.mycompany.com:6273
.
When prompted, paste your license key in the text box and click Apply.
Log into Heavy Immerse by entering the default username (admin
) and password (HyperInteractive
), and then click Connect.
You can access the command line in the Docker image to perform configuration and run HEAVY.AI utilities.
You need to know the container-id
to access the command line. Use the command below to list the running containers.
You see output similar to the following.
Once you have your container ID, in the example 9e01e520c30c, you can access the command line using the Docker exec command. For example, here is the command to start a Bash session in the Docker instance listed above. The -it
switch makes the session interactive.
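For example, using the container ID shown above:
docker exec -it 9e01e520c30c bash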
You can end the Bash session with the exit
command.
HEAVY.AI ships with two sample datasets of airline flight information collected in 2008, and a census of New York City trees. To install sample data, run the following command.
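The command itself was not preserved here; it follows the standard install layout inside the container:
docker exec -it <container-id> ./insert_sample_data
# If the working directory differs in your image, use the full path, for example /opt/heavyai/insert_sample_data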
Where <container-id> is the container in which HEAVY.AI is running.
When prompted, choose whether to insert dataset 1 (7,000,000 rows), dataset 2 (10,000 rows), or dataset 3 (683,000 rows). The examples below use dataset 2.
Connect to HeavyDB by entering the following command (you are prompted for a password; the default password is HyperInteractive):
Enter a SQL query such as the following:
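The query was not preserved in this extraction; the following is an illustrative query against the flights_2008_10k sample table, using only columns referenced elsewhere in this guide:
SELECT carrier_name,
       AVG(depdelay) AS avg_departure_delay,
       AVG(arrdelay) AS avg_arrival_delay,
       COUNT(*) AS num_flights
FROM flights_2008_10k
GROUP BY carrier_name
ORDER BY avg_departure_delay DESC
LIMIT 10;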
The results should be similar to the results below.
If you installed Enterprise or Free Edition, check that Heavy Immerse is running as intended.
Connect to Heavy Immerse using a web browser connected to your host machine on port 6273. For example, http://heavyai.mycompany.com:6273
.
Log into Heavy Immerse by entering the default username (admin
) and password (HyperInteractive
), and then click Connect.
Create a new dashboard and a Scatter Plot to verify that backend rendering is working.
Click New Dashboard.
Click Add Chart.
Click SCATTER.
Click Add Data Source.
Choose the flights_2008_10k table as the data source.
Click X Axis +Add Measure.
Choose depdelay.
Click Y Axis +Add Measure.
Choose arrdelay.
Click Size +Add Measure.
Choose airtime.
Click Color +Add Measure.
Choose dest_state.
The resulting chart shows, unsurprisingly, that there is a correlation between departure delay and arrival delay.
Create a new dashboard and a Bubble chart to verify that Heavy Immerse is working.
Click New Dashboard.
Click Add Chart.
Click Bubble.
Click Select Data Source.
Choose the flights_2008_10k table as the data source.
Click Add Dimension.
Choose carrier_name.
Click Add Measure.
Choose depdelay.
Click Add Measure.
Choose arrdelay.
Click Add Measure.
Choose #Records.
The resulting chart shows, unsurprisingly, that average departure delay is also correlated with average arrival delay, with noticeable differences between carriers.
If your system uses NVIDIA GPUs but the drivers are not installed, install them now. See the NVIDIA driver installation section for details.
Start and use HeavyDB and Heavy Immerse.
For more information, see .
Skip this section if you are on Open Source Edition
If you don't have a license and you want to evaluate HEAVY.AI in an enterprise environment, contact your Sales Representative or register for your 30-day trial of Enterprise Edition. If you need a Free License, you can get one.
To verify that everything is working, load some sample data, perform a heavysql
query, and generate a Pointmap using Heavy Immerse
The table refers to hot records, which are the number of records that you want to put into GPU RAM to get zero-lag performance when querying and interacting with the data. The Hardware Sizing Schedule assumes 16 hot columns, which is the number of columns involved in the predicate or computed projections (such as column1 / column2) of any one of your queries. A 15 percent GPU RAM overhead is reserved for rendering buffering and intermediate results. If your queries involve more columns, the number of records you can put in GPU RAM decreases accordingly.
HeavyDB does not require all queried columns to be processed on the GPU. Non-aggregate projection columns, such as SELECT x, y FROM table, do not need to be processed on the GPU and so can be stored in CPU RAM. The CPU RAM sizing assumes that up to 24 columns are used only in non-computed projections, in addition to the hot columns.
This schedule estimates the number of records you can process based on GPU RAM and CPU RAM sizes, assuming up to 16 hot columns (see ). This applies to the compute layer. For the storage layer, provision your application according to guidelines.
HEAVY.AI recommends installing GPUs in motherboards with support for as much PCIe bandwidth as possible. On modern Intel chip sets, each socket (CPU) offers 40 lanes, so with the correct motherboards, each GPU can receive x8 of bandwidth. All recommended systems have motherboards designed to maximize PCIe bandwidth to the GPUs.
For an emerging alternative to PCIe, see NVLink.
Option 1: Install NVIDIA drivers with the CUDA toolkit from the NVIDIA website
Option 2: Install NVIDIA drivers via .run file from the NVIDIA website
Option 3: Install NVIDIA drivers using the APT package manager
CUDA is a parallel computing platform and application programming interface (API) model. It uses a CUDA-enabled graphics processing unit (GPU) for general-purpose processing. The CUDA platform provides direct access to the GPU virtual instruction set and parallel computation elements. For more information on CUDA unrelated to installing HEAVY.AI, see .
Open the CUDA Toolkit download page and select the desired CUDA Toolkit version to install.
Install the CUDA package for your platform and operating system according to the instructions on the NVIDIA website.
Check that the driver version you are downloading meets the HEAVY.AI minimum requirements.
Be careful when choosing the driver version to install. Ensure that your GPU model is supported and that the driver meets the HEAVY.AI minimum requirements.
For more information about troubleshooting Vulkan, see the section.
If you installed NVIDIA drivers using Option 1 above, the CUDA toolkit is already installed; you can proceed to the verification step below.
For more information, see C++ User-Defined Functions.
For more information on Docker installation, see the .
Install the NVIDIA driver and CUDA Toolkit using one of the options described above.
See also the note regarding the CUDA JIT Cache in Optimizing Performance.
For more information, see .
If you are on Enterprise or Free Edition, you need to validate your HEAVY.AI instance using your license key. You must skip this section if you are on Open Source Edition
Copy your Enterprise or Free Edition license key from the registration email message. If you don't have a license and you want to evaluate HEAVY.AI in an enterprise environment, contact your Sales Representative or register for your 30-day trial of Enterprise Edition. If you need a Free License, you can get one.
To verify that everything is working, load some sample data, perform a heavysql
query, and generate a Scatter Plot or a Bubble Chart using Heavy Immerse
GPU Count | GPU RAM (GB) (NVIDIA P40) | CPU RAM (GB) (8x GPU RAM) | "Hot" Records
1 | 24 | 192 | 417M
2 | 48 | 384 | 834M
3 | 72 | 576 | 1.25B
4 | 96 | 768 | 1.67B
5 | 120 | 960 | 2.09B
6 | 144 | 1,152 | 2.50B
7 | 168 | 1,344 | 2.92B
8 | 192 | 1,536 | 3.33B
12 | 288 | 2,304 | 5.00B
16 | 384 | 3,456 | 6.67B
20 | 480 | 3,840 | 8.34B
24 | 576 | 4,608 | 10.01B
28 | 672 | 5,376 | 11.68B
32 | 768 | 6,144 | 13.34B
40 | 960 | 7,680 | 16.68B
48 | 1,152 | 9,216 | 20.02B
56 | 1,344 | 10,752 | 23.35B
64 | 1,536 | 12,288 | 26.69B
128 | 3,072 | 24,576 | 53.38B
256 | 6,144 | 49,152 | 106.68B
GPU | Memory/GPU | Cores | Memory Bandwidth | NVLink
A100 | 40 to 80 GB | 6912 | 1134 GB/sec | Yes
V100 v2 | 32 GB | 5120 | 900 GB/sec | Yes
V100 | 16 GB | 5120 | 900 GB/sec | Yes
P100 | 16 GB | 3584 | 732 GB/sec | Yes
P40 | 24 GB | 3840 | 346 GB/sec | No
T4 | 16 GB | 2560 | 320 GB/sec | No
This procedure is considered experimental.
In some situations, you might not be able to upgrade NVIDIA CUDA drivers on a regular basis. To work around this issue, NVIDIA provides compatibility drivers that allow users to use newer features without requiring a full upgrade. For information about compatibility drivers, see https://docs.nvidia.com/deploy/cuda-compatibility/index.html.
Use the following commands to install the CUDA 11 compatibility drivers on Ubuntu:
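The commands were not preserved here; a sketch, assuming CUDA 11.4 and that the NVIDIA repository is already configured (the package name is an example):
sudo apt-get update
sudo apt-get install -y cuda-compat-11-4
# Confirm the CUDA version reported after installing the compatibility package
nvidia-smi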
After the last nvidia-smi
, ensure that CUDA shows the correct version.
The driver version will still show as the old version.
After installing the drivers, update the systemd files in /lib/systemd/system/heavydb.service.
In the [Service] section, add or update the Environment property.
The file should look like this:
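A minimal excerpt, assuming the compatibility libraries were installed under /usr/local/cuda/compat:
[Service]
Environment="LD_LIBRARY_PATH=/usr/local/cuda/compat"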
Then force the reload of the systemd configuration
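For example:
sudo systemctl daemon-reload
sudo systemctl restart heavydb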
Getting Started with AWS AMI
You can use the HEAVY.AI AWS AMI (Amazon Web Services Amazon Machine Image) to try HeavyDB and Heavy Immerse in the cloud. Perform visual analytics with the included New York Taxi database, or import and explore your own data.
Many options are available when deploying an AWS AMI. These instructions skip to the specific tasks you must perform to deploy a sample environment.
You need a security key pair when you launch your HEAVY.AI instance. If you do not have one, create one before you continue.
Go to the EC2 Dashboard.
Select Key Pairs under Network & Security.
Click Create Key Pair.
Enter a name for your key pair. For example, MyKey
.
Click Create. The key pair PEM file downloads to your local machine. For example, you would find MyKey.pem
in your Downloads
directory.
Go to the AWS Marketplace page for HEAVY.AI and select the version you want to use. You can get overview information about the product, see pricing, and get usage and support information.
Click Continue to Subscribe to subscribe.
Read the Terms and Conditions, and then click Continue to Configuration.
Select the Fulfillment Option, Software Version, and Region.
Click Continue to Launch.
On the Launch this software page, select Launch through EC2, and then click Launch.
From the Choose an Instance Type page, select an available EC2 instance type, and click Review and Launch.
Review the instance launch details, and click Launch.
Select a key pair, or click Create a key pair to create a new key pair and download it, and then click Launch Instances.
On the Launch Status page, click the instance name to see it on your EC2 Dashboard Instances page.
To connect to Heavy Immerse, you need your Public IP address and Instance ID for the instance you created. You can find these values on the Description tab for your instance.
To connect to Heavy Immerse:
Point your Internet browser to the public IP address for your instance, on port 6273. For example, for public IP 54.83.211.182
, you would use the URL https://54.83.211.182:6273
.
If you receive an error message stating that the connection is not private, follow the prompts onscreen to click through to the unsecured website. To secure your site, see Tips for Securing Your EC2 Instance.
Enter the USERNAME (admin), PASSWORD ( {Instance ID} ), and DATABASE (heavyai). If you are using the BYOL version, enter your license key in the key field and click Apply.
Click Connect.
On the Dashboards page, click NYC Taxi Rides. Explore and filter the chart information on the NYC Taxis Dashboard.
For more information on Heavy Immerse features, see Introduction to Heavy Immerse.
Working with your own familiar dataset makes it easier to see the advantages of HEAVY.AI processing speed and data visualization.
To import your own data to Heavy Immerse:
Export your data from your current datastore as a comma-separated value (CSV) or tab-separated value (TSV) file. HEAVY.AI supports Latin-1 ASCII format and UTF-8. If you want to load data with another encoding (for example, UTF-16), convert the data to UTF-8 before loading it to HEAVY.AI.
Point your Internet browser to the public IP address for your instance, on port 6273. For example, for public IP 54.83.211.182
, you would use the URL https://54.83.211.182:6273
.
Enter the USERNAME (admin) and PASSWORD ( {instance ID} ). If you are using the BYOL version, enter your license key in the key field and click Apply.
Click Connect.
Click Data Manager, and then click Import Data.
Drag your data file onto the table importer page, or use the directory selector.
Click Import Files.
Verify the column names and datatypes. Edit them if needed.
Enter a Name for your table.
Click Save Table.
Click Connect to Table.
On the New Dashboard page, click Add Chart.
Choose a chart type.
Add dimensions and measures as required.
Click Apply.
Enter a Name for your dashboard.
Click Save.
For more information, see Loading Data.
Follow these instructions to connect to your instance using SSH from MacOS or Linux. For information on connecting from Windows, see Connecting to Your Linux Instance from Windows Using PuTTY.
Open a terminal window.
Locate your private key file (for example, MyKey.pem). The wizard automatically detects the key you used to launch the instance.
Your key must not be publicly viewable for SSH to work. Use this command to change permissions, if needed:
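For example, using the key file name from the earlier step:
chmod 400 MyKey.pem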
Connect to your instance using its Public DNS. The default user name is centos
or ubuntu
, depending on the version you are using. For example:
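The host name below is a placeholder for your instance's Public DNS; use centos instead of ubuntu if your AMI is CentOS-based:
ssh -i MyKey.pem ubuntu@ec2-54-83-211-182.compute-1.amazonaws.com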
Use the following command to run the heavysql SQL command-line utility on HeavyDB. The default user is admin
and the default password is { Instance ID }:
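A typical invocation, assuming the standard install path (replace the password placeholder with your instance ID):
/opt/heavyai/bin/heavysql heavyai -u admin -p '<instance-id>'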
For more information, see heavysql.
This section provides a recipe for upgrading from the OmniSci platform 5.5+ to HEAVY.AI 6.0.
If your OmniSci version is older than 5.5, an intermediate upgrade step to version 5.5 is needed. Check the docs on how to do that upgrade.
If you are upgrading from OmniSci to HEAVY.AI, there are many additional steps compared to a simple sub-version upgrade.
IMPORTANT - Before you begin, stop all running services / Docker images of your OmniSci installation and create a backup of your $OMNISCI_STORAGE folder (typically /var/lib/omnisci). A backup is essential for recoverability; do not proceed with the upgrade without confirming that a full and consistent backup is available and ready to be restored.
The omnisci database is not automatically renamed to the new default name heavyai. This must be done manually, as documented in the upgrade steps.
Dumps created with the dump command on OmniSci cannot be restored after the database is upgraded to this version.
The following table describes the changes to environment variables, storage locations, and filenames in Release 6.0 compared to Release 5.x. Except where noted, revised storage subfolders, symlinks for old folder names, and filenames are created automatically on server start.
Change descriptions in bold require user intervention.
Change | Release 5.x | Release 6.0
Environment variable for storage location | $OMNISCI_STORAGE | $HEAVYAI_BASE
Default location for $HEAVYAI_BASE / $OMNISCI_STORAGE | /var/lib/omnisci | /var/lib/heavyai
Fixed location for Docker $HEAVYAI_BASE / $OMNISCI_STORAGE | /omnisci-storage | /var/lib/heavyai
The folder containing catalogs for $HEAVYAI_BASE / $OMNISCI_STORAGE | data/ | storage/
Storage subfolder - data | data/mapd_data | storage/data
Storage subfolder - catalog | data/mapd_catalogs | storage/catalogs
Storage subfolder - import | data/mapd_import | storage/import
Storage subfolder - export | data/mapd_export | storage/export
Storage subfolder - logs | data/mapd_log | storage/log
Server INFO logs | omnisci_server.INFO | heavydb.INFO
Server ERROR logs | omnisci_server.ERROR | heavydb.ERROR
Server WARNING logs | omnisci_server.WARNING | heavydb.WARNING
Web Server ACCESS logs | omnisci_web_server.ACCESS | heavy_web_server.ACCESS
Web Server ALL logs | omnisci_web_server.ALL | heavy_web_server.ALL
Install directory | /omnisci (Docker), /opt/omnisci (bare metal) | /opt/heavyai/ (Docker and bare metal)
Binary file - core server (located in install directory) | bin/omnisci_server | bin/heavydb
Binary file - web server (located in install directory) | bin/omnisci_web_server | bin/heavy_web_server
Binary file - command-line SQL utility | bin/omnisql | bin/heavysql
Binary file - JDBC jar | bin/omnisci-jdbc-5.10.2-SNAPSHOT.jar | bin/heavydb-jdbc-6.0.0-SNAPSHOT.jar
Binary file - Utilities (SqlImporter) jar | bin/omnisci-utility-5.10.2-SNAPSHOT.jar | bin/heavydb-utility-6.0.0-SNAPSHOT.jar
HEAVY.AI Server service (for bare metal install) | omnisci_server | heavydb
HEAVY.AI Web Server service (for bare metal install) | omnisci_web_server | heavy_web_server
Default configuration file | omnisci.conf | heavy.conf
The order of these instructions is significant. To avoid problems, follow the order of the instructions provided and do not skip any steps.
This upgrade procedure assumes that you are using the default storage locations for both OmniSci and HEAVY.AI.
$OMNISCI_STORAGE | $HEAVYAI_BASE
/var/lib/omnisci | /var/lib/heavyai
Stop all containers running Omnisci services.
In a terminal window, get the Docker container IDs:
You should see an output similar to the following. The first entry is the container ID. In this example, it is 9e01e520c30c
:
Stop the HEAVY.AI Docker container. For example:
Backup the Omnisci data directory (typically /var/lib/omnisci
).
Rename the Omnisci data directory to reflect the HEAVY.AI naming scheme.
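For example, assuming the default locations (the data subfolder is also renamed to storage, per the folder mapping table above):
sudo mv /var/lib/omnisci /var/lib/heavyai
sudo mv /var/lib/heavyai/data /var/lib/heavyai/storage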
Create a new configuration file for heavydb changing the data parameter to point to the renamed data directory.
Rename the Omnisci license file (EE and FREE only).
Download and run the 6.0 version of the HEAVY.AI Docker image.
Select the tab depending on the Edition (Enterprise, Free, or Open Source) and execution Device (GPU or CPU) you are upgrading.
Check that Docker is up and running using a docker ps
command:
You should see output similar to the following:
Using the new container ID, rename the default omnisci database to heavyai:
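A sketch of one way to do this; the exact invocation may differ in your image, and the container ID is the example ID used above:
echo "ALTER DATABASE omnisci RENAME TO heavyai;" | \
  docker exec -i 9e01e520c30c /opt/heavyai/bin/heavysql omnisci -u admin -p HyperInteractive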
Check that everything is running as expected.
To upgrade an existing system installed with package managers or a tarball, use the following commands. They upgrade HEAVY.AI in place without disturbing your configuration or stored data.
Stop the Omnisci services.
Backup the Omnisci data directory (typically /var/lib/omnisci
).
Create a user named heavyai
who will be the owner of the HEAVY.AI software and data on the filesystem.
Set a password for the user; you will need it when using sudo.
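One way to do this:
sudo useradd -m -U -s /bin/bash heavyai
sudo passwd heavyai
# Optionally allow the user to run administrative commands
sudo usermod -aG sudo heavyai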
Log in with the newly created user.
Rename the Omnisci data directory to reflect the HEAVY.AI naming scheme and change the ownership to heavyai user.
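For example, assuming the default locations:
sudo mv /var/lib/omnisci /var/lib/heavyai
sudo mv /var/lib/heavyai/data /var/lib/heavyai/storage
sudo chown -R heavyai:heavyai /var/lib/heavyai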
Check that everything is in order and that the semaphore directory has been created. All directories must belong to the heavyai user, and the catalogs directory must be present.
Rename the license file. (EE and FREE only)
Install the HEAVY.AI software, following all the instructions for your operating system (CentOS/RHEL or Ubuntu).
Follow all the steps in Installation and Configuration up to the Initialization step.
Log in with the heavyai user and be sure that the heavyai services are stopped.
Create a new configuration file for heavydb changing the data
parameter to point to the /var/lib/heavyai/storage
directory and the frontend
to the new install directory.
All the settings of the upgraded database will be moved to the new configuration file.
Now we have to complete the database migration.
Remove the semaphore directory.
To complete the upgrade, start the HEAVY.AI servers.
Check that the database migrated by running this command and checking for the Rebrand migration complete message.
Rename the default omnisci
database to heavyai.
Run the command using an administrative user (typically admin) with its password (default HyperInteractive).
Restart the database service and check that everything is running as expected.
After the checks confirm that the upgraded system is stable, clean up the system by removing the OmniSci installation and related system configuration. Permanently remove the service configuration.
Remove the installed software.
Delete the YUM or APT repositories.
This section provides a recipe for upgrading between fully compatible product versions.
As with any software upgrade, it is important that you back up your data before upgrading. Each release introduces efficiencies that are not necessarily compatible with earlier releases of the platform. HEAVY.AI is not expected to be backward compatible.
Back up the contents of your $HEAVYAI_STORAGE directory.
If you need to upgrade from Omnisci to HEAVY.AI 6.0 or later, please refer to the specific recipe.
Direct upgrades from OmniSci to HEAVY.AI versions later than 6.0 are not supported.
To upgrade HEAVY.AI in place in Docker
In a terminal window, get the Docker container ID.
You should see output similar to the following. The first entry is the container ID. In this example, it is 9e01e520c30c
:
Stop the HEAVY.AI Docker container. For example:
Optionally, remove the HEAVY.AI Docker container. This removes unused Docker containers on your system and saves disk space.
Back up the HEAVY.AI storage directory (typically /var/lib/heavyai).
Download the latest version of the HEAVY.AI Docker image for the Edition and device you are currently using. Select the tab depending on the Edition (Enterprise, Free, or Open Source) and execution device (GPU or CPU) you are upgrading.
If you don't want to upgrade to the latest version but to a specific one, change the latest tag to the version needed. For example, if the version needed is 6.0, use v6.0.0 as the version tag in the image name: heavyai/heavyai-ee-cuda:v6.0.0
Check that the container is up and running by using a docker ps command:
You should see an output similar to the following.
This runs both the HEAVY.AI database and Immerse in the same container.
You can optionally add --rm
to the Docker run
command so that the container is removed when it is stopped.
See also the note regarding the CUDA JIT Cache in Optimizing Performance.
To upgrade an existing system installed with package managers or a tarball, use the following commands. They upgrade HEAVY.AI in place without disturbing your configuration or stored data.
Stop the HEAVY.AI services.
Back up your $HEAVYAI_STORAGE directory (the default location is /var/lib/heavyai
).
Run the appropriate set of commands depending on the method used to install the previous version of the software.
Make a backup of your current installation.
Download and install the latest version, following the installation documentation for your operating system (CentOS/RHEL or Ubuntu).
When the upgrade is complete, start the HEAVY.AI services.
In this section, you will find recipes to upgrade from the OmniSci to the HEAVY.AI platform and upgrade between versions of the HEAVY.AI platform.
It is not always possible to upgrade directly from your current product version to the latest one; sometimes one or more intermediate upgrade steps are needed.
The following table shows the steps needed to move from one software version to another.
From | To | Upgrade path
OmniSci less than 5.5 | HEAVY.AI 7.0 | Upgrade to 5.5 --> 6.0 --> 7.0
OmniSci 5.5-5.10 | HEAVY.AI 7.0 | Upgrade to 6.0 --> 7.0
HEAVY.AI 6.0-6.4 | HEAVY.AI 7.0 | Upgrade to 7.0
Versions 5.x and 6.0.0 are not currently supported; use these only as needed to facilitate an upgrade to a supported version.
For example, if you are using an OmniSci version older than 5.5, you must upgrade to 5.5, then to 6.0, and then to 7.0. If you are using 6.0-6.4.4, you can upgrade to 7.0.0 in a single step.
HeavyDB includes the utilities for database initialization and for generating certificates and private keys for an HTTPS server.
Before using HeavyDB, initialize the data directory using initdb
:
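For example, assuming the default install and storage paths (add -f to overwrite an existing directory, or --skip-geo to omit the sample geospatial table):
/opt/heavyai/bin/initdb /var/lib/heavyai/storage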
This creates the following subdirectories:
catalogs
: Stores HeavyDB catalogs
data
: Stores HeavyDB data
log
: Contains all HeavyDB log files.
disk_cache
: Stores the data cached by HeavyConnect.
The -f
flag forces initdb
to overwrite existing data and catalogs in the specified directory.
By default, initdb
adds a sample table of geospatial data. Use the --skip-geo
flag if you prefer not to load sample geospatial data.
This command generates certificates and private keys for an HTTPS server. The options are:
[{-ca} <bool>]
: Whether this certificate should be its own Certificate Authority. The default is false
.
[{-duration} <duration>]
: Duration that the certificate is valid for. The default is 8760h0m0s
.
[{-ecdsa-curve} <string>]
: ECDSA curve to use to generate a key. Valid values are P224
, P256
, P384
, P521
.
[{-host} <string>]
: Comma-separated hostnames and IPs to generate a certificate for.
[{-rsa-bits} <int>]
: Size of RSA key to generate. Ignored if –ecdsa-curve is set. The default is 2048
.
[{-start-date} <string>]
: Start date formatted as Jan 1 15:04:05 2011
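A sketch of a typical invocation; the binary name and location are assumed from the standard install layout and are not spelled out in this extraction:
/opt/heavyai/bin/generate_cert -host "heavyai.mycompany.com" -duration 8760h0m0s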
This is a recipe to permanently remove HEAVY.AI Software, services, and data from your system.
To uninstall HEAVY.AI in Docker, stop and delete the current Docker container.
In a terminal window, get the Docker container ID:
You should see an output similar to the following. The first entry is the container ID. In this example, it is 9e01e520c30c
:
To see all containers, both running and stopped, use the following command:
Stop the HEAVY.AI Docker container. For example:
Remove the HEAVY.AI Docker container to save disk space. For example:
To uninstall an existing system installed with YUM, APT, or tarball, connect as the user that runs the platform, typically heavyai.
Disable and stop all HEAVY.AI services.
Remove the HEAVY.AI installation files ($HEAVYAI_PATH defaults to /opt/heavyai).
Delete the configuration files and the storage by removing the $HEAVYAI_BASE directory (defaults to /var/lib/heavyai).
Permanently remove the service configuration.
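A consolidated sketch of these steps, using the default paths and service names from this guide (the unit file names are assumptions):
sudo systemctl stop heavydb heavy_web_server
sudo systemctl disable heavydb heavy_web_server
sudo rm -rf /opt/heavyai
sudo rm -rf /var/lib/heavyai
sudo rm -f /lib/systemd/system/heavydb.service /lib/systemd/system/heavy_web_server.service
sudo systemctl daemon-reload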
HEAVY.AI uses the following ports.
HEAVY.AI features two system services: heavydb
and heavy_web_server
. You can start these services individually using systemd
.
Managing HeavyDB with systemd
For permanent installations of HeavyDB, HEAVY.AI recommends that you use systemd
to manage HeavyDB services. systemd
automatically handles tasks such as log management, starting the services on restart, and restarting the services if there is a problem.
In addition, systemd
manages the open-file limit in Linux. Some cloud providers and distributions set this limit too low, which can result in errors as your HEAVY.AI environment and usage grow. For more information about adjusting the limits on open files, see in .
You use the install_heavy_systemd.sh
script to prepare systemd
to run HEAVY.AI services. The script asks questions about your environment, then installs the systemd
service files in the correct location. You must run the script as the root user so that the script can perform tasks such as creating directories and changing ownership.
The install_heavy_systemd.sh
script asks for the information described in the following table.
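For example, assuming the script lives in the systemd folder of the install directory:
cd /opt/heavyai/systemd
sudo ./install_heavy_systemd.sh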
Starting HeavyDB Using systemd
To manually start HeavyDB using systemd
, run:
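For example, for the heavydb service (heavy_web_server is managed the same way):
sudo systemctl start heavydb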
Restarting HeavyDB Using systemd
You can use systemd
to restart HeavyDB — for example, after making configuration changes:
Stopping HeavyDB Using systemd
To manually stop HeavyDB using systemd
, run:
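For example:
sudo systemctl stop heavydb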
To enable the HeavyDB services to start on restart, run:
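For example:
sudo systemctl enable heavydb heavy_web_server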
You can customize the behavior of your HEAVY.AI servers by modifying your heavy.conf configuration file. See .
Port | Service | Use
6273 | heavy_web_server | Used to access Heavy Immerse.
6274 | heavydb tcp | Used by connectors (heavyai, omnisql, odbc, and jdbc) to access the more efficient Thrift API.
6276 | heavy_web_server | Used to access the HTTP/JSON Thrift API.
6278 | heavydb http | Used to directly access the HTTP/binary Thrift API, without having to proxy through heavy_web_server. Recommended for debugging use only.
Variable | Use | Default | Notes
HEAVYAI_PATH | Path to HeavyDB installation directory | Current install directory | HEAVY.AI recommends heavyai as the install directory.
HEAVYAI_BASE | Path to the storage directory for HeavyDB data and configuration files | heavyai | Must be dedicated to HEAVY.AI. The installation script creates the directory $HEAVYAI_STORAGE/data, generates an appropriate configuration file, and saves the file as $HEAVYAI_STORAGE/heavy.conf.
HEAVYAI_USER | User HeavyDB is run as | Current user | User must exist before you run the script.
HEAVYAI_GROUP | Group HeavyDB is run as | Current user's primary group | Group must exist before you run the script.
HEAVY.AI has minimal configuration requirements with a number of additional configuration options. This topic describes the required and optional configuration changes you can use in your HEAVY.AI instance.
In release 4.5.0 and higher, HEAVY.AI requires that all configuration flags used at startup match a flag on the HEAVY.AI server. If any flag is misspelled or invalid, the server does not start. This helps ensure that all settings are intentional and will not have an unexpected impact on performance or data integrity.
Before starting the HEAVY.AI server, you must initialize the persistent storage directory.
1. Create an empty directory at the desired path, such as /var/lib/heavyai, and create the environment variable $HEAVYAI_BASE:
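For example:
sudo mkdir -p /var/lib/heavyai
export HEAVYAI_BASE=/var/lib/heavyai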
2. Then, change the owner of the directory to the user that the server will run as ($HEAVYAI_USER):
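For example (a typical invocation; adjust if your user or path differs):
sudo chown -R $HEAVYAI_USER $HEAVYAI_BASE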
where $HEAVYAI_USER is the system user account that the server runs as, such as heavyai
, and $HEAVYAI_BASE is the path to the parent of the HEAVY.AI server storage directory.
3. Run $HEAVYAI_PATH/bin/initheavy with the storage directory path as the argument:
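For example, assuming the storage subfolder under $HEAVYAI_BASE:
sudo -u $HEAVYAI_USER $HEAVYAI_PATH/bin/initheavy $HEAVYAI_BASE/storage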
Immerse serves the application from the root path (/) by default. To serve the application from a sub-path, you must modify the $HEAVYAI_PATH/frontend/app-config.js file to change the IMMERSE_PATH_PREFIX value. The Heavy Immerse path must start with a forward slash (/).
The configuration file stores runtime options for your HEAVY.AI servers. You can use the file to change the default behavior.
The heavy.conf file is stored in the $HEAVYAI_BASE directory. The configuration settings are picked up automatically by the sudo systemctl start heavydb
and sudo systemctl start heavy_web_server
commands.
Set the flags in the configuration file using the format <flag> = <value>
. Strings must be enclosed in quotes.
The following is a sample configuration file. The entry for data
path is a string and must be in quotes. The last entry in the first section, for null-div-by-zero
, is the Boolean value true
and does not require quotes.
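A short illustrative example (the data entry and null-div-by-zero follow the description above; the remaining values are common defaults, shown only for context):
port = 6274
http-port = 6278
data = "/var/lib/heavyai/storage"
null-div-by-zero = true

[web]
port = 6273
frontend = "/opt/heavyai/frontend"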
To comment out a line in heavy.conf, prepend the line with the pound sign (#) character.
For encrypted backend connections, if you do not use a configuration file to start the database, Calcite expects passwords to be supplied through the command line, and calcite passwords will be visible in the processes table. If a configuration file is supplied, then passwords must be supplied in the file. If they are not, Calcite will fail.
Following are the parameters for runtime settings on HeavyDB. The parameter syntax provides both the implied value and the default value as appropriate. Optional arguments are in square brackets, while implied and default values are in parentheses.
For example, consider allow-loop-joins [=arg(=1)] (=0)
.
If you do not use this flag, loop joins are not allowed by default.
If you provide no arguments, the implied value is 1 (true) (allow-loop-joins
).
If you provide the argument 0, that is the same as the default (allow-loop-joins=0
).
If you provide the argument 1, that is the same as the implied value (allow-loop-joins=1
).
Flag
Description
Default Value
allow-cpu-retry [=arg]
Allow the queries that failed on GPU to retry on CPU, even when watchdog is enabled. When watchdog is enabled, most queries that run on GPU and throw a watchdog exception fail. Turn this on to allow queries that fail the watchdog on GPU to retry on CPU. The default behavior is for queries that run out of memory on GPU to throw an error if watchdog is enabled. Watchdog is enabled by default.
TRUE[1]
allow-local-auth-fallback
[=arg(=1)] (=0)
If SAML or LDAP logins are enabled, and the logins fail, this setting enables authentication based on internally stored login credentials. Command-line tools or other tools that do not support SAML might reject those users from logging in unless this feature is enabled. This allows a user to log in using credentials on the local database.
FALSE[0]
allow-loop-joins [=arg(=1)] (=0)
FALSE[0]
allowed-export-paths = ["root_path_1", root_path_2", ...]
Specify a list of allowed root paths that can be used in export operations, such as the COPY TO command. Helps prevent exploitation of security vulnerabilities and prevent server crashes, data breaches, and full remote control of the host machine. For example:
allowed-export-paths = ["/heavyai-storage/data/heavyai_export", "/home/centos"]
The list of paths must be on the same line as the configuration parameter.
Allowed file paths are enforced by default. The default export path (<data directory>/heavyai_export
) is allowed by default, and all child paths of that path are allowed.
When using commands with other paths, the provided paths must be under an allowed root path. If you try to use a nonallowed path in a COPY TO command, an error response is returned.
N/A
allow-s3-server-privileges
Allow S3 server privileges if IAM user credentials are not provided. Credentials can be specified with environment variables (such as AWS_ACCESS_KEY_ID, AWS_SECRET_ACCESS_KEY, and so on), an AWS credentials file, or when running on an EC2 instance, with an IAM role that is attached to the instance.
FALSE[0]
allowed-import-paths = ["root_path_1", "root_path_2", ...]
Specify a list of allowed root paths that can be used in import operations, such as the COPY FROM command. Helps prevent exploitation of security vulnerabilities and prevent server crashes, data breaches, and full remote control of the host machine.
For example:
allowed-import-paths = ["/heavyai-storage/data/heavyai_import", "/home/centos"]
The list of paths must be on the same line as the configuration parameter.
Allowed file paths are enforced by default. The default import path (<data directory>/heavyai_import
) is allowed by default, and all child paths of that allowed path are allowed.
When using commands with other paths, the provided paths must be under an allowed root path. If you try to use a nonallowed path in a COPY FROM command, an error response is returned.
N/A
approx_quantile_buffer
arg
Size of a temporary buffer that is used to copy in the data for APPROX_MEDIAN calculation. When full, is sorted before being merged into the internal distribution buffer configured in approx_quantile_centroids
.
1000
approx_quantile_centroids
arg
Size of the internal buffer used to approximate the distribution of the data for which the APPOX_MEDIAN calculation is taken. The larger the value, the greater the accuracy of the answer.
300
auth-cookie-name
arg
Configure the authentication cookie name. If not explicitly set, the default name is oat
.
oat
bigint-count [=arg]
Use 64-bit count. Disabled by default because 64-bit integer atomics are slow on GPUs. Enable this setting if you see negative values for a count, indicating overflow. In addition, if your data set has more than 4 billion records, you likely need to enable this setting.
FALSE[0]
bitmap-memory-limit
arg
Set the maximum amount of memory (in GB) allocated for APPROX_COUNT_DISTINCT bitmaps per execution kernel (thread or GPU).
8
calcite-max-mem arg
Max memory available to calcite JVM. Change if Calcite reports out-of-memory errors.
1024
calcite-port arg
Calcite port number. Change to avoid collisions with ports already in use.
6279
calcite-service-timeout
Service timeout value, in milliseconds, for communications with Calcite. On databases with large numbers of tables, large numbers of concurrent queries, or many parallel updates and deletes, Calcite might return less quickly. Increasing the timeout value can prevent THRIFT_EAGAIN timeout errors.
5000
columnar-large-projections[=arg]
Sets automatic use of columnar output, instead of row-wise output, for large projections.
TRUE
columnar-large-projections-threshold arg
Set the row-number threshold size for columnar output instead of row-wise output.
1000000
config arg
Path to heavy.conf. Change for testing and debugging.
$HEAVYAI_STORAGE/ heavy.conf
cpu-only
Run in CPU-only mode. Set this flag to force HeavyDB to run in CPU mode, even when GPUs are available. Useful for debugging and on shared-tenancy systems where the current HeavyDB instance does not need to run on GPUs.
FALSE
cpu-buffer-
mem-bytes arg
Size of memory reserved for CPU buffers [bytes]. Change to restrict the amount of CPU/system memory HeavyDB can consume. A default value of 0 indicates no limit on CPU memory use. (HEAVY.AI Server uses all available CPU memory on the system.)
0
cuda-block-size arg
Size of block to use on GPU. GPU performance tuning: Number of threads per block. Default of 0 means use all threads per block.
0
cuda-grid-size arg
Size of grid to use on GPU. GPU performance tuning: Number of blocks per device. Default of 0 means use all available blocks per device.
0
data arg
Directory path to HEAVY.AI catalogs. Change for testing and debugging.
$HEAVYAI_STORAGE
db-query-list arg
N/A
dynamic-watchdog-time-limit [=arg]
Dynamic watchdog time limit, in milliseconds. Change if dynamic watchdog is stopping queries expected to take longer than this limit.
100000
enable-auto-clear-render-mem [=arg]
Enable/disable clear render gpu memory on out-of-memory errors during rendering. If an out-of-gpu-memory exception is thrown while rendering, many users respond by running \clear_gpu
via the heavysql command-line interface to refresh/defrag the memory heap. This process can be automated with this flag enabled. At present, only GPU memory in the renderer is cleared automatically.
TRUE[1]
enable-auto-metadata-update [=arg]
Enable automatic metadata updates on UPDATE queries. Automatic metadata updates are turned on by default. Disabling may result in stale metadata and reductions in query performance.
TRUE[1]
enable-columnar-output [=arg]
Allows HEAVY.AI Core to directly materialize intermediate projections and the final ResultSet in Columnar format where appropriate. Columnar output is an internal performance enhancement that projects the results of an intermediate processing step in columnar format. Consider disabling this feature if you see unexpected performance regressions in your queries.
TRUE[1]
enable-data-recycler [=arg]
Set to TRUE to enable the data recycler. Enabling the recycler enables the following:
Hashtable recycler, which is the cache storage.
Hashing scheme recycler, which preserves a hashtable layout (such as perfect hashing and keyed hashing).
Overlaps hashtable tuning parameter recycler. Each overlap hashtable has its own parameters used during hashtable building.
TRUE[1]
enable-debug-timer [=arg]
Enable fine-grained query execution timers for debug. For debugging, logs verbose timing information for query execution (time to load data, time to compile code, and so on).
FALSE[0]
enable-direct-columnarization
[=arg(=1)](=0)
Columnarization organizes intermediate results in a multi-step query in the most efficient way for the next step in the process. If you see an unexpected performance regression, you can try setting this value to false, enabling the earlier HEAVY.AI columnarization behavior.
TRUE[1]
enable-dynamic-watchdog [=arg]
Enable dynamic watchdog.
FALSE[0]
enable-filter-push-down [=arg(=1)] (=0)
FALSE[0]
enable-foreign-table-scheduled-refresh
[=arg]
Enable scheduled refreshes of foreign tables. Enables automated refresh of foreign tables with "REFRESH_TIMING_TYPE" option of "SCHEDULED" based on the specified refresh schedule.
TRUE[1]
enable-geo-ops-on-uncompressed-coords [=arg(=1)] (=0)
Allow geospatial operations ST_Contains
and ST_Intersects
to process uncompressed coordinates where possible to increase execution speed.
Provides control over the selection of ST_Contains
and ST_Intersects
implementations. By default, for certain combinations of compressed geospatial arguments, such as ST_Contains(POLYGON, POINT)
, the implementation can process uncompressed coordinate values. This can result in much faster execution but could decrease precision. Disabling this option enables full decompression, which is slower but more precise.
TRUE[1]
enable-logs-system-tables [=arg(=1)] (=0)
Enable use of logs system tables. Also enables the Request Logs and Monitoring system dashboard (Enterprise Edition only).
FALSE[0]
enable-overlaps-hashjoin [=arg(=1)] (=0)
Enable the overlaps hash join framework allowing for range join (for example, spatial overlaps) computation using a hash table.
TRUE[1]
enable-runtime-query-interrupt [=arg(=1)] (=0)
FALSE[0]
enable-runtime-udf
Enable runtime user defined function registration. Enables runtime registration of user defined functions. This functionality is turned off unless you specifically request it, to prevent unintentional inclusion of nonstandard code. This setting is a precursor to more advanced object permissions planned in future releases.
FALSE[0]
enable-string-dict-hash-cache[=arg(=1)] (=0)
When importing a large table with low cardinality, set the flag to TRUE and leave it on to assist with bulk queries. If using String Dictionary Server, set the flag to FALSE if the String Dictionary server uses more memory than the physical system can support.
TRUE[1]
enable-thrift-logs [=arg(=1)] (=0)
Enable writing messages directly from Thrift to stdout/stderr. Change to enable verbose Thrift messages on the console.
FALSE[0]
enable-watchdog [arg]
Enable watchdog.
TRUE[1]
filter-push-down-low-frac
Higher threshold for selectivity of filters which are pushed down. Filters with selectivity lower than this threshold are considered for a push down.
filter-push-down-passing-row-ubound
Upper bound on the number of rows that should pass the filter if the selectivity is less than the high fraction threshold.
flush-log [arg]
Immediately flush logs to disk. Set to FALSE if this is a performance bottleneck.
TRUE[1]
from-table-reordering [=arg(=1)] (=1)
Enable automatic table reordering in FROM clause. Reorders the sequence of a join to place large tables on the inside of the join clause and smaller tables on the outside. HEAVY.AI also reorders tables between join clauses to prefer hash joins over loop joins. Change this value only in consultation with an HEAVY.AI engineer.
TRUE[1]
gpu-buffer-mem-bytes [=arg]
Size of memory reserved for GPU buffers in bytes per GPU. Change to restrict the amount of GPU memory HeavyDB can consume per GPU. A default value of 0 indicates no limit on GPU memory use (HeavyDB uses all available GPU memory across all active GPUs on the system).
0
gpu-input-mem-limit arg
Force query to CPU when input data memory usage exceeds this percentage of available GPU memory. HeavyDB loads data to GPU incrementally until data exceeds GPU memory, at which point the system retries on CPU. Loading data to GPU evicts any resident data already loaded or any query results that are cached. Use this limit to avoid attempting to load datasets to GPU when they obviously will not fit, preserving cached data on GPU and increasing query performance.
If watchdog is enabled and allow-cpu-retry
is not enabled, the query fails instead of re-running on CPU.
0.9
hashtable-cache-total-bytes [=arg]
The total size of the cache storage for hashtable recycler, in bytes. Increase the cache size to store more hashtables. Must be larger than or equal to the value defined in max-cacheable-hashtable-size-bytes
.
4294967296 (4GB)
hll-precision-bits [=arg]
Number of bits used from the hash value used to specify the bucket number. Change to increase or decrease approx_count_distinct()
precision. Increased precision decreases performance.
11
http-port arg
HTTP port number. Change to avoid collisions with ports already in use.
6278
idle-session-duration arg
Maximum duration of an idle session, in minutes. Change to increase or decrease duration of an idle session before timeout.
60
inner-join-fragment-skipping [=arg(=1)] (=0)
Enable or disable inner join fragment skipping. Enables skipping fragments for improved performance during inner join operations.
FALSE[0]
license arg
Path to the file containing the license key. Change if your license file is in a different location or has a different name.
log-auto-flush
Flush logging buffer to file after each message. Changing to false can improve performance, but log lines might not appear in the log for a very long time. HEAVY.AI does not recommend changing this setting.
TRUE[1]
log-directory arg
Path to the log directory. Can be either a relative path to the $HEAVYAI_STORAGE/data directory or an absolute path. Use this flag to control the location of your HEAVY.AI log files. If the directory does not exist, HEAVY.AI creates the top level directory. For example, a/b/c/logdir is created only if the directory path a/b/c already exists.
/var/lib/heavyai/ data/heavyai_log
log-file-name
Boilerplate for the name of the HEAVY.AI log files. You can customize the name of your HEAVY.AI log files. {SEVERITY} is the only braced token recognized. It allows you to create separate files for each type of error message greater than or equal to the log-severity configuration option.
heavydb.{SEVERITY}. %Y%m%d-%H%M%S.log
log-max-files
Maximum number of log files to keep. When the number of log files exceeds this number, HEAVY.AI automatically deletes the oldest files.
100
log-min-free-space
Minimum number of bytes left on device before oldest log files are deleted. This is a safety feature to be sure the disk drive of the log directory does not fill up, and guarantees that at least this many bytes are free.
20971520
log-rotation-size
Maximum file size in bytes before new log files are started. Change to increase/decrease size of files. If log files fill quickly, you might want to increase this number so that there are fewer log files.
10485760
log-rotate-daily
Start new log files at midnight. Set to false to write to log files until they are full, rather than restarting each day.
TRUE[1]
log-severity
Log to file severity levels:
DEBUG4
DEBUG3
DEBUG2
DEBUG1
INFO
WARNING
ERROR
FATAL
All levels at and above your chosen base severity level are logged. For example, if you set the severity level to WARNING, HEAVY.AI only logs WARNING, ERROR, and FATAL messages.
INFO
log-severity-clog
Log to console severity level: INFO WARNING ERROR FATAL. Output chosen severity messages to STDERR from running process.
WARNING
log-symlink
heavydb. {SEVERITY}.log
log-user-id
Log internal numeric user IDs instead of textual user names.
log-user-origin
Look up the origin of inbound connections by IP address and DNS name and print this information as part of stdlog. Some systems throttle DNS requests or have other network constraints that preclude timely return of user origin information. Set to FALSE to improve performance on those networks or when large numbers of users from different locations make rapid connect/disconnect requests to the server.
TRUE[1]
logs-system-tables-max-files-count [=arg]
Maximum number of log files that can be processed by each logs system table.
100
max-cacheable-hashtable-size-bytes [=arg]
Maximum size of the hashtable that the hashtable recycler can store. Limiting the size can enable more hashtables to be stored. Must be lesser than or equal to the value defined in hashtable-cache-total-bytes
.
2147483648 (2GB)
max-session-duration arg
Maximum duration of the active session, in minutes. Change to increase or decrease session duration before timeout.
43200 (30 days)
null-div-by-zero [=arg]
Allows processing to complete when the dataset would cause a divide-by-zero error. Set to TRUE if you prefer to return null when dividing by zero, and set to FALSE to throw an exception.
FALSE[0]
num-executors
arg
Beta functionality in Release 5.7. Set the number of executors.
num-gpus
arg
-1
num-reader-threads
arg
Number of reader threads to use. Drop the number of reader threads to prevent imports from using all available CPU power. Default is to use all threads.
0
overlaps-bucket-
threshold arg
The minimum size of a bucket corresponding to a given inner table range for the overlaps hash join.
-p | port int
HeavyDB server port. Change to avoid collisions with other services if 6274 is already in use.
6274
pending-query-interrupt-freq=
arg
Frequency with which to check the interrupt status of pending queries, in milliseconds. Values larger than 0 are valid. If you set pending-query-interrupt-freq=100
, each session's interrupt status is checked every 100 ms.
For example, assume you have three sessions (S1, S2, and S3) in your queue, and assume S1 contains a running query, and S2 and S3 hold pending queries. If you setpending-query-interrupt-freq=1000
both S2 and S3 are interrupted every 1000 ms (1 sec). See running-query-interrupt-freq
for information about interrupting running queries.
Decreasing the value increases the speed with which pending queries are removed, but also increases resource usage.
1000 (1 sec)
pki-db-client-auth [=
arg
]
Attempt authentication of users through a PKI certificate. Set to TRUE for the server to attempt PKI authentication.
FALSE[0]
read-only [=arg(=1)]
Enable read-only mode. Prevents changes to the dataset.
FALSE[0]
render-mem-bytes arg
Specifies the size of a per-GPU buffer that render query results are written to; allocated at the first rendering call. Persists while the server is running unless you run \clear_gpu_memory
. Increase if rendering a large number of points or symbols and you get the following out-of-memory exception: Not enough OpenGL memory to render the query results.
Default is 500 MB.
500000000
render-oom-retry-threshold = arg
A render execution time limit in milliseconds to retry a render request if an out-of-gpu-memory error is thrown. Requires enable-auto-clear-render-mem = true.
If enable-auto-clear-render-mem
= true, a retry of the render request can be performed after an out-of-gpu-memory exception. A retry only occurs if the first run took less than the threshold set here (in milliseconds). The retry is attempted after the render gpu memory is automatically cleared. If an OOM exception occurs, clearing the memory might get the request to succeed. Providing a reasonable threshold might give more stability to memory-constrained servers w/ rendering enabled. Only a single retry is attempted. A value of 0 disables retries.
rendering [=arg]
Enable or disable backend rendering. Disable rendering when not in use, freeing up memory reserved by render-mem-bytes
. To reenable rendering, you must restart HEAVY.AI Server.
TRUE[1]
res-gpu-mem =arg
Reserved memory for GPU. Reserves extra memory for your system (for example, if the GPU is also driving your display, such as on a laptop or single-card desktop). HEAVY.AI uses all the memory on the GPU except for render-mem-bytes
+ res-gpu-mem
. Also useful if other processes, such as a machine-learning pipeline, share the GPU with HEAVY.AI. In advanced rendering scenarios or distributed setups, increase to free up additional memory for the renderer, or for aggregating results for the renderer from multiple leaf nodes. HEAVY.AI recommends always setting res-gpu-mem
when using backend rendering.
134217728
running-query-interrupt-freq
arg
Controls the frequency of interruption status checking for running queries. Range: 0.0 (less frequently) to 1.0 (more frequently).
For example, if you have 10 threads evaluating a query on a table that has 1000 rows, each thread advances its thread index up to 10 times. In this case, if you set the flag close to 1.0, a session's interrupt status is checked on every increment of the thread index.
If you set the flag close to 0.0, the session's interrupt status is checked only when the index increment is close to 10. By default, the interrupt check occurs at about half of the maximum increment of the thread index.
Frequent interrupt status checking reduces interrupt latency but can also decrease query performance.
seek-kafka-commit = <N>
Set the offset of the last Kafka message to be committed from a Kafka data stream so that Kafka does not resend those messages. After the Kafka server commits messages through number N, it resends messages starting at message N+1. This is particularly useful when you want to create a replica of the HEAVY.AI server from an existing data directory.
N/A
ssl-cert path
Path to the server's public PKI certificate (.crt file). Used to establish an encrypted binary connection.
ssl-keystore path
Path to the server keystore, a Java trust store containing the server's public PKI key. Used by HeavyDB to establish an encrypted binary connection to the encrypted Calcite server port.
ssl-keystore-password password
The password for the SSL keystore. Used to create a binary encrypted connection to the Calcite server.
ssl-private-key path
Path to the server's private PKI key. Used to establish an encrypted binary connection.
ssl-trust-ca path
Enable use of CA-signed certificates presented by Calcite. Defines the file that contains trusted CA certificates. This information enables the server to validate the TCP/IP Thrift connections it makes as a client to the Calcite server. The certificate presented by the Calcite server is the same as the certificate used to identify the database server to its clients.
ssl-trust-ca-server path
ssl-trust-password password
Password for the SSL trust store containing the server's public PKI key. Used to establish an encrypted binary connection.
ssl-trust-store path
The path to the Java trust store containing the server's public PKI key. Used by the Calcite server to connect to the encrypted HeavyDB server port and establish an encrypted binary connection.
start-gpu arg
FALSE[0]
trivial-loop-join-threshold [=arg]
The maximum number of rows in the inner table of a loop join considered to be trivially small.
1000
use-hashtable-cache
Set to TRUE to enable the hashtable recycler. Supports complex scenarios, such as hashtable recycling for queries that have subqueries.
TRUE[1]
vacuum-min-selectivity [=arg]
Specify the percentage (with a value of 0 implying 0% and a value of 1 implying 100%) of deleted rows in a fragment at which to perform automatic vacuuming.
Automatic vacuuming occurs when deletes or updates on variable-length columns result in a percentage of deleted rows in a fragment exceeding the specified threshold. The default threshold is 10% of deleted rows in a fragment.
When changing this value, consider the most common types of queries run on the system. In general, if you have infrequent updates and deletes, set vacuum-min-selectivity
to a low value. Set it higher if you have frequent updates and deletes, because vacuuming adds overhead to affected UPDATE and DELETE queries.
watchdog-none-encoded-string-translation-limit [=arg]
The number of strings that can be cast using the ENCODED_TEXT string operator.
1,000,000
window-function-frame-aggregation-tree-fanout [=arg]
Fan-out of the aggregation tree used to compute aggregations over the window frame.
8
Following are additional parameters for runtime settings for the Enterprise Edition of HeavyDB. The parameter syntax provides both the implied value and the default value as appropriate. Optional arguments are in square brackets, while implied and default values are in parentheses.
Flag
Description
Default Value
cluster arg
Path to data leaves list JSON file. Indicates that the HEAVY.AI server instance is an aggregator node, and where to find the rest of its cluster. Change for testing and debugging.
$HEAVYAI_BASE
compression-limit-bytes [=arg(=536870912)] (=536870912)
Compress result sets that are transferred between leaves. Minimum length of payload above which data is compressed.
536870912
compressor arg (=lz4hc)
lz4hc
ldap-dn arg
LDAP Distinguished Name.
ldap-role-query-regex arg
RegEx to use to extract role from role query result.
ldap-role-query-url arg
LDAP query role URL.
ldap-superuser-role arg
The role name to identify a superuser.
ldap-uri arg
LDAP server URI.
leaf-conn-timeout [=arg]
Leaf connect timeout, in milliseconds. Increase or decrease to fail Thrift connections between HeavyDB instances more or less quickly if a connection cannot be established.
20000
leaf-recv-timeout [=arg]
Leaf receive timeout, in milliseconds. Increase or decrease to fail Thrift connections between HeavyDB instances more or less quickly if data is not received in the time allotted.
300000
leaf-send-timeout [=arg]
Leaf send timeout, in milliseconds. Increase or decrease to fail Thrift connections between HeavyDB instances more or less quickly if data is not sent in the time allotted.
300000
saml-metadata-file arg
Path to identity provider metadata file.
Required for running SAML. An identity provider (like Okta) supplies a metadata file. From this file, HEAVY.AI uses:
Public key of the identity provider to verify that the SAML response comes from it and not from somewhere else.
URL of the SSO login page used to obtain a SAML token.
saml-sp-target-url arg
URL of the service provider for which SAML assertions should be generated. Required for running SAML. Used to verify that a SAML token was issued for HEAVY.AI and not for some other service.
saml-sync-roles arg (=0)
Enable mapping of SAML groups to HEAVY.AI roles. The SAML Identity provider (for example, Okta) automatically creates users at login and assigns them roles they already have as groups in SAML.
saml-sync-roles [=0]
string-servers arg
Path to the string servers list JSON file. Required to designate a leaf server when HeavyDB is running in distributed mode.
HEAVY.AI supports data security using a set of database object access privileges granted to users or roles.
When you create a database, the admin
superuser is created by default. The admin
superuser is granted all privileges on all database objects. Superusers can create new users that, by default, have no database object privileges.
Superusers can grant users selective access privileges on multiple database objects using two mechanisms: role-based privileges and user-based privileges.
Grant roles access privileges on database objects.
Grant roles to users.
Grant roles to other roles.
When a user has privilege requirements that differ from role privileges, you can grant privileges directly to the user. These mechanisms provide data security for many users and classes of users to access the database.
You have the following options for granting privileges:
Each object privilege can be granted to one or many roles, or to one or many users.
A role and/or user can be granted privileges on one or many objects.
A role can be granted to one or many users or other roles.
A user can be granted one or many roles.
This supports the following many-to-many relationships:
Objects and roles
Objects and users
Roles and users
These relationships provide flexibility and convenience when granting/revoking privileges to and from users.
Granting object privileges to roles and users, and granting roles to users, has a cumulative effect. The result of several grant commands is a combination of all individual grant commands. This applies to all database object types and to privileges inherited by objects. For example, object privileges granted to the object of database type are propagated to all table-type objects of that database object.
Only a superuser or an object owner can grant privileges on an object.
A superuser has all privileges on all database objects.
A non-superuser user has only those privileges on a database object that are granted by a superuser.
A non-superuser user has ALL
privileges on a table created by that user.
Roles can be created and dropped at any time.
Object privileges and roles can be granted or revoked at any time, and the action takes effect immediately.
Privilege state is persistent and restored if the HEAVY.AI session is interrupted.
There are five database object types, each with its own privileges.
ACCESS - Connect to the database. The ACCESS privilege is a prerequisite for all other privileges at the database level. Without the ACCESS privilege, a user or role cannot perform tasks on any other database objects.
ALL - Allow all privileges on this database except issuing grants and dropping the database.
SELECT, INSERT, TRUNCATE, UPDATE, DELETE - Allow these operations on any table in the database.
ALTER SERVER - Alter servers in the current database.
CREATE SERVER - Create servers in the current database.
CREATE TABLE - Create a table in the current database. (Also CREATE.)
CREATE VIEW - Create a view for the current database.
CREATE DASHBOARD - Create a dashboard for the current database.
DELETE DASHBOARD - Delete a dashboard for this database.
DROP SERVER - Drop servers from the current database.
DROP - Drop a table from the database.
DROP VIEW - Drop a view for this database.
EDIT DASHBOARD - Edit a dashboard for this database.
SELECT VIEW - Select a view for this database.
SERVER USAGE - Use servers (through foreign tables) in the current database.
VIEW DASHBOARD - View a dashboard for this database.
VIEW SQL EDITOR - Access the SQL Editor in Immerse for this database.
Users with SELECT privilege on views do not require SELECT privilege on underlying tables referenced by the view to retrieve the data queried by the view. View queries work without error whether or not users have direct access to referenced tables. This also applies to views that query tables in other databases.
To create views, users must have SELECT privilege on queried tables in addition to the CREATE VIEW privilege.
SELECT, INSERT, TRUNCATE, UPDATE, DELETE - Allow these SQL statements on this table.
DROP - Drop this table.
Users with SELECT privilege on views do not require SELECT privilege on underlying tables referenced by the view to retrieve the data queried by the view. View queries work without error whether or not users have direct access to referenced tables. This also applies to views that query tables in other databases.
To create views, users must have SELECT privilege on queried tables in addition to the CREATE VIEW privilege.
SELECT - Select from this view. Users do not need privileges on objects referenced by this view.
DROP - Drop this view.
Users with SELECT privilege on views do not require SELECT privilege on underlying tables referenced by the view to retrieve the data queried by the view. View queries work without error whether or not users have direct access to referenced tables. This also applies to views that query tables in other databases.
To create views, users must have SELECT privilege on queried tables in addition to the CREATE VIEW privilege.
VIEW - View this dashboard.
EDIT - Edit this dashboard.
DELETE - Delete this dashboard.
DROP - Drop this server from the current database.
ALTER - Alter this server in the current database.
USAGE - Use this server (through foreign tables) in the current database.
Privileges granted on a database-type object are inherited by all tables of that database.
The following example shows a valid sequence for granting access privileges to non-superuser user1
by granting a role to user1
and by directly granting a privilege. This example presumes that table1
and user1
already exist, and that user1
has ACCESS privileges on the database where table1
exists.
Create the r_select
role.
Grant the SELECT privilege on table1
to the r_select
role. Any user granted the r_select
role gains the SELECT privilege.
Grant the r_select
role to user1
, giving user1
the SELECT privilege on table1
.
Directly grant user1
the INSERT privilege on table1
.
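A sketch of this sequence in heavysql, using the names from the example above:
CREATE ROLE r_select;
GRANT SELECT ON TABLE table1 TO r_select;
GRANT r_select TO user1;
GRANT INSERT ON TABLE table1 TO user1;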
Create a role. Roles are granted to users for role-based database object access.
This clause requires superuser privilege and <roleName> must not exist.
<roleName>
Name of the role to create.
Create a payroll department role called payrollDept.
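For example, the command might look like this:
CREATE ROLE payrollDept;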
Remove a role.
This clause requires superuser privilege and <roleName> must exist.
<roleName>
Name of the role to drop.
Remove the payrollDept role.
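For example, the command might look like this:
DROP ROLE payrollDept;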
Grant role privileges to users and to other roles.
The ACCESS privilege is a prerequisite for all other privileges at the database level. Without the ACCESS privilege, a user or role cannot perform tasks on any other database objects.
This clause requires superuser privilege. The specified <roleNames> and <userNames> must exist.
<roleNames>
Names of roles to grant to users and other roles. Use commas to separate multiple role names.
<userNames>
Names of users. Use commas to separate multiple user names.
Assign payrollDept role privileges to user dennis.
Grant payrollDept and accountsPayableDept role privileges to users dennis and mike and role hrDept.
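A sketch of the preceding examples in heavysql:
GRANT payrollDept TO dennis;
GRANT payrollDept, accountsPayableDept TO dennis, mike, hrDept;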
Remove role privilege from users or from other roles. This removes database object access privileges granted with the role.
This clause requires superuser privilege. The specified <roleNames> and <userNames> must exist.
<roleNames>
Names of roles to remove from users and other roles. Use commas to separate multiple role names.
<userName>
Names of the users. Use commas to separate multiple user names.
Remove payrollDept role privileges from user dennis.
Revoke payrollDept and accountsPayableDept role privileges from users dennis and fred and role hrDept.
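A sketch of the preceding examples in heavysql:
REVOKE payrollDept FROM dennis;
REVOKE payrollDept, accountsPayableDept FROM dennis, fred, hrDept;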
Define the privilege(s) a role or user has on the specified table. You can specify any combination of the INSERT
, SELECT
, DELETE
, UPDATE
, DROP
, or TRUNCATE
privilege or specify all privileges.
The ACCESS privilege is a prerequisite for all other privileges at the database level. Without the ACCESS privilege, a user or role cannot perform tasks on any other database objects.
This clause requires superuser privilege, or <tableName> must have been created by the user invoking this command. The specified <tableName> and users or roles defined in <entityList> must exist.
<privilegeList>
<tableName>
Name of the database table.
<entityList>
Name of entity or entities to be granted the privilege(s).
Permit all privileges on the employees
table for the payrollDept role.
Permit SELECT-only privilege on the employees
table for user chris.
Permit INSERT-only privilege on the employees
table for the hrdept and accountsPayableDept roles.
Permit INSERT, SELECT, and TRUNCATE privileges on the employees
table for the role hrDept and for users dennis and mike.
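A sketch of the preceding examples, using the GRANT ... ON TABLE syntax described later in this topic:
GRANT ALL ON TABLE employees TO payrollDept;
GRANT SELECT ON TABLE employees TO chris;
GRANT INSERT ON TABLE employees TO hrdept, accountsPayableDept;
GRANT INSERT, SELECT, TRUNCATE ON TABLE employees TO hrDept, dennis, mike;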
Remove the privilege(s) a role or user has on the specified table. You can remove any combination of the INSERT
, SELECT
, DELETE
, UPDATE
, or TRUNCATE
privileges, or remove all privileges.
This clause requires superuser privilege or <tableName> must have been created by the user invoking this command. The specified <tableName> and users or roles in <entityList> must exist.
<privilegeList>
<tableName>
Name of the database table.
<entityList>
Name of entities to be denied the privilege(s).
Prohibit SELECT and INSERT operations on the employees
table for the nonemployee role.
Prohibit SELECT operations on the directors
table for the employee role.
Prohibit INSERT operations on the directors
table for role employee and user laura.
Prohibit INSERT, SELECT, and TRUNCATE privileges on the employees
table for the role nonemployee and for users dennis and mike.
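A sketch of the preceding examples in heavysql:
REVOKE SELECT, INSERT ON TABLE employees FROM nonemployee;
REVOKE SELECT ON TABLE directors FROM employee;
REVOKE INSERT ON TABLE directors FROM employee, laura;
REVOKE INSERT, SELECT, TRUNCATE ON TABLE employees FROM nonemployee, dennis, mike;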
Define the privileges a role or user has on the specified view. You can specify any combination of the SELECT
, INSERT
, or DROP
privileges, or specify all privileges.
This clause requires superuser privileges, or <viewName> must have been created by the user invoking this command. The specified <viewName> and users or roles in <entityList> must exist.
<privilegeList>
<viewName>
Name of the database view.
<entityList>
Name of entities to be granted the privileges.
Permit SELECT, INSERT, and DROP privileges on the employees
view for the payrollDept role.
Permit SELECT-only privilege on the employees
view for the employee role and user venkat.
Permit INSERT and DROP privileges on the employees
view for the hrDept and acctPayableDept roles and users simon and dmitri.
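A sketch of the preceding examples, using the GRANT ... ON VIEW form parallel to the table syntax:
GRANT SELECT, INSERT, DROP ON VIEW employees TO payrollDept;
GRANT SELECT ON VIEW employees TO employee, venkat;
GRANT INSERT, DROP ON VIEW employees TO hrDept, acctPayableDept, simon, dmitri;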
Remove the privileges a role or user has on the specified view. You can remove any combination of the INSERT
, DROP
, or SELECT
privileges, or remove all privileges.
This clause requires superuser privilege, or <viewName> must have been created by the user invoking this command. The specified <viewName> and users or roles in <entityList> must exist.
<privilegeList>
<viewName>
Name of the database view.
<entityList>
Name of entity to be denied the privilege(s).
Prohibit SELECT, DROP, and INSERT operations on the employees
view for the nonemployee role.
Prohibit SELECT operations on the directors
view for the employee role.
Prohibit INSERT and DROP operations on the directors
view for the employee and manager role and for users ashish and lindsey.
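A sketch of the preceding examples in heavysql:
REVOKE SELECT, DROP, INSERT ON VIEW employees FROM nonemployee;
REVOKE SELECT ON VIEW directors FROM employee;
REVOKE INSERT, DROP ON VIEW directors FROM employee, manager, ashish, lindsey;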
Define the valid privileges a role or user has on the specified database. You can specify any combination of privileges, or specify all privileges.
The ACCESS privilege is a prerequisite for all other privileges at the database level. Without the ACCESS privilege, a user or role cannot perform tasks on any other database objects.
This clause requires superuser privileges.
<privilegeList>
<dbName>
Name of the database, which must exist, created by CREATE DATABASE.
<entityList>
Name of the entity to be granted the privilege.
Permit all operations on the companydb
database for the payrollDept role and user david.
Permit SELECT-only operations on the companydb
database for the employee role.
Permit INSERT, UPDATE, and DROP operations on the companydb
database for the hrdept and manager role and for users irene and stephen.
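A sketch of the preceding examples, using the GRANT ... ON DATABASE syntax described later in this topic:
GRANT ALL ON DATABASE companydb TO payrollDept, david;
GRANT SELECT ON DATABASE companydb TO employee;
GRANT INSERT, UPDATE, DROP ON DATABASE companydb TO hrdept, manager, irene, stephen;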
Remove the operations a role or user can perform on the specified database. You can specify privileges individually or specify all privileges.
This clause requires superuser privilege or the user must own the database object. The specified <dbName> and roles or users in <entityList> must exist.
<privilegeList>
<dbName>
Name of the database.
<entityList>
Prohibit all operations on the employees
database for the nonemployee role.
Prohibit SELECT operations on the directors
database for the employee role and for user monica.
Prohibit INSERT, DROP, CREATE, and DELETE operations on the directors
database for employee role and for users max and alex.
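A sketch of the preceding examples in heavysql:
REVOKE ALL ON DATABASE employees FROM nonemployee;
REVOKE SELECT ON DATABASE directors FROM employee, monica;
REVOKE INSERT, DROP, CREATE, DELETE ON DATABASE directors FROM employee, max, alex;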
Define the valid privileges a role or user has for working with servers. You can specify any combination of privileges or specify all privileges.
This clause requires superuser privileges, or <serverName> must have been created by the user invoking the command.
<privilegeList>
<serverName>
Name of the server, which must exist on the current database, created by CREATE SERVER ON DATABASE.
<entityList>
Grant DROP privilege on server parquet_s3_server
to user fred:
Grant ALTER privilege on server parquet_s3_server
to role payrollDept:
Grant USAGE and ALTER privileges on server parquet_s3_server
to role payrollDept and user jamie:
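A sketch of the preceding examples, using the GRANT ... ON SERVER form parallel to the other object types:
GRANT DROP ON SERVER parquet_s3_server TO fred;
GRANT ALTER ON SERVER parquet_s3_server TO payrollDept;
GRANT USAGE, ALTER ON SERVER parquet_s3_server TO payrollDept, jamie;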
Remove privileges a role or user has for working with servers. You can specify any combination of privileges or specify all privileges.
This clause requires superuser privileges, or <serverName> must have been created by the user invoking the command.
<privilegeList>
<serverName>
Name of the server, which must exist on the current database, created by CREATE SERVER ON DATABASE.
<entityList>
Revoke DROP privilege on server parquet_s3_server
for user inga:
Revoke ALTER privilege on server parquet_s3_server
for role payrollDept:
Revoke USAGE and ALTER privileges on server parquet_s3_server
for role payrollDept and user marvin:
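A sketch of the preceding examples in heavysql:
REVOKE DROP ON SERVER parquet_s3_server FROM inga;
REVOKE ALTER ON SERVER parquet_s3_server FROM payrollDept;
REVOKE USAGE, ALTER ON SERVER parquet_s3_server FROM payrollDept, marvin;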
Define the valid privileges a role or user has for working with dashboards. You can specify any combination of privileges or specify all privileges.
This clause requires superuser privileges.
<privilegeList>
<dashboardId>
ID of the dashboard, which must exist, created by CREATE DASHBOARD. To show a list of all dashboards and IDs in heavysql, run the \dash
command when logged in as superuser.
<entityList>
Permit all privileges on the dashboard ID 740
for the payrollDept role.
Permit VIEW-only privilege on dashboard 730
for the hrDept role and user dennis.
Permit EDIT and DELETE privileges on dashboard 740
for the hrDept and accountsPayableDept roles and for user pavan.
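A sketch of the preceding examples, using the GRANT ... ON DASHBOARD syntax described later in this topic:
GRANT ALL ON DASHBOARD 740 TO payrollDept;
GRANT VIEW ON DASHBOARD 730 TO hrDept, dennis;
GRANT EDIT, DELETE ON DASHBOARD 740 TO hrDept, accountsPayableDept, pavan;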
Remove privileges a role or user has for working with dashboards. You can specify any combination of privileges, or all privileges.
This clause requires superuser privileges.
<privilegeList>
<dashboardId>
ID of the dashboard, which must exist, created by CREATE DASHBOARD.
<entityList>
Revoke DELETE privileges on dashboard 740
for the payrollDept role.
Revoke all privileges on dashboard 730
for hrDept role and users dennis and mike.
Revoke EDIT and DELETE of dashboard 740
for the hrDept and accountsPayableDept roles and for users dante and jonathan.
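A sketch of the preceding examples in heavysql:
REVOKE DELETE ON DASHBOARD 740 FROM payrollDept;
REVOKE ALL ON DASHBOARD 730 FROM hrDept, dennis, mike;
REVOKE EDIT, DELETE ON DASHBOARD 740 FROM hrDept, accountsPayableDept, dante, jonathan;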
The following privilege levels are typically recommended for non-superusers in Immerse. Privileges assigned for users in your organization may vary depending on access requirements.
These examples assume that tables table1
through table4
are created as needed:
The following examples show how to work with users, roles, tables, and dashboards.
Use the \dash
command to list all dashboards and their unique IDs in HEAVY.AI:
Here, the Marketing_Summary
dashboard uses table2
as a data source. The role marketingDeptRole2
has select privileges on that table. Grant view access on the Marketing_Summary
dashboard to marketingDeptRole2
:
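For example, if the \dash output shows that Marketing_Summary has dashboard ID 1 (the ID shown here is illustrative), the grant might look like this:
GRANT VIEW ON DASHBOARD 1 TO marketingDeptRole2;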
The following table shows the roles and privileges for each user created in the previous example.
Use the following commands to list current roles and assigned privileges. If you have superuser access, you can see privileges for all users. Otherwise, you can see only those roles and privileges for which you have access.
Results for users, roles, privileges, and object privileges are returned in creation order.
Lists all dashboards and dashboard IDs in HEAVY.AI. Requires superuser privileges. Dashboard privileges are assigned by dashboard ID because dashboard names may not be unique.
Example
heavysql> \dash database heavyai
Dashboard ID | Dashboard Name | Owner
1 | Marketing_Summary | heavyai
Reports all privileges granted to the specified object for all roles and users. If the specified objectName does not exist, no results are reported. Used for databases and tables only.
Example
Reports all object privileges granted to the specified role or user. The roleName or userName specified must exist.
Example
Reports all roles granted to the given user. The userName specified must exist.
Example
Reports all roles.
Example
Lists all users.
Example
The following example demonstrates field-level security using two views:
view_users_limited
, in which users only see three of seven fields: userid
, First_Name
, and Department
.
view_users_full
, in which users see all seven fields.
User readonly1
sees no tables, only the specific view granted, and only the three specific columns returned in the view:
User readonly2
sees no tables, only the specific view granted, and all seven columns returned in the view:
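A sketch of how such views and grants might be defined; the underlying table name users is an assumption for illustration:
CREATE VIEW view_users_limited AS SELECT userid, First_Name, Department FROM users;
CREATE VIEW view_users_full AS SELECT * FROM users;
GRANT SELECT ON VIEW view_users_limited TO readonly1;
GRANT SELECT ON VIEW view_users_full TO readonly2;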
Apache Kafka is a distributed streaming platform. It allows you to create publishers, which create data streams, and consumers, which subscribe to and ingest the data streams produced by publishers.
You can use the KafkaImporter C++ program included with HeavyDB to consume a topic created by running Kafka shell scripts from the command line. Follow the procedure below to use a Kafka producer to send data, and a Kafka consumer to store the data, in HeavyDB.
This example assumes you have already installed and configured Apache Kafka. See the Apache Kafka documentation.
Create a sample topic for your Kafka producer.
Run the kafka-topics.sh
script with the following arguments:
Create a file named myfile
that consists of comma-separated data. For example:
Use heavysql
to create a table to store the stream.
Load your file into the Kafka producer.
Create and start a producer using the following command.
Load the data to HeavyDB using the Kafka console consumer and the KafkaImporter
program.
Pull the data from Kafka into the KafkaImporter
program.
Verify that the data arrived using heavysql
.
HEAVY.AI can accept a set of encrypted credentials for secure authentication of a custom application. This topic provides a method for providing an encryption key to generate encrypted credentials and configuration options for enabling decryption of those encrypted credentials.
Generate a 128- or 256-bit encryption key and save it to a file. You can use to generate a suitable encryption key.
Set the file path of the encryption key file to the encryption-key-file-path
web server parameter in heavyai.conf:
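A minimal sketch of the entry, assuming the key was saved to an illustrative path; adjust the path to match your environment:
[web]
encryption-key-file-path = "/home/heavyai/encryption_key.txt"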
Alternatively, you can set the path using the --encryption-key-file-path=path/to/file
command-line argument.
Generate encrypted credentials for a custom application by running the following Go program, replacing the example key and credentials strings with an actual key and actual credentials. You can also run the program in a web browser at .
Follow these instructions to start an HEAVY.AI server with an encrypted main port.
You need the following PKI (Public Key Infrastructure) components to implement a Secure Binary Interface.
A CRT (short for certificate) file containing the server's PKI certificate. This file must be shared with the clients that connect using encrypted communications. Ideally, this file is signed by a recognized certificate issuing agency.
A key file containing the server's private key. Keep this file secret and secure.
A Java TrustStore containing the server's PKI certificate. The password for the trust store is also required.
Although in this instance the trust store contains only information that can be shared, the Java TrustStore program requires it to be password protected.
A Java KeyStore and password.
In a distributed system, add the configuration parameters to the heavyai.conf file on the aggregator and all leaf nodes in your HeavyDB cluster.
You can use OpenSSL utilities to create the various PKI elements. The server certificate in this instance is self-signed and should not be used in a production system.
Generate a new private key.
Use the private key to generate a certificate signing request.
Self sign the certificate signing request to create a public certificate.
Use the Java tools to create a key store from the public certificate.
To generate a keystore file from your server key:
Copy server.key to server.txt. Concatenate it with server.crt.
Use server.txt to create a PKCS12 file.
Use server.p12 to create a keystore.
Start the server using the following options.
Alternatively, you can add the following configuration parameters to heavyai.conf to establish a Secure Binary Interface. The following configuration flags implement the same encryption shown in the runtime example above:
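A sketch of the heavyai.conf entries, using illustrative file names based on the steps above and an illustrative password:
ssl-cert = "server.crt"
ssl-private-key = "server.key"
ssl-trust-store = "truststore.jks"
ssl-trust-password = "mypassword"
ssl-keystore = "server.jks"
ssl-keystore-password = "mypassword"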
Passwords for the SSL truststore and keystore can be enclosed in single (') or double (") quotes.
The server.crt
file and the Java truststore contain the same public key information in different formats. Both are required by the server to establish secure client communication with its various interfaces and with its Calcite server. At startup, the Java truststore is passed to the Calcite server for authentication and to encrypt its traffic with the HEAVY.AI server.
HEAVY.AI supports LDAP authentication using an IPA Server or Microsoft Active Directory.
You can configure HEAVY.AI Enterprise edition to map LDAP roles 1-to-1 to HEAVY.AI roles. When you enable this mapping, LDAP becomes the main authority controlling user roles in HEAVY.AI.
LDAP mapping is available only in HEAVY.AI Enterprise edition.
HEAVY.AI supports five configuration settings that allow you to integrate with your LDAP server.
To find the ldap-role-query-url
and ldap-role-query-regex
to use, query your user roles. For example, if there is a user named kiran on the IPA LDAP server ldap://myldapserver.mycompany.com
, you could use the following curl command to get the role information:
When successful, it returns information similar to the following:
ldap-dn
matches the DN, which is uid=kiran,cn=users,cn=accounts,dc=mycompany,dc=com
.
ldap-role-query-url
includes the LDAP URI + the DN + the LDAP attribute that represents the role/group the member belongs to, such as memberOf.
ldap-role-query-regex
is a regular expression that matches the role names. The matching role names are used to grant and revoke privileges in HEAVY.AI. For example, if we created some roles on an IPA LDAP server where the role names begin with MyCompany_ (for example, MyCompany_Engineering, MyCompany_Sales, MyCompany_SuperUser), the regular expression can filter the role names using MyCompany_.
ldap-superuser-role
is the role/group name for HEAVY.AI users who are superusers once they log on to the HEAVY.AI database. In this example, the superuser role name is MyCompany_SuperUser.
Make sure that LDAP configuration appears before the [web]
section of heavy.conf
.
Double quotes are not required for LDAP properties in heavy.conf
. For example, both of the following are valid:
ldap-uri = "ldap://myldapserver.mycompany.com"
ldap-uri = ldap://myldapserver.mycompany.com
To integrate LDAP with HEAVY.AI, you need the following:
A functional LDAP server, with all users/roles/groups created (ldap-uri
, ldap-dn
, ldap-role-query-url
, ldap-role-query-regex
, and ldap-superuser-role
) to be used by HEAVY.AI. You can use the curl
command to test and find the filters.
A functional HEAVY.AI server, version 4.1 or higher.
Once you have your server information, you can configure HEAVY.AI to use LDAP authentication.
Locate the heavy.conf
file and edit it to include the LDAP parameter. For example:
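A sketch of the LDAP entries for the example IPA server described above; the DN, query URL, and role names are illustrative and must match your directory layout:
ldap-uri = ldap://myldapserver.mycompany.com
ldap-dn = uid=kiran,cn=users,cn=accounts,dc=mycompany,dc=com
ldap-role-query-url = ldap://myldapserver.mycompany.com/uid=kiran,cn=users,cn=accounts,dc=mycompany,dc=com?memberOf
ldap-role-query-regex = MyCompany_
ldap-superuser-role = MyCompany_SuperUser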
Restart the HEAVY.AI server:
Log on to heavysql
as MyCompany user, or any user who belongs to one of the roles/groups that match the filter.
When you use LDAP authentication, the default admin user and password HyperInteractive do not work unless you create the admin user with the same password on the LDAP server.
If your login fails, inspect $HEAVYAI_STORAGE/mapd_log/heavyai_server.INFO
to check for any obvious errors about LDAP authentication.
Once you log in, you can create a new role name in heavysql
, and then apply GRANT/REVOKE privileges to the role. Log in as another user with that role and confirm that GRANT/REVOKE works.
If you refresh the browser window, you are required to log in and reauthenticate.
To use LDAPS, HEAVY.AI must trust the LDAP server's SSL certificate. To achieve this, you must have the CA for the server's certificate, or the server certificate itself. Install the certificate as a trusted certificate.
To use IPA as your LDAP server with HEAVY.AI running on CentOS 7:
Copy the IPA server CA certificate to your local machine.
Update the PKI certificates.
Edit /etc/openldap/ldap.conf
to add the following line.
Locate the heavy.conf
file and edit it to include the LDAP parameter. For example:
Restart the HEAVY.AI server:
To use IPA as your LDAP server with HEAVY.AI running on Ubuntu:
Copy the IPA server CA certificate to your local machine.
Rename ipa-ca.crm
to ipa-ca.crt
so that the certificates bundle update script can find it:
Update the PKI certificates:
Edit /etc/openldap/ldap.conf
to add the following line:
Locate the heavy.conf
file and edit it to include the LDAP parameter. For example:
Restart the HEAVY.AI server:
1. Locate the heavy.conf
file and edit it to include the LDAP parameter.
Example 1:
Example 2:
2. Restart the HEAVY.AI server:
Other LDAP user authentication attributes, such as userPrincipalName, are not currently supported.
Security Assertion Markup Language (SAML) is used for exchanging authentication and authorization data between security domains. SAML uses security tokens containing assertions (statements that service providers use to make decisions about access control) to pass information about a principal (usually an end user) between a SAML authority, named an Identity Provider (IdP), and a SAML consumer, named a Service Provider (SP). SAML enables web-based, cross-domain, single sign-on (SSO), which helps reduce the administrative overhead of sending multiple authentication tokens to the user.
If you use SAML for authentication to HEAVY.AI, and SAML login fails, HEAVY.AI automatically falls back to log in using LDAP if it is configured.
If both SAML and LDAP authentication fail, you are authenticated against a locally stored password, but only if the allow-local-auth-fallback
flag is set.
These instructions use Okta as the IdP and HEAVY.AI as the SP in an SP-initiated workflow, similar to the following:
A user uses a login page to connect to HEAVY.AI.
The HEAVY.AI login page redirects the user to the Okta login page.
The user signs in using an Okta account. (This step is skipped if the user is already logged in to Okta.)
Okta returns a base64-encoded SAML Response to the user, which contains a SAML Assertion that the user is allowed to use HEAVY.AI. If configured, it also returns a list of SAML Groups assigned to the user.
Okta redirects the user to the HEAVY.AI login page together with the SAML response (a token).
HEAVY.AI verifies the token and retrieves the user name and groups. Authentication and authorization are complete.
In addition to Okta, the following SAML providers are also supported:
1) Log into your Okta account and click the Admin button.
2) From the Applications menu, select Applications.
3) Click the Add Application button.
4) On the Add Application screen, click Create New App.
5) On the Create a New Application Integration page, set the following details:
Platform: Web
Sign on Method: SAML 2.0
And then, click Create.
6) On the Create SAML Integration page, in the App name field, type Heavyai and click Next.
7) In the SAML Settings page, enter the following information:
Audience URI (SP Entity ID): Your Heavy Immerse web URL with the suffix saml-post.
Default RelayState: Forward slash (/).
Application username: HEAVY.AI recommends using the email address you used to log in to Okta.
Leave other settings at their default values, or change as required for your specific installation.
After making your selections, click Next.
8) In the Help Okta Support... page, click I'm an Okta customer adding an internal app. All other questions on this page are optional.
After making your selections, click Finish.
Your application is now registered and displayed, and the Sign On tab is selected.
Before configuring SAML, make sure that HTTPS is enabled on your web server.
On the Sign On tab, configure SAML settings for your application:
1) On the Settings page, click View Setup Instructions.
2) On the How to Configure SAML 2.0 for HEAVY.AI Application page, scroll to the bottom, copy the XML fragment in the Provide the following IDP metadata to your SP provider box, and save it as a raw text file called idp.xml.
3) Upload idp.xml to your HEAVY.AI server in $HEAVYAI_STORAGE.
4) Edit heavy.conf and add the following configuration parameters:
saml-metadata-file
: Path to the idp.xml file you created.
saml-sp-target-url
: Web URL to your Heavy Immerse saml-post endpoint.
saml-signed-assertion
: Boolean value that determines whether Okta signs the assertion; true by default.
saml-signed-response
: Boolean value that determines whether Okta signs the response; true by default.
For example:
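A sketch with illustrative paths and URLs; substitute the actual location of idp.xml and your Heavy Immerse URL:
saml-metadata-file = "/var/lib/heavyai/storage/idp.xml"
saml-sp-target-url = "https://heavyai.mycompany.com/saml-post"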
In the web section, add the full physical path to the servers.json file; for example:
5) On the How to Configure SAML 2.0 for HEAVY.AI Application page, copy the Identity Provider Single Sign-On URL, which looks similar to this:
6) If the servers.json file you identified in the [web] section of heavy.conf does not exist, create it. In servers.json, include the SAMLurl property, using the same value you copied in Identity Provider Single Sign-On URL. For example:
7) Restart the heavyai_server and heavyai_web_server services.
Users can be automatically created in HEAVY.AI based on group membership:
1) Go to the Application Configuration page for the HEAVY.AI application in Okta.
2) On the General tab, scroll to the SAML Settings section and click the Edit button.
3) Click the Next button, and then in the Group Attribute Statements section, set the following:
Name: Groups
Filter: Set to the desired filter type to determine the set of groups delivered to HEAVY.AI through the SAML response. In the text box next to the Filter type drop-down box, enter the text that defines the filter.
Click Next, and then click Finish.
Any group that requires access to HEAVY.AI must be created in HEAVY.AI before users can log in.
Modify your heavyai.conf file by adding the following parameter:
The heavyai.conf entries now look like this:
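For example, continuing the illustrative values above:
saml-metadata-file = "/var/lib/heavyai/storage/idp.xml"
saml-sp-target-url = "https://heavyai.mycompany.com/saml-post"
saml-sync-roles = true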
Restart the heavyai_server and heavyai_web_server processes.
Users whose group membership in Okta contains a group name that exists in HeavyDB can log in and have the privileges assigned to their groups.
1) On the Okta website, on the Assignments tab, click Assign > Assign to People.
2) On the Assign HEAVY.AI to People panel, click the Assign button next to users that you want to provide access to HEAVY.AI.
3) Click Save and Go Back to assign HEAVY.AI to the user.
4) Repeat steps 2 and 3 for all users to whom you want to grant access. Click Done when you are finished.
Verify that the SAML is configured correctly by opening your Heavy Immerse login page. You should be automatically redirected to the Okta login page, and then back to Immerse, without entering credentials.
When you log out of Immerse, you see the following screen:
Logging out of Immerse does not log you out of Okta. If you log back in to Immerse and are still logged in to Okta, you do not need to reauthenticate.
If authentication fails, you see this error message when you attempt to log in through Okta:
To resolve the authentication error:
Add the license information by either:
Adding heavyai.license to your HEAVY.AI data directory.
Logging in to HeavyDB and running the following command:
Reattempt login through Okta.
Information about authentication errors can be found in the log files.
Following are the parameters for runtime settings on HeavyAI Web Server. The parameter syntax provides both the implied value and the default value as appropriate. Optional arguments are in square brackets, while implied and default values are in parentheses.
Enables all join queries to fall back to the loop join implementation. During a loop join, queries loop over all rows from all tables involved in the join and evaluate the join condition. By default, loop joins are allowed only if the number of rows in the inner table is fewer than the trivial-loop-join-threshold, because loop joins are computationally expensive and can run for an extended period. Modifying trivial-loop-join-threshold is a safer alternative to globally enabling loop joins. You might choose to globally enable loop joins when you have many small tables for which loop join performance has been determined to be acceptable, but for which modifying the threshold would be tedious.
Path to file containing HEAVY.AI queries. Use a query list to autoload data to GPU memory on startup to speed performance. See .
Enable filter push-down through joins. Evaluates filters in the query expression for selectivity and pushes down highly selective filters into the join according to selectivity parameters. See also
Enable the runtime query interrupt. Setting to TRUE can reduce performance slightly. Use with running-query-interrupt-freq to set the interrupt frequency.
Symbolic link to the active log. Creates a symbolic link for every severity greater than or equal to the configuration option.
Number of GPUs to use. In a shared environment, you can assign the number of GPUs to a particular application. The default, -1, uses all available GPUs. Use in conjunction with start-gpu.
Path to the file containing trusted CA certificates; for PKI authentication. Used to validate certificates submitted by clients. If the certificate provided by the client (in the password
field of the connect
command) was not signed by one of the certificates in the trusted file, then the connection fails.
PKI authentication works only if the server is configured to encrypt connections via TLS. The common name extracted from the client certificate is used as the name of the user to connect. If this name does not already exist, the connection fails. If LDAP or SAML are also enabled, the servers fall back to these authentication methods if PKI authentication fails.
Currently works only with clients. To allow connection from other clients, set allow-local-auth-fallback
or add LDAP/SAML authentication.
First GPU to use. Used in shared environments in which the first assigned GPU is not GPU 0. Use in conjunction with num-gpus.
Compressor algorithm to be used by the server to compress data being transferred between servers. See for compression algorithm options.
See for a more complete example.
Begin by adding your SAML application in Okta. If you do not have an Okta account, you can sign up on the .
Single sign on URL: Your Heavy Immerse web URL with the suffix saml-post; for example, . Select the Use this for Recipient URL and Destination URL checkbox.
User accounts assigned to the HEAVY.AI application in Okta must exist in HEAVY.AI before a user can log in. To have users created automatically based on their group membership, see .
SQL
Description
Create role.
Drop role.
Grant role to user or to another role.
Revoke role from user or from another role.
Grant role privilege(s) on a database table to a role or user.
Revoke role privilege(s) on database table from a role or user.
Grant role privilege(s) on a database view to a role or user.
Revoke role privilege(s) on database view from a role or user.
Grant role privilege(s) on database to a role or user.
Revoke role privilege(s) on database from a role or user.
Grant role privilege(s) on server to a role or user.
Revoke role privilege(s) on server from a role or user.
Grant role privilege(s) on dashboard to a role or user.
Revoke role privilege(s) on dashboard from a role or user.
Parameter Value
Descriptions
ALL
Grant all possible access privileges on <tableName> to <entityList>.
ALTER TABLE
Grant ALTER TABLE privilege on <tableName> to <entityList>.
DELETE
Grant DELETE privilege on <tableName> to <entityList>.
DROP
Grant DROP privilege on <tableName> to <entityList>.
INSERT
Grant INSERT privilege on <tableName> to <entityList>.
SELECT
Grant SELECT privilege on <tableName> to <entityList>.
TRUNCATE
Grant TRUNCATE privilege on <tableName> to <entityList>.
UPDATE
Grant UPDATE privilege on <tableName> to <entityList>.
Parameter Value
Descriptions
role
Name of role.
user
Name of user.
Parameter Value
Descriptions
ALL
Remove all access privilege for <entityList> on <tableName>.
ALTER TABLE
Remove ALTER TABLE privilege for <entityList> on <tableName>.
DELETE
Remove DELETE privilege for <entityList> on <tableName>.
DROP
Remove DROP privilege for <entityList> on <tableName>.
INSERT
Remove INSERT privilege for <entityList> on <tableName>.
SELECT
Remove SELECT privilege for <entityList> on <tableName>.
TRUNCATE
Remove TRUNCATE privilege for <entityList> on <tableName>.
UPDATE
Remove UPDATE privilege for <entityList> on <tableName>.
Parameter Value
Descriptions
role
Name of role.
user
Name of user.
Parameter Value
Descriptions
ALL
Grant all possible access privileges on <viewName> to <entityList>.
DROP
Grant DROP privilege on <viewName> to <entityList>.
INSERT
Grant INSERT privilege on <viewName> to <entityList>.
SELECT
Grant SELECT privilege on <viewName> to <entityList>.
Parameter Value
Descriptions
role
Name of role.
user
Name of user.
Parameter Value
Descriptions
ALL
Remove all access privilege for <entityList> on <viewName>.
DROP
Remove DROP privilege for <entityList> on <viewName>.
INSERT
Remove INSERT privilege for <entityList> on <viewName>.
SELECT
Remove SELECT privilege for <entityList> on <viewName>.
Parameter Value
Descriptions
role
Name of role.
user
Name of user.
Parameter Value
Descriptions
ACCESS
Grant ACCESS (connection) privilege on <dbName> to <entityList>.
ALL
Grant all possible access privileges on <dbName> to <entityList>.
ALTER TABLE
Grant ALTER TABLE privilege on <dbName> to <entityList>.
ALTER SERVER
Grant ALTER SERVER privilege on <dbName> to <entityList>.
CREATE SERVER
Grant CREATE SERVER privilege on <dbName> to <entityList>;
CREATE TABLE
Grant CREATE TABLE privilege on <dbName> to <entityList>. Previously CREATE
.
CREATE VIEW
Grant CREATE VIEW privilege on <dbName> to <entityList>.
CREATE DASHBOARD
Grant CREATE DASHBOARD privilege on <dbName> to <entityList>.
CREATE
Grant CREATE privilege on <dbName> to <entityList>.
DELETE
Grant DELETE privilege on <dbName> to <entityList>.
DELETE DASHBOARD
Grant DELETE DASHBOARD privilege on <dbName> to <entityList>.
DROP
Grant DROP privilege on <dbName> to <entityList>.
DROP SERVER
Grant DROP SERVER privilege on <dbName> to <entityList>.
DROP VIEW
Grant DROP VIEW privilege on <dbName> to <entityList>.
EDIT DASHBOARD
Grant EDIT DASHBOARD privilege on <dbName> to <entityList>.
INSERT
Grant INSERT privilege on <dbName> to <entityList>.
SELECT
Grant SELECT privilege on <dbName> to <entityList>.
SELECT VIEW
Grant SELECT VIEW privilege on <dbName> to <entityList>.
SERVER USAGE
Grant SERVER USAGE privilege on <dbName> to <entityList>.
TRUNCATE
Grant TRUNCATE privilege on <dbName> to <entityList>.
UPDATE
Grant UPDATE privilege on <dbName> to <entityList>.
VIEW DASHBOARD
Grant VIEW DASHBOARD privilege on <dbName> to <entityList>.
VIEW SQL EDITOR
Grant VIEW SQL EDITOR privilege in Immerse on <dbName> to <entityList>.
Parameter Value
Descriptions
role
Name of role, which must exist.
user
Name of user, which must exist. See Users and Databases.
Parameter Value
Descriptions
ACCESS
Remove ACCESS (connection) privilege on <dbName> from <entityList>.
ALL
Remove all possible privileges on <dbName> from <entityList>.
ALTER SERVER
Remove ALTER SERVER privilege on <dbName> from <entityList>
ALTER TABLE
Remove ALTER TABLE privilege on <dbName> from <entityList>.
CREATE TABLE
Remove CREATE TABLE privilege on <dbName> from <entityList>. Previously CREATE
.
CREATE VIEW
Remove CREATE VIEW privilege on <dbName> from <entityList>.
CREATE DASHBOARD
Remove CREATE DASHBOARD privilege on <dbName> from <entityList>.
CREATE
Remove CREATE privilege on <dbName> from <entityList>.
CREATE SERVER
Remove CREATE SERVER privilege on <dbName> from <entityList>.
DELETE
Remove DELETE privilege on <dbName> from <entityList>.
DELETE DASHBOARD
Remove DELETE DASHBOARD privilege on <dbName> from <entityList>.
DROP
Remove DROP privilege on <dbName> from <entityList>.
DROP SERVER
Remove DROP SERVER privilege on <dbName> from <entityList>.
DROP VIEW
Remove DROP VIEW privilege on <dbName> from <entityList>.
EDIT DASHBOARD
Remove EDIT DASHBOARD privilege on <dbName> from <entityList>.
INSERT
Remove INSERT privilege on <dbName> from <entityList>.
SELECT
Remove SELECT privilege on <dbName> from <entityList>.
SELECT VIEW
Remove SELECT VIEW privilege on <dbName> from <entityList>.
SERVER USAGE
Remove SERVER USAGE privilege on <dbName> from <entityList>.
TRUNCATE
Remove TRUNCATE privilege on <dbName> from <entityList>.
UPDATE
Remove UPDATE privilege on <dbName> from <entityList>.
VIEW DASHBOARD
Remove VIEW DASHBOARD privilege on <dbName> from <entityList>.
VIEW SQL EDITOR
Remove VIEW SQL EDITOR privilege in Immerse on <dbName> from <entityList>.
Parameter Value
Descriptions
role
Name of role.
user
Name of user.
Parameter Value
Descriptions
DROP
Grant DROP privileges on <serverName> on current database to <entityList>.
ALTER
Grant ALTER privilege on <serverName> on current database to <entityList>.
USAGE
Grant USAGE privilege (through foreign tables) on <serverName> on current database to <entityList>.
Parameter Value
Descriptions
role
Name of role, which must exist.
user
Name of user, which must exist. See Users and Databases.
Parameter Value
Descriptions
DROP
Remove DROP privileges on <serverName> on current database for <entityList>.
ALTER
Remove ALTER privilege on <serverName> on current database for <entityList>.
USAGE
Remove USAGE privilege (through foreign tables) on <serverName> on current database for <entityList>.
Parameter Value
Descriptions
role
Name of role, which must exist.
user
Name of user, which must exist. See Users and Databases.
Parameter Value
Descriptions
ALL
Grant all possible access privileges on <dashboardId> to <entityList>.
CREATE
Grant CREATE privilege to <entityList>.
DELETE
Grant DELETE privilege on <dashboardId> to <entityList>.
EDIT
Grant EDIT privilege on <dashboardId> to <entityList>.
VIEW
Grant VIEW privilege on <dashboardId> to <entityList>.
Parameter Value
Descriptions
role
Name of role, which must exist.
user
Name of user, which must exist. See Users and Databases.
Parameter Value
Descriptions
ALL
Revoke all possible access privileges on <dashboardId> for <entityList>.
CREATE
Revoke CREATE privilege for <entityList>.
DELETE
Revoke DELETE privilege on <dashboardId> for <entityList>.
EDIT
Revoke EDIT privilege on <dashboardId> for <entityList>.
VIEW
Revoke VIEW privilege on <dashboardId> for <entityList>.
Parameter Value
Descriptions
role
Name of role, which must exist.
user
Name of user, which must exist. See Users and Databases.
Privilege
Command Syntax to Grant Privilege
Access a database
GRANT ACCESS ON DATABASE <dbName> TO <entityList>;
Create a table
GRANT CREATE TABLE ON DATABASE <dbName> TO <entityList>;
Select a table
GRANT SELECT ON TABLE <tableName> TO <entityList>;
View a dashboard
GRANT VIEW ON DASHBOARD <dashboardId> TO <entityList>;
Create a dashboard
GRANT CREATE DASHBOARD ON DATABASE <dbName> TO <entityList>;
Edit a dashboard
GRANT EDIT ON DASHBOARD <dashboardId> TO <entityList>;
Delete a dashboard
GRANT DELETE DASHBOARD ON DATABASE <dbName> TO <entityList>;
User
Roles Granted
Table Privileges
salesDeptEmployee1
salesDeptRole1
SELECT on Tables 1, 3
salesDeptEmployee2
salesDeptRole2
SELECT on Table 3
salesDeptEmployee3
salesDeptRole2
SELECT on Table 3
salesDeptEmployee4
salesDeptRole3
SELECT on Table 4
salesDeptManagerEmployee5
salesDeptRole1, salesDeptRole2, salesDeptRole3
SELECT on Tables 1, 3, 4
marketingDeptEmployee1
marketingDeptRole1
SELECT on Tables 1, 2
marketingDeptEmployee2
marketingDeptRole2
SELECT on Table 2
marketingDeptManagerEmployee3
marketingDeptRole1, marketingDeptRole2, salesDeptRole1, salesDeptRole2, salesDeptRole3
SELECT on Tables 1, 2, 3, 4
Flag
Description
Default
additional-file-upload-extensions <string>
Denote additional file extensions for uploads. Has no effect if --enable-upload-extension-check
is not set.
allow-any-origin
Allows for a CORS exception to the same-origin policy. Required to be true if Immerse is hosted on a domain or subdomain different from the one hosting heavy_web_server and heavydb.
Allowing any origin is a less secure mode than what heavy_web_server requires by default.
--allow-any-origin = false
-b | backend-url <string>
URL to http-port on heavydb. Change to avoid collisions with other services.
http://localhost:6278
-B | binary-backend-url <string>
URL to http-binary-port on heavydb.
http://localhost:6276
cert string
Certificate file for HTTPS. Change for testing and debugging.
cert.pem
-c | config <string>
Path to HeavyDB configuration file. Change for testing and debugging.
-d | data <string>
Path to HeavyDB data directory. Change for testing and debugging.
data
data-catalog <string>
Path to data catalog directory.
n/a
docs string
Path to documentation directory. Change if you move your documentation files to another directory.
docs
enable-binary-thrift
Use the binary thrift protocol.
TRUE[1]
enable-browser-logs [=arg]
Enable access to current log files via web browser. Only superusers (while logged in) can access log files.
Log files are available at http[s]://host:port/logs/log_name.
Web server log files: ACCESS - http[s]://host:port/logs/access; ALL - http[s]://host:port/logs/all.
HeavyDB log files: INFO - http[s]://host:port/logs/info; WARNING - http[s]://host:port/logs/warning; ERROR - http[s]://host:port/logs/error.
FALSE[0]
enable-cert-verification
TLS certificate verification is a security measure that can be disabled for cases where TLS certificates are not issued by a trusted certificate authority. If you use a locally or unofficially generated TLS certificate to secure the connection between heavydb and heavy_web_server, this parameter must be set to false. heavy_web_server expects a trusted certificate authority by default.
--enable-cert-verification = true
enable-cross-domain [=arg]
Enable frontend cross-domain authentication. Cross-domain session cookies require the SameSite = None; Secure
headers. Can only be used with HTTPS domains; requires enable-https
to be true.
FALSE[0]
enable-https
Enable HTTPS support. Change to enable secure HTTP.
enable-https-authentication
Enable PKI authentication.
enable-https-redirect [=arg]
Enable a new port that heavy_web_server listens on for incoming HTTP requests. When received, it returns a redirect response to the HTTPS port and protocol, so that browsers are immediately and transparently redirected. Use to provide an HEAVY.AI front end that can run on both the HTTP protocol (http://my-heavyai-frontend.com) on default HTTP port 80, and on the primary HTTPS protocol (https://my-heavyai-frontend.com) on default https port 443, and have requests to the HTTP protocol automatically redirected to HTTPS. Without this, requests to HTTP fail. Assuming heavy_web_server can attach to ports below 1024, the configuration would be: enable-https-redirect = TRUE http-to-https-redirect-port = 80
FALSE[0]
enable-non-kernel-time-query-interrupt
Enable non-kernel-time query interrupt.
TRUE[1]
enable-runtime-query-interrupt
Enable runtime query interrupt.
TRUE[1]
enable-upload-extension-check
Disables restrictive file extension upload check.
encryption-key-file-path <string>
Path to the file containing the credential payload cipher key. Key must be 256 bits in length.
-f | frontend string
Path to frontend directory. Change if you move the location of your frontend UI files.
frontend
http-to-https-redirect-port = arg
Configures the http (incoming) port used by enable-https-redirect. The port option specifies the redirect port number. Use to provide an HEAVY.AI front end that can run on both the HTTP protocol (http://my-heavyai-frontend.com) on default HTTP port 80, and on the primary HTTPS protocol (https://my-heavyai-frontend.com) on default https port 443, and have requests to the HTTP protocol automatically redirected to HTTPS. Without this, requests to HTTP fail. Assuming heavy_web_server can attach to ports below 1024, the configuration would be: enable-https-redirect = TRUE http-to-https-redirect-port = 80
6280
idle-session-duration = arg
Idle session default, in minutes.
60
jupyter-prefix-string <string>
Jupyter Hub base_url for Jupyter integration.
/jupyter
jupyter-url-string <string>
URL for Jupyter integration.
-j |jwt-key-file
Path to a key file for client session encryption.
The file is expected to be a PEM-formatted (.pem) certificate file containing the unencrypted private key in PKCS #1, PKCS #8, or ASN.1 DER form.
Example PEM file creation using OpenSSL.
Required only if using a high-availability server configuration or another server configuration that requires an instance of Immerse to talk to multiple heavy_web_server instances.
Each heavy_web_server instance needs to use the same encryption key to encrypt and decrypt client session information which is used for session persistence ("sessionization") in Immerse.
key <string>
Key file for HTTPS. Change for testing and debugging.
key.pem
max-tls-version
Refers to the version of TLS encryption used to secure web protocol connections. Specifies a maximum TLS version.
min-tls-version
Refers to the version of TLS encryption used to secure web protocol connections. Specifies a minimum TLS version.
--min-tls-version = VersionTLS12
peer-cert <string>
Peer CA certificate PKI authentication.
peercert.pem
-p | port int
Frontend server port. Change to avoid collisions with other services.
6273
-r | read-only
Enable read-only mode. Prevent changes to the data.
secure-acao-uri
If set, ensures that all Access-Control-Allow-Origin headers are set to the value provided.
servers-json <string>
Path to servers.json. Change for testing and debugging.
session-id-header <string>
Session ID header.
immersesid
ssl-cert <string>
SSL validated public certificate.
sslcert.pem
ssl-private-key <string>
SSL private key file.
sslprivate.key
strip-x-headers <strings>
List of custom X http request headers to be removed from incoming requests. Use --strip-x-headers=""
to allow all X headers through.
[X-HeavyDB-Username]
timeout duration
Maximum request duration in #h#m#s
format. For example 0h30m0s
represents a duration of 30 minutes. Controls the maximum duration of individual HTTP requests. Used to manage resource exhaustion caused by improperly closed connections.
This also limits the execution time of queries made over the Thrift HTTP transport. Increase the duration if queries are expected to take longer than the default duration of one hour; for example, if you COPY FROM a large file when using heavysql with the HTTP transport.
1h0m0s
tls-cipher-suites <strings>
Refers to the combination of algorithms used in TLS encryption to secure web protocol connections.
All available TLS cipher suites compatible with HTTP/2:
TLS_RSA_WITH_RC4_128_SHA
TLS_RSA_WITH_AES_128_CBC_SHA
TLS_ECDHE_RSA_WITH_AES_128_GCM_SHA256
TLS_ECDHE_ECDSA_WITH_AES_128_GCM_SHA256
TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384
TLS_ECDHE_ECDSA_WITH_AES_256_GCM_SHA384
TLS_ECDHE_RSA_WITH_CHACHA20_POLY1305
TLS_ECDHE_ECDSA_WITH_CHACHA20_POLY1305
TLS_AES_128_GCM_SHA256
TLS_AES_256_GCM_SHA384
TLS_CHACHA20_POLY1305_SHA256
TLS_FALLBACK_SCSV
Limit security vulnerabilities by specifying the allowed TLS ciphers in the encryption used to secure web protocol connections.
The following cipher suites are accepted by default:
TLS_ECDHE_RSA_WITH_AES_128_GCM_SHA256
TLS_ECDHE_ECDSA_WITH_AES_128_GCM_SHA256
TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384
TLS_RSA_WITH_AES_256_GCM_SHA384
tls-curves <strings>
Refers to the types of Elliptic Curve Cryptography (ECC) used in TLS encryption to secure web protocol connections.
All available TLS elliptic Curve IDs:
secp256r1 (Curve ID P256)
CurveP256 (Curve ID P256)
secp384r1 (Curve ID P384)
CurveP384 (Curve ID P384)
secp521r1 (Curve ID P521)
CurveP521 (Curve ID P521)
x25519 (Curve ID X25519)
X25519 (Curve ID X25519)
Limit security vulnerabilities by specifying the allowed TLS curves in the encryption used to secure web protocol connections.
The following TLS curves are accepted by default:
CurveP521
CurveP384
CurveP256
tmpdir string
Path for temporary file storage. Used as a staging location for file uploads. Consider locating this directory on the same file system as the HEAVY.AI data directory. If not specified on the command line, heavy_web_server recognizes the standard TMPDIR environment variable as well as a specific HEAVYAI_TMPDIR environment variable, the latter of which takes precedence. If you use neither the command-line argument nor one of the environment variables, the default, /tmp/, is used.
/tmp
ultra-secure-mode
Enables secure mode, which sets Access-Control-Allow-Origin headers to --secure-acao-uri and sets security headers such as X-Frame-Options, Content-Security-Policy, and Strict-Transport-Security.
-v | verbose
Enable verbose logging. Adds log messages for debugging purposes.
version
Return version.
Parameter
Description
Example
ldap-uri
LDAP server host or server URI.
ldap://myLdapServer.myCompany.com
ldap-dn
LDAP distinguished name (DN).
uid=$USERNAME,cn=users,cn=accounts, dc=myCompany,dc=com
ldap-role-query-url
Returns the role names a user belongs to in the LDAP.
ldap://myServer.myCompany.com/uid=$USERNAME, cn=users, cn=accounts,dc=myCompany,dc=com?memberOf
ldap-role-query-regex
Applies a regex filter to find matching roles from the roles in the LDAP server.
(MyCompany_.*?),
ldap-superuser-role
Identifies one of the filtered roles as a superuser role. If a user has this filtered ldap role, the user is marked as a superuser.
MyCompany_SuperUser
<file path> must be a path on the server. This command exports the results of any SELECT statement to the file. There is a special mode when <file path> is empty: in that case, the server automatically generates a file in <HEAVY.AI Directory>/export named after the client session ID, with the suffix .txt.
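For illustration, a minimal sketch of an export, assuming a hypothetical flights table and output path:
COPY (SELECT * FROM flights WHERE dep_delay > 60) TO '/tmp/delayed_flights.csv' WITH (header='true', delimiter=',');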
Available properties in the optional WITH clause are described in the following table.
Parameter
Description
Default Value
array_null_handling
Defines how to export arrays that have null elements:
'abort'
- Abort the export. Default.
'raw'
- Export null elements as raw values.
'zero'
- Export null elements as zero (or an empty string).
'nullfield'
- Set the entire array column field to null for that row.
Applies only to GeoJSON and GeoJSONL files.
'abort'
delimiter
A single-character string for the delimiter between column values; most commonly:
,
for CSV files
\t
for tab-delimited files
Other delimiters include |, ~, ^, and ;.
Applies to only CSV and tab-delimited files.
Note: HEAVY.AI does not use file extensions to determine the delimiter.
','
(CSV file)
escape
A single-character string for escaping quotes. Applies to only CSV and tab-delimited files.
'
(quote)
file_compression
File compression; can be one of the following:
'none'
'gzip'
'zip'
For GeoJSON and GeoJSONL files, using GZip results in a compressed single file with a .gz extension. No other compression options are currently available.
'none'
file_type
Type of file to export; can be one of the following:
'csv'
- Comma-separated values file.
'geojson'
- FeatureCollection GeoJSON file.
'geojsonl'
- Multiline GeoJSONL file.
'shapefile'
- Geospatial shapefile.
For all file types except CSV, exactly one geo column (POINT, LINESTRING, POLYGON or MULTIPOLYGON) must be projected in the query. CSV exports can contain zero or any number of geo columns, exported as WKT strings.
Export of array columns to shapefiles is not supported.
'csv'
header
Either 'true'
or 'false'
, indicating whether to output a header line for all the column names. Applies to only CSV and tab-delimited files.
'true'
layer_name
A layer name for the geo layer in the file. If unspecified, the stem of the given filename is used, without path or extension.
Applies to all file types except CSV.
Stem of the filename, if unspecified
line_delimiter
A single-character string for terminating each line. Applies to only CSV and tab-delimited files.
'\n'
nulls
A string pattern indicating that a field is NULL. Applies to only CSV and tab-delimited files.
An empty string, 'NA'
, or \N
quote
A single-character string for quoting a column value. Applies to only CSV and tab-delimited files.
"
(double quote)
quoted
Either 'true'
or 'false'
, indicating whether all the column values should be output in quotes. Applies to only CSV and tab-delimited files.
'true'
When using the COPY TO
command, you might encounter the following error:
To avoid this error, use the heavysql
command \cpu
to put your HEAVY.AI server in CPU mode before using the COPY TO
command. See Configuration.
This topic describes several ways to load data to HEAVY.AI using SQL commands.
If there is a potential for duplicate entries, and you want to avoid loading duplicate rows, see How can I avoid creating duplicate rows? on the Troubleshooting page.
If a source file uses a reserved word, HEAVY.AI automatically adds an underscore at the end of the reserved word. For example, year
is converted to year_
.
Use the following syntax for CSV and TSV files:
<file pattern>
must be local on the server. The file pattern can contain wildcards if you want to load multiple files. In addition to CSV, TSV, and TXT files, you can import compressed files in TAR, ZIP, 7-ZIP, RAR, GZIP, BZIP2, or TGZ format.
COPY FROM
appends data from the source into the target table. It does not truncate the table or overwrite existing data.
You can import client-side files (\copy
command in heavysql
) but it is significantly slower. For large files, HEAVY.AI recommends that you first scp
the file to the server, and then issue the COPY command.
HEAVY.AI supports Latin-1 ASCII format and UTF-8. If you want to load data with another encoding (for example, UTF-16), convert the data to UTF-8 before loading it to HEAVY.AI.
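For illustration, a minimal COPY FROM for delimited files might look like the following sketch (the tweets table and file path are hypothetical):
COPY tweets FROM '/data/tweets_*.csv' WITH (header='true', max_reject=1000);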
Available properties in the optional WITH clause are described in the following table.
Parameter
Description
Default Value
array_delimiter
A single-character string for the delimiter between input values contained within an array.
,
(comma)
array_marker
A two-character string consisting of the start and end characters surrounding an array.
{ }
(curly brackets). For example, data to be inserted into a table with a string array in the second column (for example, BOOLEAN, STRING[], INTEGER
) can be written as true,{value1,value2,value3},3
buffer_size
Size of the input file buffer, in bytes.
8388608
delimiter
A single-character string for the delimiter between input fields; most commonly:
,
for CSV files
\t
for tab-delimited files
Other delimiters include |, ~, ^, and ;.
Note: HEAVY.AI does not use file extensions to determine the delimiter.
','
(CSV file)
escape
A single-character string for escaping quotes.
'"'
(double quote)
geo
Import geo data. Deprecated and scheduled for removal in a future release.
'false'
header
Either 'true'
or 'false'
, indicating whether the input file has a header line in Line 1 that should be skipped.
'true'
line_delimiter
A single-character string for terminating each line.
'\n'
lonlat
In HEAVY.AI, POINT fields require longitude before latitude. Use this parameter based on the order of longitude and latitude in your source data.
'true'
max_reject
Number of records that the COPY statement allows to be rejected before terminating the COPY command. Records can be rejected for a number of reasons, including invalid content in a field, or an incorrect number of columns. The details of the rejected records are reported in the ERROR log. COPY returns a message identifying how many records are rejected. The records that are not rejected are inserted into the table, even if the COPY stops because the max_reject
count is reached.
Note: If you run the COPY command from Heavy Immerse, the COPY command does not return messages to Immerse once the SQL is verified. Immerse does not show messages about data loading, or about data-quality issues that result in max_reject
triggers.
100,000
nulls
A string pattern indicating that a field is NULL.
An empty string, 'NA'
, or \N
parquet
Import data in Parquet format. Parquet files can be compressed using Snappy. Other archives such as .gz or .zip must be unarchived before you import the data. Deprecated and scheduled for removal in a future release.
'false'
plain_text
Indicates that the input file is plain text so that it bypasses the libarchive
decompression utility.
CSV, TSV, and TXT are handled as plain text.
quote
A single-character string for quoting a field.
"
(double quote). All characters inside quotes are imported “as is,” except for line delimiters.
quoted
Either 'true'
or 'false'
, indicating whether the input file contains quoted fields.
'true'
source_srid
When importing into GEOMETRY(*, 4326) columns, specifies the SRID of the incoming geometries, all of which are transformed on the fly. For example, to import from a file that contains EPSG:2263 (NAD83 / New York Long Island) geometries, run the COPY command and include WITH (source_srid=2263). Data targeted at non-4326 geometry columns is not affected.
0
source_type='<type>'
Type can be one of the following:
delimited_file
- Import as CSV.
geo_file
- Import as Geo file. Use for shapefiles, GeoJSON, and other geo files. Equivalent to deprecated geo='true'
.
raster_file
- Import as a raster file.
parquet_file
- Import as a Parquet file. Equivalent to deprecated parquet='true'
.
delimited_file
threads
Number of threads for performing the data import.
Number of CPU cores on the system
trim_spaces
Indicate whether to trim side spaces ('true'
) or not ('false'
).
'false'
By default, the CSV parser assumes one row per line. To import a file with multiple lines in a single field, specify threads = 1
in the WITH
clause.
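A hedged example, assuming a hypothetical notes table whose text field spans multiple lines in the source file:
COPY notes FROM '/data/multiline_notes.csv' WITH (threads = 1);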
You can use COPY FROM
to import geo files. You can create the table based on the source file and then load the data:
You can also append data to an existing, predefined table:
Use the following syntax, depending on the file source.
Local server
COPY [tableName] FROM '/
filepath
' WITH (source_type='geo_file', ...)
;
Web site
COPY [tableName] FROM '[
http
_https_]://_website/filepath_' WITH (source_type='geo_file', ...);
Amazon S3
COPY [tableName] FROM 's3://
bucket/filepath
' WITH (source_type='geo_file', s3_region='
region
', s3_access_key='
accesskey
', s3_secret_key='
secretkey
', ... );
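For instance, a local geo import might look like the following sketch (table and file names are hypothetical):
COPY county_boundaries FROM '/data/counties.geojson' WITH (source_type='geo_file');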
If you are using COPY FROM
to load to an existing table, the field type must match the metadata of the source file. If it does not, COPY FROM
throws an error and does not load the data.
COPY FROM
appends data from the source into the target table. It does not truncate the table or overwrite existing data.
Supported DATE
formats when using COPY FROM
include mm/dd/yyyy
, dd-mmm-yy
, yyyy-mm-dd
, and dd/mmm/yyyy
.
COPY FROM
fails for records with latitude or longitude values that have more than 4 decimal places.
The following WITH
options are available for geo file imports from all sources.
geo_assign_render_groups
Enable or disable automatic render group assignment for polygon imports; can be true
or false
. If polygons are not needed for rendering, set this to false
to speed up import.
true
geo_coords_type
Coordinate type used; must be geography
.
N/A
geo_coords_encoding
Coordinates encoding; can be geoint(32)
or none
.
geoint(32)
geo_coords_srid
Coordinates spatial reference; must be 4326
(WGS84 longitude/latitude).
N/A
geo_explode_collections
Explodes MULTIPOLYGON, MULTILINESTRING, or MULTIPOINT geo data into multiple rows in a POLYGON, LINESTRING, or POINT column, with all other columns duplicated.
When importing from a WKT CSV with a MULTIPOLYGON column, the table must have been manually created with a POLYGON column.
When importing from a geo file, the table is automatically created with the correct type of column.
When the input column contains a mixture of MULTI and single geo, the MULTI geo are exploded, but the singles are imported normally. For example, a column containing five two-polygon MULTIPOLYGON rows and five POLYGON rows imports as a POLYGON column of fifteen rows.
false
Currently, a manually created geo table can have only one geo column. If it has more than one, import is not performed.
Any GDAL-supported file type can be imported. If it is not supported, GDAL throws an error.
An ESRI file geodatabase can have multiple layers, and importing it results in the creation of one table for each layer in the file. This behavior differs from that of importing shapefiles, GeoJSON, or KML files, which results in a single table. For more information, see Importing an ESRI File Geodatabase.
The first compatible file in the bundle is loaded; subfolders are traversed until a compatible file is found. The rest of the contents in the bundle are ignored. If the bundle contains multiple filesets, unpack the file manually and specify it for import.
For more information about importing specific geo file formats, see Importing Geospatial Files.
CSV files containing WKT strings are not considered geo files and should not be imported with the source_type='geo_file' option. When importing WKT strings from CSV files, you must create the table first. The geo column type and encoding are specified as part of the DDL. For example, for a polygon with no encoding, try the following:
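A minimal sketch, assuming a hypothetical zones table with one uncompressed POLYGON column:
CREATE TABLE zones (zone_name TEXT, zone_poly GEOMETRY(POLYGON, 4326) ENCODING NONE);
COPY zones FROM '/data/zones.csv' WITH (header='true');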
You can use COPY FROM
to import raster files supported by GDAL as one row per pixel, where a pixel may consist of one or more data bands, with optional corresponding pixel or world-space coordinate columns. This allows the data to be rendered as a point/symbol cloud that approximates a 2D image.
Use the same syntax that you would for geo files, depending on the file source.
The following WITH
options are available for raster file imports from all sources.
raster_import_bands='<bandname>[,<bandname>,...]'
Specifies which bands to import, as a comma-separated list of band names. Default: an empty string, which imports all bands from all datasets found in the file.
raster_point_transform='<transform>'
Specifies the processing for floating-point coordinate values:
auto
- Transform based on raster file type (world
for geo, none
for non-geo).
none
- No affine or world-space conversion. Values will be equivalent to the integer pixel coordinates.
file
- File-space affine transform only. Values will be in the file's coordinate system, if any (e.g. geospatial).
world
- World-space geospatial transform. Values will be projected to WGS84 lon/lat (if the file has a geospatial SRID).
auto
raster_point_type='<type>'
Specifies the required type for the additional pixel coordinate columns:
auto
- Create columns based on raster file type (double
for geo, int
or smallint
for non-geo, dependent on size).
none
- Do not create pixel coordinate columns.
smallint
or int
- Create integer columns named raster_x
and raster_y
and fill with the raw pixel coordinates from the file.
float
or double
- Create floating-point columns named raster_x
and raster_y
(or raster_lon
and raster_lat
) and fill with file-space or world-space projected coordinates.
point
- Create a POINT
column of name raster_point
and fill with file-space or world-space projected coordinates.
auto
Illegal combinations of raster_point_type
and raster_point_transform
are rejected. For example, world
transform can only be performed on raster files that have a geospatial coordinate system in their metadata, and cannot be performed if <type>
is an integer format (which cannot represent world-space coordinate values).
Any GDAL-supported file type can be imported. If it is not supported, GDAL throws an error.
HDF5 and possibly other GDAL drivers may not be thread-safe, so use WITH (threads=1)
when importing.
Archive file import (.zip, .tar, .tar.gz) is not currently supported for raster files.
Band and Column Names
The following raster file formats contain the metadata required to derive sensible names for the bands, which are then used for their corresponding columns:
GRIB2 - geospatial/meteorological format
OME TIFF - an OpenMicroscopy format
The band names from the file are sanitized (illegal characters and spaces removed) and de-duplicated (addition of a suffix in cases where the same band name is repeated within the file or across datasets).
For other formats, the columns are named band_1_1
, band_1_2
, and so on.
The sanitized and de-duplicated names must be used for the raster_import_bands
option.
Band and Column Data Types
Raster files can have bands in the following data types:
Signed or unsigned 8-, 16-, or 32-bit integer
32- or 64-bit floating point
Complex number formats (not supported)
Signed data is stored in the directly corresponding column type, as follows:
int8
-> TINYINT
int16
-> SMALLINT
int32
-> INT
float32
-> FLOAT
float64
-> DOUBLE
Unsigned integer column types are not currently supported, so any data of those types is converted to the next larger signed column type:
uint8
-> SMALLINT
uint16
-> INT
uint32
-> BIGINT
Column types cannot currently be overridden.
ODBC import is currently a beta feature.
You can use COPY FROM
to import data from a Relational Database Management System (RDBMS) or data warehouse using the Open Database Connectivity (ODBC) interface.
The following WITH options are available for ODBC import.
data_source_name
Data source name (DSN) configured in the odbc.ini file. Only one of data_source_name
or connection_string
can be specified.
connection_string
A set of semicolon-separated key=value pairs that define the connection parameters for an RDBMS. For example:
Driver=DriverName;Database=DatabaseName;Servername=HostName;Port=1234
Only one of data_source_name
or connection_string
can be specified.
sql_order_by
Comma-separated list of column names that provide a unique ordering for the result set returned by the specified SQL SELECT statement.
username
Username on the RDBMS. Applies only when data_source_name
is used.
password
Password credential for the RDBMS. This option applies only when data_source_name
is used.
credential_string
A set of semicolon-separated key=value pairs that define the access credential parameters for an RDBMS. For example:
Username=username;Password=password
Applies only when connection_string
is used.
Using a data source name:
Using a connection string:
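The sketches below are illustrative only; they assume that the remote SELECT statement is passed as the COPY FROM source string and that source_type='odbc' selects the ODBC path, and the table, DSN, server, and credential values are hypothetical:
COPY remote_orders FROM 'SELECT id, amount FROM orders' WITH (source_type='odbc', data_source_name='postgres_dsn', username='reader', password='secret', sql_order_by='id');
COPY remote_orders FROM 'SELECT id, amount FROM orders' WITH (source_type='odbc', connection_string='Driver=PostgreSQL;Database=sales;Servername=db.example.com;Port=5432', credential_string='Username=reader;Password=secret', sql_order_by='id');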
For information about using ODBC HeavyConnect, see ODBC Data Wrapper Reference.
These examples assume the following folder and file structure:
Local Parquet/CSV files can now be globbed by specifying either a path name with a wildcard or a folder name.
Globbing a folder recursively returns all files under the specified folder. For example,
COPY table_1 FROM ".../subdir";
returns file_3
, file_4
, file_5
.
Globbing with a wildcard returns any file paths matching the expanded file path. So
COPY table_1 FROM ".../subdir/file*";
returns file_3
, file_4
.
Does not apply to S3 cases, because file paths specified for S3 always use prefix matching.
Use file filtering to filter out unwanted files that have been globbed. To use filtering, specify the REGEX_PATH_FILTER option. Files not matching this pattern are not included on import. This behavior is consistent across local and S3 use cases.
The following regex expression:
COPY table_1 from ".../" WITH (REGEX_PATH_FILTER=".*file_[4-5]");
returns file_4
, file_5
.
Use the FILE_SORT_ORDER_BY
option to specify the order in which files are imported.
FILE_SORT_ORDER_BY Options
pathname
(default)
date_modified
regex
*
regex_date
*
regex_number
*
*FILE_SORT_REGEX option required
Using FILE_SORT_ORDER_BY
COPY table_1 from ".../" WITH (FILE_SORT_ORDER_BY="date_modified");
Using FILE_SORT_ORDER_BY with FILE_SORT_REGEX
Regex sort keys are formed by the concatenation of all capture groups from the FILE_SORT_REGEX
expression. Regex sort keys are strings but can be converted to dates or FLOAT64 with the appropriate FILE_SORT_ORDER_BY
option. File paths that do not match the provided capture groups or that cannot be converted to the appropriate date or FLOAT64 are treated as NULLs and sorted to the front in a deterministic order.
Multiple Capture Groups:
FILE_SORT_REGEX=".*/data_(.*)_(.*)_"
/root/dir/unmatchedFile
→ <NULL>
/root/dir/data_andrew_54321_
→ andrew54321
/root/dir2/data_brian_Josef_
→ brianJosef
Dates:
FILE_SORT_REGEX=".*data_(.*)
/root/data_222
→ <NULL> (invalid date conversion)
/root/data_2020-12-31
→ 2020-12-31
/root/dir/data_2021-01-01
→ 2021-01-01
Import:
COPY table_1 from ".../" WITH (FILE_SORT_ORDER_BY="regex", FILE_SORT_REGEX=".*file_(.)");
Limited filename globbing is supported for both geo and raster import. For example, to import a sequence of same-format GeoTIFF files into a single table, you can run the following:
COPY table FROM '/path/path/something_*.tiff' WITH (source_type='raster_file')
The files are imported in alphanumeric sort order, per regular glob rules, and all appended to the same table. This may fail if the files are not all of the same format (band count, names, and types).
For non-geo/raster files (CSV and Parquet), you can provide just the path to the directory OR a wildcard; for example:
/path/to/directory/
/path/to/directory
/path/to/directory/*
For geo/raster files, a wildcard is required, as shown in the last example.
SQLImporter is a Java utility run at the command line. It runs a SELECT statement on another database through JDBC and loads the result set into HeavyDB.
HEAVY.AI recommends that you use a service account with read-only permissions when accessing data from a remote database.
In release 4.6 and higher, the user ID (-u
) and password (-p
) flags are required. If your password includes a special character, you must escape the character using a backslash (\).
If the table does not exist in HeavyDB, SQLImporter
creates it. If the target table in HeavyDB does not match the SELECT statement metadata, SQLImporter
fails.
If the truncate flag is used, SQLImporter
truncates the table in HeavyDB before transferring the data. If the truncate flag is not used, SQLImporter
appends the results of the SQL statement to the target table in HeavyDB.
The -i
argument provides a path to an initialization file. Each line of the file is sent as a SQL statement to the remote database. You can use -i
to set additional custom parameters before the data is loaded.
The SQLImporter
string is case-sensitive. Incorrect case returns the following:
Error: Could not find or load main class com.mapd.utility.SQLimporter
You can migrate geo data types from a PostgreSQL database. The following table shows the correlation between PostgreSQL/PostGIS geo types and HEAVY.AI geo types.
point
point
lseg
linestring
linestring
linestring
polygon
polygon
multipolygon
multipolygon
Other PostgreSQL types, including circle, box, and path, are not supported.
By default, 100,000 records are selected from HeavyDB. To select a larger number of records, use the LIMIT statement.
Stream data into HeavyDB by attaching the StreamInsert program to the end of a data stream. The data stream can be another program printing to standard out, a Kafka endpoint, or any other real-time stream output. You can specify the appropriate batch size, according to the expected stream rates and your insert frequency. The target table must exist before you attempt to stream data into the table.
Setting
Default
Description
<table_name>
n/a
Name of the target table in HeavyDB
<database_name>
n/a
Name of the target database in HeavyDB
-u
n/a
User name
-p
n/a
User password
--host
n/a
Name of HEAVY.AI host
--delim
comma (,)
Field delimiter, in single quotes
--line
newline (\n)
Line delimiter, in single quotes
--batch
10000
Number of records in a batch
--retry_count
10
Number of attempts before job fails
--retry_wait
5
Wait time in seconds after server connection failure
--null
n/a
String that represents null values
--port
6274
Port number for HeavyDB on localhost
-t | --transform
n/a
Regex transformation
--print_error
False
Print error messages
--print_transform
False
Print description of transform.
--help
n/a
List options
For more information on creating regex transformation statements, see RegEx Replace.
You can use the SQL COPY FROM
statement to import files stored on Amazon Web Services Simple Storage Service (AWS S3) into an HEAVY.AI table, in much the same way you would with local files. In the WITH
clause, specify the S3 credentials and region information of the bucket accessed.
Access key and secret key, or session token if using temporary credentials, and region are required. For information about AWS S3 credentials, see https://docs.aws.amazon.com/general/latest/gr/aws-sec-cred-types.html#access-keys-and-secret-access-keys.
HEAVY.AI does not support the use of asterisks (*) in URL strings to import items. To import multiple files, pass in an S3 path instead of a file name, and COPY FROM
imports all items in that path and any subpath.
HEAVY.AI supports custom S3 endpoints, which allows you to import data from S3-compatible services, such as Google Cloud Storage.
To use custom S3 endpoints, add s3_endpoint
to the WITH
clause of a COPY FROM
statement; for example, to set the S3 endpoint to point to Google Cloud Services:
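A hedged sketch (the bucket name and credentials are placeholders; storage.googleapis.com is the Google Cloud Storage interoperability endpoint):
COPY trips FROM 's3://my-gcs-bucket/trips/' WITH (s3_endpoint='storage.googleapis.com', s3_region='us-east-1', s3_access_key='<HMAC access key>', s3_secret_key='<HMAC secret>');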
For information about interoperability and setup for Google Cloud Services, see Cloud Storage Interoperability.
You can also configure custom S3 endpoints by passing the s3_endpoint
field to Thrift import_table
.
The following examples show failed and successful attempts to copy the table trips from AWS S3.
The following example imports all the files in the trip.compressed
directory.
The table trips
is created with the following statement:
You can configure the HEAVY.AI server to provide AWS credentials, which allows S3 queries to be run without specifying AWS credentials. S3 regions are not configured by the server and must be passed in either as a client-side environment variable or as an option with the request.
Example Commands
\detect
:
$ export AWS_REGION=us-west-1
heavysql > \detect <s3-bucket-uri>
import_table
:
$ ./Heavyai-remote -h localhost:6274 import_table "'<session-id>'" "<table-name>" '<s3-bucket-uri>' 'TCopyParams(s3_region="'us-west-1'")'
COPY FROM
:
heavysql > COPY <table-name> FROM <s3-bucket-uri> WITH(s3_region='us-west-1');
Enable server privileges in the server configuration file heavy.conf
allow-s3-server-privileges = true
For bare metal installations, set the following environment variables and restart the HeavyDB service:
AWS_ACCESS_KEY_ID=xxx
AWS_SECRET_ACCESS_KEY=xxx
AWS_SESSION_TOKEN=xxx
(required only for AWS STS credentials)
For HeavyDB docker images, start a new container mounted with the configuration file using the option:
-v <dirname-containing-heavy.conf>:/var/lib/heavyai
and set the following environment options:
-e AWS_ACCESS_KEY_ID=xxx
-e AWS_SECRET_ACCESS_KEY=xxx
-e AWS_SESSION_TOKEN=xxx
(required only for AWS STS credentials)
Enable server privileges in the server configuration file heavy.conf
allow-s3-server-privileges = true
For bare metal installations, specify a shared AWS credentials file and profile with the following environment variables, and restart the HeavyDB service:
AWS_SHARED_CREDENTIALS_FILE=~/.aws/credentials
AWS_PROFILE=default
For HeavyDB docker images, start a new container mounted with the configuration file and AWS shared credentials file using the following options:
-v <dirname-containing-/heavy.conf>:/var/lib/heavyai
-v <dirname-containing-/credentials>:/<container-credential-path>
and set the following environment options:
-e AWS_SHARED_CREDENTIALS_FILE=<container-credential-path>
-e AWS_PROFILE=<active-profile>
Prerequisites
An IAM Policy that has sufficient access to the S3 bucket.
An IAM AWS Service Role of type Amazon EC2
, which is assigned the IAM Policy from (1).
Setting Up an EC2 Instance with Roles
For a new EC2 Instance:
AWS Management Console > Services > Compute > EC2 > Launch Instance.
Select desired Amazon Machine Image (AMI) > Select.
Select desired Instance Type > Next: Configure Instance Details.
IAM Role > Select desired IAM Role > Review and Launch.
Review other options > Launch.
For an existing EC2 Instance:
AWS Management Console > Services > Compute > EC2 > Instances.
Mark desired instance(s) > Actions > Security > Modify IAM Role.
Select desired IAM Role > Save.
Restart the EC2 Instance.
You can ingest data from an existing Kafka producer to an existing table in HEAVY.AI using KafkaImporter
on the command line:
KafkaImporter
requires a functioning Kafka cluster. See the Kafka website and the Confluent schema registry documentation.
Setting
Default
Description
<table_name>
n/a
Name of the target table in HeavyDB
<database_name>
n/a
Name of the target database in HeavyDB
-u <username>
n/a
User name
-p <password>
n/a
User password
--host <hostname>
localhost
Name of HEAVY.AI host
--port <port_number>
6274
Port number for HeavyDB on localhost
--http
n/a
Use HTTP transport
--https
n/a
Use HTTPS transport
--skip-verify
n/a
Do not verify validity of SSL certificate
--ca-cert <path>
n/a
Path to the trusted server certificate; initiates an encrypted connection
--delim <delimiter>
comma (,)
Field delimiter, in single quotes
--line <delimiter>
newline (\n)
Line delimiter, in single quotes
--batch <batch_size>
10000
Number of records in a batch
--retry_count <retry_number>
10
Number of attempts before job fails
--retry_wait <seconds>
5
Wait time in seconds after server connection failure
--null <string>
n/a
String that represents null values
--quoted <boolean>
false
Whether the source contains quoted fields
-t | --transform
n/a
Regex transformation
--print_error
false
Print error messages
--print_transform
false
Print description of transform
--help
n/a
List options
--group-id <id>
n/a
Kafka group ID
--topic <topic>
n/a
The Kafka topic to be ingested
--brokers <broker_name:broker_port>
localhost:9092
One or more brokers
KafkaImporter Logging Options
Setting
Default
Description
--log-directory <directory>
mapd_log
Logging directory; can be relative to data directory or absolute
--log-file-name <filename>
n/a
Log filename relative to logging directory; has format KafkaImporter.{SEVERITY}.%Y%m%d-%H%M%S.log
--log-symlink <symlink>
n/a
Symlink to active log; has format KafkaImporter.{SEVERITY}
--log-severity <level>
INFO
Log-to-file severity level: INFO, WARNING, ERROR, or FATAL
--log-severity-clog <level>
ERROR
Log-to-console severity level: INFO, WARNING, ERROR, or FATAL
--log-channels
n/a
Log channel debug info
--log-auto-flush
n/a
Flush logging buffer to file after each message
--log-max-files <files_number>
100
Maximum number of log files to keep
--log-min-free-space <bytes>
20,971,520
Minimum number of bytes available on the device before oldest log files are deleted
--log-rotate-daily
1
Start new log files at midnight
--log-rotation-size <bytes>
10485760
Maximum file size, in bytes, before new log files are created
Configure KafkaImporter
to use your target table. KafkaImporter
listens to a pre-defined Kafka topic associated with your table. You must create the table before using the KafkaImporter
utility. For example, you might have a table named customer_site_visit_events
that listens to a topic named customer_site_visit_events_topic
.
The data format must be a record-level format supported by HEAVY.AI.
KafkaImporter
listens to the topic, validates records against the target schema, and ingests topic batches of your designated size to the target table. Rejected records use the existing reject reporting mechanism. You can start, shut down, and configure KafkaImporter
independent of the HeavyDB engine. If KafkaImporter is running and the database shuts down, KafkaImporter shuts down as well. Reads from the topic are nondestructive.
KafkaImporter
is not responsible for event ordering; a streaming platform outside HEAVY.AI (for example, Spark Streaming or Flink) should handle the stream processing. HEAVY.AI ingests the end-state stream of post-processed events.
KafkaImporter
does not handle dynamic schema creation on first ingest, but must be configured with a specific target table (and its schema) as the basis. There is a 1:1 correspondence between target table and topic.
StreamImporter is an updated version of the StreamInsert utility used for streaming reads from delimited files into HeavyDB. StreamImporter uses a binary columnar load path, providing improved performance compared to StreamInsert.
You can ingest data from a data stream to an existing table in HEAVY.AI using StreamImporter
on the command line.
Setting
Default
Description
<table_name>
n/a
Name of the target table in HeavyDB
<database_name>
n/a
Name of the target database in HeavyDB
-u <username>
n/a
User name
-p <password>
n/a
User password
--host <hostname>
n/a
Name of HEAVY.AI host
--port <port>
6274
Port number for HeavyDB on localhost
--http
n/a
Use HTTP transport
--https
n/a
Use HTTPS transport
--skip-verify
n/a
Do not verify validity of SSL certificate
--ca-cert <path>
n/a
Path to the trusted server certificate; initiates an encrypted connection
--delim <delimiter>
comma (,)
Field delimiter, in single quotes
--null <string>
n/a
String that represents null values
--line <delimiter>
newline (\n)
Line delimiter, in single quotes
--quoted <boolean>
true
Either true
or false
, indicating whether the input file contains quoted fields.
--batch <number>
10000
Number of records in a batch
--retry_count <retry_number>
10
Number of attempts before job fails
--retry_wait <seconds>
5
Wait time in seconds after server connection failure
-t | --transform
n/a
Regex transformation
--print_error
false
Print error messages
--print_transform
false
Print description of transform
--help
n/a
List options
Setting
Default
Description
--log-directory <directory>
mapd_log
Logging directory; can be relative to data directory or absolute
--log-file-name <filename>
n/a
Log filename relative to logging directory; has format StreamImporter.{SEVERITY}.%Y%m%d-%H%M%S.log
--log-symlink <symlink>
n/a
Symlink to active log; has format StreamImporter.{SEVERITY}
--log-severity <level>
INFO
Log-to-file severity level: INFO, WARNING, ERROR, or FATAL
--log-severity-clog <level>
ERROR
Log-to-console severity level: INFO, WARNING, ERROR, or FATAL
--log-channels
n/a
Log channel debug info
--log-auto-flush
n/a
Flush logging buffer to file after each message
--log-max-files <files_number>
100
Maximum number of log files to keep
--log-min-free-space <bytes>
20,971,520
Minimum number of bytes available on the device before oldest log files are deleted
--log-rotate-daily
1
Start new log files at midnight
--log-rotation-size <bytes>
10485760
Maximum file size, in bytes, before new log files are created
Configure StreamImporter
to use your target table. StreamImporter
listens to a pre-defined data stream associated with your table. You must create the table before using the StreamImporter
utility.
The data format must be a record-level format supported by HEAVY.AI.
StreamImporter
listens to the stream, validates records against the target schema, and ingests batches of your designated size to the target table. Rejected records use the existing reject reporting mechanism. You can start, shut down, and configure StreamImporter
independent of the HeavyDB engine. If StreamImporter is running but the database shuts down, StreamImporter shuts down as well. Reads from the stream are non-destructive.
StreamImporter
is not responsible for event ordering; a first-class streaming platform outside HEAVY.AI (for example, Spark Streaming or Flink) should handle the stream processing. HEAVY.AI ingests the end-state stream of post-processed events.
StreamImporter
does not handle dynamic schema creation on first ingest, but must be configured with a specific target table (and its schema) as the basis.
There is a 1:1 correspondence between the target table and the stream.
You can consume a CSV or Parquet file residing in HDFS (Hadoop Distributed File System) into HeavyDB.
Copy the HEAVY.AI JDBC driver into the Apache Sqoop library, normally found at /usr/lib/sqoop/lib/.
The following is a straightforward import command. For more information on options and parameters for using Apache Sqoop, see the user guide at sqoop.apache.org.
The --connect
parameter is the address of a valid JDBC port on your HEAVY.AI instance.
To detect duplication prior to loading data into HeavyDB, you can perform the following steps. For this example, the files are labeled A,B,C...Z.
Load file A into table MYTABLE
.
Run a duplicate-check query (a hedged example follows these steps). No rows should be returned; if rows are returned, your first A file is not unique.
Load file B into table TEMPTABLE
.
Run a cross-check query between TEMPTABLE and MYTABLE (a hedged example follows these steps). No rows should be returned if file B is unique; if rows are returned, use the details from the selection to fix file B.
Load the fixed B file into MYTABLE
.
Drop table TEMPTABLE
.
Repeat steps 3-6 for each remaining file in the set before loading the data into the real MYTABLE instance.
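A minimal sketch of the two checks, assuming the files share a unique key column named uniqueCol (the column name is hypothetical):
-- Step 2: verify that file A contains no duplicate keys within MYTABLE
SELECT uniqueCol, COUNT(*) FROM MYTABLE GROUP BY uniqueCol HAVING COUNT(*) > 1;
-- Step 4: verify that file B (loaded into TEMPTABLE) adds no keys already present in MYTABLE
SELECT t.uniqueCol FROM TEMPTABLE t JOIN MYTABLE m ON t.uniqueCol = m.uniqueCol;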
DDL - Tables
These functions are used to create and modify data tables in HEAVY.AI.
Table names must use the NAME format, described in regex notation as:
Table and column names can include quotes, spaces, and the underscore character. Other special characters are permitted if the name of the table or column is enclosed in double quotes (" ").
Spaces and special characters other than underscore (_) cannot be used in Heavy Immerse.
Column and table names enclosed in double quotes cannot be used in Heavy Immerse.
Create a table named <table>
specifying <columns>
and table properties.
Datatype
Size (bytes)
Notes
BIGINT
8
Minimum value: -9,223,372,036,854,775,807
; maximum value: 9,223,372,036,854,775,807
.
BOOLEAN
1
TRUE: 'true'
, '1'
, 't'
. FALSE: 'false'
, '0'
, 'f'
. Text values are not case-sensitive.
DATE
*
4
Same as DATE ENCODING DAYS(32)
.
DATE ENCODING DAYS(32)
4
Range in years: +/-5,883,517
around epoch. Maximum date January 1, 5885487 (approximately). Minimum value: -2,147,483,648
; maximum value: 2,147,483,647
. Supported formats when using COPY FROM
: mm/dd/yyyy
, dd-mmm-yy
, yyyy-mm-dd
, dd/mmm/yyyy
.
DATE ENCODING DAYS(16)
2
Range in days: -32,768
- 32,767
Range in years: +/-90
around epoch, April 14, 1880 - September 9, 2059.
Minimum value: -2,831,155,200
; maximum value: 2,831,068,800
.
Supported formats when using COPY FROM
: mm/dd/yyyy
, dd-mmm-yy
, yyyy-mm-dd
, dd/mmm/yyyy
.
DATE ENCODING FIXED(32)
4
In DDL statements defaults to DATE ENCODING DAYS(16)
. Deprecated.
DATE ENCODING FIXED(16)
2
In DDL statements defaults to DATE ENCODING DAYS(16)
. Deprecated.
DECIMAL
2, 4, or 8
Takes precision and scale parameters: DECIMAL(precision,scale)
.
Size depends on precision:
Up to 4
: 2 bytes
5
to 9
: 4 bytes
10
to 18
(maximum): 8 bytes
Scale must be less than precision.
DOUBLE
8
Variable precision. Minimum value: -1.79 x e^308
; maximum value: 1.79 x e^308
.
FLOAT
4
Variable precision. Minimum value: -3.4 x e^38
; maximum value: 3.4 x e^38
.
INTEGER
4
Minimum value: -2,147,483,647
; maximum value: 2,147,483,647
.
SMALLINT
2
Minimum value: -32,767
; maximum value: 32,767
.
TEXT ENCODING DICT
4
Max cardinality 2 billion distinct string values
TEXT ENCODING NONE
Variable
Size of the string + 6 bytes
TIME
8
Minimum value: 00:00:00
; maximum value: 23:59:59
.
TIMESTAMP
8
Linux timestamp from -30610224000
(1/1/1000 00:00:00.000
) through 29379542399
(12/31/2900 23:59:59.999
).
Can also be inserted and stored in human-readable format:
YYYY-MM-DD HH:MM:SS
YYYY-MM-DDTHH:MM:SS
(The T
is dropped when the field is populated.)
TINYINT
1
Minimum value: -127
; maximum value: 127
.
* In OmniSci release 4.4.0 and higher, you can use existing 8-byte DATE
columns, but you can create only 4-byte DATE
columns (default) and 2-byte DATE
columns (see DATE ENCODING FIXED(16)
).
For more information, see Datatypes and Fixed Encoding.
For geospatial datatypes, see Geospatial Primitives.
Create a table named tweets
and specify the columns, including type, in the table.
Create a table named delta and assign a default value San Francisco
to column city.
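A minimal sketch of the delta example, assuming an additional hypothetical id column:
CREATE TABLE delta (id INTEGER, city TEXT DEFAULT 'San Francisco');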
Default values currently have the following limitations:
Only literals can be used for column DEFAULT values; expressions are not supported.
You cannot define a DEFAULT value for a shard key. For example, the following does not parse: CREATE TABLE tbl (id INTEGER NOT NULL DEFAULT 0, name TEXT, shard key (id)) with (shard_count = 2);
For arrays, use the following syntax: ARRAY[A, B, C, ..., N]. The syntax {A, B, C, ..., N} is not supported.
Some literals, like NUMERIC and GEO types, are not checked at parse time. As a result, you can define and create a table with malformed literal as a default value, but when you try to insert a row with a default value, it will throw an error.
Encoding
Descriptions
DICT
Dictionary encoding on string columns (default for TEXT
columns). Limit of 2 billion unique string values.
FIXED
(bits)
NONE
No encoding. Valid only on TEXT
columns. No Dictionary is created. Aggregate operations are not possible on this column type.
Property
Description
fragment_size
Number of rows per fragment that is a unit of the table for query processing. Default: 32 million rows, which is not expected to be changed.
max_rollback_epochs
Limit the number of epochs a table can be rolled back to. Limiting the number of epochs helps to limit the amount of on-disk data and prevent unmanaged data growth.
Limiting the number of rollback epochs also can increase system startup speed, especially for systems on which data is added in small batches or singleton inserts. Default: 3.
The following example creates the table test_table
and sets the maximum epoch rollback number to 50:
CREATE TABLE test_table(a int) WITH (MAX_ROLLBACK_EPOCHS = 50);
max_rows
Used primarily for streaming datasets to limit the number of rows in a table, to avoid running out of memory or impeding performance. When the max_rows
limit is reached, the oldest fragment is removed. When populating a table from a file, make sure that your row count is below the max_rows
setting. If you attempt to load more rows at one time than the max_rows
setting defines, the records up to the max_rows
limit are removed, leaving only the additional rows. Default: 2^62.
In a distributed system, the maximum number of rows is calculated as max_rows * leaf_count
. In a sharded distributed system, the maximum number of rows is calculated as max_rows * shard_count
.
page_size
Number of I/O page bytes. Default: 1MB, which does not need to be changed.
partitions
Partition strategy option:
SHARDED
: Partition table using sharding.
REPLICATED
: Partition table using replication.
shard_count
Number of shards to create, typically equal to the number of GPUs across which the data table is distributed.
sort_column
Name of the column on which to sort during bulk import.
Sharding partitions a database table across multiple servers so each server has a part of the table with the same columns but with different rows. Partitioning is based on a sharding key defined when you create the table.
Without sharding, the dimension tables involved in a join are replicated and sent to each GPU, which is not feasible for dimension tables with many rows. Specifying a shard key makes it possible for the query to execute efficiently on large dimension tables.
Currently, specifying a shard key is useful for joins, only:
If two tables specify a shard key with the same type and the same number of shards, a join on that key only sends a part of the dimension table column data to each GPU.
For multi-node installs, the dimension table does not need to be replicated and the join executes locally on each leaf.
A shard key must specify a single column to shard on. There is no support for sharding by a combination of keys.
One shard key can be specified for a table.
Data are partitioned according to the shard key and the number of shards (shard_count
).
A value in the column specified as a shard key is always sent to the same partition.
The number of shards should be equal to the number of GPUs in the cluster.
Sharding is allowed on the following column types:
DATE
INT
TEXT ENCODING DICT
TIME
TIMESTAMP
Tables must share the dictionary for the column to be involved in sharded joins. If the dictionary is not specified as shared, the join does not take advantage of sharding. Dictionaries are reference-counted and only dropped when the last reference drops.
Set shard_count
to the number of GPUs you eventually want to distribute the data table across.
Referenced tables must also be shard_count
-aligned.
Sharding should be minimized because it can introduce load skew across resources, compared to when sharding is not used.
Examples
Basic sharding:
Sharding with shared dictionary:
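Hedged sketches of both forms, assuming hypothetical customers and transactions tables sharded on a shared account_id key:
CREATE TABLE customers (account_id TEXT, name TEXT, SHARD KEY (account_id)) WITH (shard_count = 4);
CREATE TABLE transactions (account_id TEXT, amount DOUBLE, SHARD KEY (account_id), SHARED DICTIONARY (account_id) REFERENCES customers(account_id)) WITH (shard_count = 4);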
Using the TEMPORARY argument creates a table that persists only while the server is live. Temporary tables are useful for storing intermediate result sets that you access more than once.
Adding or dropping a column from a temporary table is not supported.
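A minimal sketch, with a hypothetical table name and columns:
CREATE TEMPORARY TABLE session_results (query_id INTEGER, result_count BIGINT);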
Create a table with the specified columns, copying any data that meet SELECT statement criteria.
Property
Description
fragment_size
Number of rows per fragment that is a unit of the table for query processing. Default = 32 million rows, which is not expected to be changed.
max_chunk_size
Size of chunk that is a unit of the table for query processing. Default: 1073741824 bytes (1 GB), which is not expected to be changed.
max_rows
Used primarily for streaming datasets to limit the number of rows in a table. When the max_rows
limit is reached, the oldest fragment is removed. When populating a table from a file, make sure that your row count is below the max_rows
setting. If you attempt to load more rows at one time than the max_rows
setting defines, the records up to the max_rows
limit are removed, leaving only the additional rows. Default = 2^62.
page_size
Number of I/O page bytes. Default = 1MB, which does not need to be changed.
partitions
Partition strategy option:
SHARDED
: Partition table using sharding.
REPLICATED
: Partition table using replication.
use_shared_dictionaries
Controls whether the created table creates its own dictionaries for text columns, or instead shares the dictionaries of its source table. Uses shared dictionaries by default (true
), which increases the speed of table creation.
Setting it to false shrinks the dictionaries if the SELECT statement for the created table has a narrow filter; for example:
CREATE TABLE new_table AS SELECT * FROM old_table WITH (USE_SHARED_DICTIONARIES='false');
vacuum
Formats the table to more efficiently handle DELETE
requests. The only parameter available is delayed
. Rather than immediately remove deleted rows, vacuum marks items to be deleted, and they are removed at an optimal time.
Create the table newTable
. Populate the table with all information from the table oldTable
, effectively creating a duplicate of the original table.
Create a table named trousers
. Populate it with data from the columns name
, waist
, and inseam
from the table wardrobe
.
Create a table named cosmos
. Populate it with data from the columns star
and planet
from the table universe where planet has the class M.
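Hedged sketches of the three examples above (the class column name in the last statement is an assumption):
CREATE TABLE newTable AS (SELECT * FROM oldTable);
CREATE TABLE trousers AS (SELECT name, waist, inseam FROM wardrobe);
CREATE TABLE cosmos AS (SELECT star, planet FROM universe WHERE class = 'M');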
Rename the table tweets to retweets.
Rename the column source to device in the table retweets.
Add the column pt_dropoff to table tweets with a default value point(0,0).
Add multiple columns a, b, and c to table table_one with a default value of 15
for column b.
Default values currently have the following limitations:
Only literals can be used for column DEFAULT values; expressions are not supported.
For arrays, use the following syntax: ARRAY[A, B, C, ..., N]. The syntax {A, B, C, ..., N} is not supported.
Some literals, like NUMERIC and GEO types, are not checked at parse time. As a result, you can define and create a table with a malformed literal as a default value, but when you try to insert a row with a default value, it throws an error.
Add the column lang to the table tweets using a TEXT ENCODING DICTIONARY.
Add the columns lang and encode to the table tweets using a TEXT ENCODING DICTIONARY for each.
Drop the column pt_dropoff from table tweets.
Limit on-disk data growth by setting the number of allowed epoch rollbacks to 50:
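Hedged sketches of several of the ALTER TABLE examples above (the geo-column example is omitted, and the dictionary width is an assumption):
ALTER TABLE tweets RENAME TO retweets;
ALTER TABLE retweets RENAME COLUMN source TO device;
ALTER TABLE table_one ADD (a INTEGER, b INTEGER DEFAULT 15, c TEXT);
ALTER TABLE tweets ADD COLUMN lang TEXT ENCODING DICT(32);
ALTER TABLE tweets DROP COLUMN pt_dropoff;
ALTER TABLE tweets SET MAX_ROLLBACK_EPOCHS = 50;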
You cannot add a dictionary-encoded string column with a shared dictionary when using ALTER TABLE ADD COLUMN.
Currently, HEAVY.AI does not support adding a geo column type (POINT, LINESTRING, POLYGON, or MULTIPOLYGON) to a table.
HEAVY.AI supports ALTER TABLE RENAME TABLE and ALTER TABLE RENAME COLUMN for temporary tables. HEAVY.AI does not support ALTER TABLE ADD COLUMN to modify a temporary table.
Deletes the table structure, all data from the table, and any dictionary content unless it is a shared dictionary. (See the Note regarding disk space reclamation.)
Archives data and dictionary files of the table <table>
to file <filepath>
.
Valid values for <compression_program>
include:
gzip (default)
pigz
lz4
none
If you do not choose a compression option, the system uses gzip if it is available. If gzip is not installed, the file is not compressed.
The file path must be enclosed in single quotes.
Dumping a table locks writes to that table. Concurrent reads are supported, but you cannot import to a table that is being dumped.
The DUMP
command is not supported on distributed configurations.
You must have at least the GRANT CREATE ON DATABASE privilege level to use the DUMP
command.
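A hedged example (the table name and archive path are illustrative):
DUMP TABLE tweets TO '/opt/archive/tweetsBackup.gz' WITH (compression='gzip');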
Rename a table or multiple tables at once.
Rename a single table:
Swap table names:
Swap table names multiple times:
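Hedged sketches of the rename and swap forms (table names are hypothetical):
RENAME TABLE table_A TO table_B;
RENAME TABLE table_A TO table_B, table_B TO table_A;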
Restores data and dictionary files of table <table>
from the file at <filepath>
. If you specified a compression program when you used the DUMP TABLE
command, you must specify the same compression method during RESTORE
.
Restoring a table decompresses and then reimports the table. You must have enough disk space for both the new table and the archived table, as well as enough scratch space to decompress the archive and reimport it.
The file path must be enclosed in single quotes.
You can also restore a table from archives stored in S3-compatible endpoints:
s3_region
is required. All features discussed in the S3 import documentation, such as custom S3 endpoints and server privileges, are supported.
Restoring a table locks writes to that table. Concurrent reads are supported, but you cannot import to a table that is being restored.
The RESTORE
command is not supported on distributed configurations.
You must have at least the GRANT CREATE ON DATABASE privilege level to use the RESTORE
command.
Do not attempt to use RESTORE TABLE with a table dump created using a release of HEAVY.AI that is higher than the release running on the server where you will restore the table.
Restore table tweets
from /opt/archive/tweetsBackup.gz:
Restore table tweets
from a public S3 file or using server privileges (with the allow-s3-server-privileges
server flag enabled):
Restore table tweets
from a private S3 file using AWS access keys:
Restore table tweets
from a private S3 file using temporary AWS access keys/session token:
Restore table tweets
from an S3-compatible endpoint:
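Hedged sketches of the local and S3 cases (paths, bucket, and region are placeholders; credential options would follow the same pattern as S3 import):
RESTORE TABLE tweets FROM '/opt/archive/tweetsBackup.gz' WITH (compression='gzip');
RESTORE TABLE tweets FROM 's3://my-bucket/archive/tweetsBackup.gz' WITH (compression='gzip', s3_region='us-west-1');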
Use the TRUNCATE TABLE
statement to remove all rows from a table without deleting the table structure.
This releases table on-disk and memory storage and removes dictionary content unless it is a shared dictionary. (See the note regarding disk space reclamation.)
Removing rows is more efficient than using DROP TABLE. Dropping followed by recreating the table invalidates dependent objects of the table requiring you to regrant object privileges. Truncating has none of these effects.
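For example, using the tweets table from earlier examples:
TRUNCATE TABLE tweets;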
When you DROP or TRUNCATE, the command returns almost immediately. The directories to be purged are marked with the suffix _DELETE_ME_. The files are automatically removed asynchronously.
In practical terms, this means that you will not see a reduction in disk usage until the automatic task runs, which might not start for up to five minutes.
You might also see directory names appended with _DELETE_ME_. You can ignore these, with the expectation that they will be deleted automatically over time.
Use this statement to remove rows from storage that have been marked as deleted via DELETE
statements.
When run without the vacuum option, the column-level metadata is recomputed for each column in the specified table. HeavyDB makes heavy use of metadata to optimize query plans, so optimizing table metadata can increase query performance after metadata widening operations such as updates or deletes. If the configuration parameter enable-auto-metadata-update
is not set, HeavyDB does not narrow metadata during an update or delete — metadata is only widened to cover a new range.
When run with the vacuum option, it removes any rows marked "deleted" from the data stored on disk. Vacuum is a checkpointing operation, so new copies of any vacuum records are deleted. Using OPTIMIZE with the VACUUM option compacts pages and deletes unused data files that have not been repopulated.
Beginning with Release 5.6.0, OPTIMIZE should be used infrequently, because UPDATE, DELETE, and IMPORT queries manage space more effectively.
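A hedged example, reusing the hypothetical tweets table:
OPTIMIZE TABLE tweets WITH (VACUUM = 'true');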
Performs checks for negative and inconsistent epochs across table shards for single-node configurations.
If VALIDATE
detects epoch-related issues, it returns a report similar to the following:
If no issues are detected, it reports as follows:
Perform checks and report discovered issues on a running HEAVY.AI cluster. Compare metadata between the aggregator and leaves to verify that the logical components between the processes are identical.
VALIDATE CLUSTER
also detects and reports issues related to table epochs. It reports when epochs are negative or when table epochs across leaf nodes or shards are inconsistent.
If VALIDATE CLUSTER
detects issues, it returns a report similar to the following:
If no issues are detected, it will report as follows:
You can include the WITH(REPAIR_TYPE)
argument. (REPAIR_TYPE='NONE')
is the same as running the command with no argument. (REPAIR_TYPE='REMOVE')
removes any leaf objects that have issues. For example:
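A sketch of the repair form described above:
VALIDATE CLUSTER WITH (REPAIR_TYPE = 'REMOVE');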
This example output from the VALIDATE CLUSTER
command on a distributed setup shows epoch-related issues:
Heavy Immerse supports file upload for .csv, .tsv, and .txt files, and supports comma, tab, and pipe delimiters.
Heavy Immerse also supports upload of compressed delimited files in TAR, ZIP, 7-ZIP, RAR, GZIP, BZIP2, or TGZ format.
You can import data to HeavyDB using the Immerse import wizard. You can upload data from a local delimited file, from an Amazon S3 data source, or from the Data Catalog.
For methods specific to geospatial data, see also Importing Geospatial Data Using Immerse.
If there is a potential for duplicate entries, and you prefer to avoid loading duplicate rows, see How can I avoid creating duplicate rows?.
If a source file uses a reserved word, HEAVY.AI automatically adds an underscore at the end of the reserved word. For example, year
is converted to year_
.
If you click the Back button (or accidentally two-finger swipe your mousepad) before your data load is complete, HeavyDB stops the data load and any records that had transferred are invalidated.
Follow these steps to import your data:
Click DATA MANAGER.
Click Import Data.
Click Import data from a local file.
Either click the plus sign (+) or drag your file(s) for upload. If you are uploading multiple files, the column names and data types must match. HEAVY.AI supports only delimiter-separated formats such as CSV and TSV. HEAVY.AI supports Latin-1 ASCII format and UTF-8. If you want to load data with another encoding (for example, UTF-16), convert the data to UTF-8 before loading it to HEAVY.AI. In addition to CSV, TSV, and TXT files, you can import compressed delimited files in TAR, ZIP, 7-ZIP, RAR, GZIP, BZIP2, or TGZ format.
Choose Import Settings:
Null string: If, instead of using a blank for null cells in your upload document, you have substituted strings such as NULL, enter that string in the Null String field. The values are treated as null values on upload.
Delimiter Type: Delimiters are detected automatically. You can choose a specific delimiter, such as a comma, tab, or pipe.
Quoted String: Indicate whether your string fields are enclosed by quotes. Delimiter characters inside quotes are ignored.
Includes Header Row: HEAVY.AI tries to infer whether the first row contains headers or data (for example, if the first row has only strings and the rest of the table contains number values, the first row is inferred to be headers). If HEAVY.AI infers incorrectly, you have the option of manually indicating whether or not the first row contains headers.
Replicate Table: If you are importing non-geospatial data to a distributed database with more than one node, select this checkbox to replicate the table to all nodes in the cluster. This effectively adds the PARTITIONS='REPLICATED' option to the create table statement. See Replicated Tables.
Click Import Files.
The Table Preview screen presents sample rows of imported data. The importer assigns a data type based on sampling, but you should examine and modify the selections as appropriate. Assign the correct data type to ensure optimal performance. Immerse defaults to second precision for all timestamp columns. You can reset the precision to second, millisecond, nanosecond, or microsecond. If your column headers contain SQL reserved words, reserved characters (for example, year, /, or #), or spaces, the importer alters the characters to make them safe and notifies you of the changes. You can also change the column labels.
Name the table, and click Save Table.
You can also import locally stored shape files in a variety of formats. See Importing Geospatial Data Using Immerse.
To import data from your Amazon S3 instance, you need:
The Region and Path for the file in your S3 bucket, or the direct URL to the file (S3 Link).
If importing private data, your Access Key and Secret Key for your personal IAM account in S3.
For information on opening and reviewing items in your S3 instance, see https://docs.aws.amazon.com/AmazonS3/latest/gsg/OpeningAnObject.html
In an S3 bucket, the Region is in the upper-right corner of the screen – US West (N. California) in this case:
Click the file you want to import. To load your S3 file to HEAVY.AI using the steps for S3 Region | Bucket | Path, below, click Copy path to copy to your clipboard the path to your file within your S3 bucket. Alternatively, you can copy the link to your file. The Link in this example is https://s3-us-west-1.amazonaws.com/my-company-bucket/trip_data.7z
.
To learn about creating your S3 Access Key and Secret Key, see https://docs.aws.amazon.com/IAM/latest/UserGuide/id_credentials_access-keys.html#Using_CreateAccessKey
If the data you want to copy is publicly available, you do not need to provide an Access Key and Secret Key.
You can import any file you can see using your IAM account with your Access Key and Secret Key.
Your Secret Key is created with your Access Key, and cannot be retrieved afterward. If you lose your Secret Key, you must create a new Access Key and Secret Key.
Follow these steps to import your S3 data:
Click DATA MANAGER.
Click Import Data.
Click Import data from Amazon S3.
Choose whether to import using the S3 Region | Bucket | Path or a direct full link URL to the file (S3 Link).
To import data using S3 Region | Bucket | Path:
Select your Region from the pop-up menu.
Enter the unique name of your S3 Bucket.
Enter or paste the Path to the file stored in your S3 bucket.
To import data using S3 link:
Copy the Link URL from the file Overview in your S3 bucket.
Paste the link in the Full Link URL field of the HEAVY.AI Table Importer.
If the data is publicly available, you can disable the Private Data checkbox. If you are importing Private Data, enter your credentials:
Enable the Private Data checkbox.
Enter your S3 Access Key.
Enter your S3 Secret Key.
Choose the appropriate Import Settings. HEAVY.AI supports only delimiter-separated formats such as CSV and TSV.
Null string: If you have substituted a string such as NULL for null values in your upload document, enter that string in the Null String field. The values are treated as null values on upload.
Delimiter Type: Delimiters are detected automatically. You can choose a specific delimiter, such as a comma or pipe.
Includes Header Row: HEAVY.AI tries to infer whether the first row contains headers or data (for example, if the first row has only strings and the rest of the table contains number values, the first row is inferred to be headers). If HEAVY.AI infers incorrectly, you have the option of manually indicating whether or not the first row contains headers.
Quoted String: Indicate whether your string fields are enclosed by quotes. Delimiter characters inside quotes are ignored.
Click Import Files.
The Table Preview screen presents sample rows of imported data. The importer assigns a data type based on sampling, but you should examine and modify the selections as appropriate. Assign the correct data type to ensure optimal performance. If your column headers contain SQL reserved words, reserved characters (for example, year, /, or #), or spaces, the importer alters the characters to make them safe and notifies you of the changes. You can also change the column labels.
Name the table, and click Save Table.
The Data Catalog provides access to sample datasets you can use to exercise data visualization features in Heavy Immerse. The selection of datasets continually changes, independent of product releases.
To import from the data catalog:
Open the Data Manager.
Click Data Catalog.
Use the Search box to locate a specific data set, or scroll to find the dataset you want to use. The Contains Geo toggle filters for data sets that contain Geographical information.
Click the Import button beneath the dataset you want to use.
Verify the table and column names in the Data Preview screen.
Click Import Data.
You can append additional data to an existing table.
To append data to a table:
Open Data Manager.
Select the table you want to append.
Click Append Data.
Click Import data from a local file.
Either click the plus sign (+) or drag your file(s) for upload. The column names and data types of the files you select must match the existing table. HEAVY.AI supports only delimiter-separated formats such as CSV and TSV. HEAVY.AI supports Latin-1 ASCII format and UTF-8. If you want to load data with another encoding (for example, UTF-16), convert the data to UTF-8 before loading it to HEAVY.AI. In addition to CSV, TSV, and TXT files, you can import compressed delimited files in TAR, ZIP, 7-ZIP, RAR, GZIP, BZIP2, or TGZ format.
Click Preview.
Click Import Settings
Choose Import Settings:
Null string: If, instead of using a blank for null cells in your upload document, you have substituted strings such as NULL, enter that string in the Null String field. The values are treated as null values on upload.
Delimiter Type: Delimiters are detected automatically. You can choose a specific delimiter, such as a comma, tab, or pipe.
Quoted String: Indicate whether your string fields are enclosed by quotes. Delimiter characters inside quotes are ignored.
Includes Header Row: HEAVY.AI tries to infer whether the first row contains headers or data (for example, if the first row has only strings and the rest of the table contains number values, the first row is inferred to be headers). If HEAVY.AI infers incorrectly, you have the option of manually indicating whether or not the first row contains headers.
Replicate Table: If you are importing non-geospatial data to a distributed database with more than one node, select this checkbox to replicate the table to all nodes in the cluster. This effectively adds the PARTITIONS='REPLICATED' option to the create table statement. See Replicated Tables.
Close Import Settings.
The Data Preview screen presents sample rows of imported data. The importer assigns a data type based on sampling, but you should examine and modify the selections as appropriate. Assign the correct data type to ensure optimal performance.
If your data contains column headers, verify they match the existing headers.
Click Import Data.
To append data from AWS, click Append Data, then follow the instructions for Loading S3 Data to HEAVY.AI.
Sometimes you might want to remove or replace the data in a table without losing the table definition itself.
To remove all data from a table:
Open Data Manager.
Select the table you want to truncate.
Click Delete All Rows.
A very scary red dialog box reminds you that the operation cannot be undone. Click DELETE TABLE ROWS.
Immerse displays the table information with a row count of 0.
You can drop a table entirely using Data Manager.
To delete a table:
Open Data Manager.
Select the table you want to delete.
Click DELETE TABLE.
A very scary red dialog box reminds you that the operation cannot be undone. Click DELETE TABLE.
Immerse deletes the table and returns you to the Data Manager TABLES list.
DDL - Users and Databases
HEAVY.AI has a default superuser named admin
with default password HyperInteractive
.
When you create or alter a user, you can grant superuser privileges by setting the is_super
property.
You can also specify a default database when you create or alter a user by using the default_db
property. During login, if a database is not specified, the server uses the default database assigned to that user. If no default database is assigned to the user and no database is specified during login, the heavyai
database is used.
When an administrator, superuser, or owner drops or renames a database, all current active sessions for users logged in to that database are invalidated. The users must log in again.
Similarly, when an administrator or superuser drops or renames a user, all active sessions for that user are immediately invalidated.
If a password includes characters that are nonalphanumeric, it must be enclosed in single quotes when logging in to heavysql. For example:
$HEAVYAI_PATH/bin/heavysql heavyai -u admin -p '77Heavy!9Ai'
For more information about users, roles, and privileges, see .
The following are naming convention requirements for HEAVY.AI objects, described in notation:
A NAME is [A-Za-z_][A-Za-z0-9\$_]*
A DASHEDNAME is [A-Za-z_][A-Za-z0-9\$_\-]*
An EMAIL is ([^[:space:]\"]+|\".+\")@[A-Za-z0-9][A-Za-z0-9\-\.]*\.[A-Za-z]+
User objects can use NAME, DASHEDNAME, or EMAIL format.
Role objects must use either NAME or DASHEDNAME format.
Database and column objects must use NAME format.
HEAVY.AI accepts (almost) any string enclosed in optional double quotation marks as the user name.
Examples:
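As a hedged sketch (the user names and property values here are hypothetical; the properties are those described above):
CREATE USER jason (password = 'HyperInteractive', is_super = 'false', default_db = 'heavyai');
CREATE USER "analyst@example.com" (password = 'StrongPassw0rd!', can_login = 'true');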
Example:
HEAVY.AI accepts (almost) any string enclosed in optional double quotation marks as the old or new user name.
Example:
Database names cannot include quotes, spaces, or special characters.
In Release 6.3.0 and later, database names are case insensitive. Duplicate database names will cause a failure when attempting to start HeavyDB 6.3.0 or higher. Check database names and revise as necessary to avoid duplicate names.
Example:
Example:
To alter a database, you must be the owner of the database or a HeavyDB superuser.
Example:
Enable super users to change the owner of a database.
Change the owner of my_database
to user Joe
:
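In sketch form:
ALTER DATABASE my_database OWNER TO Joe;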
Only superusers can run the ALTER DATABASE OWNER TO command.
Changes ownership of database objects (tables, views, dashboards, etc.) from a user or set of users in the current database to a different user.
Example: Reassign database objects owned by jason
and mike
to joe
.
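A hedged sketch, assuming the REASSIGN OWNED BY ... TO ... form of the command:
REASSIGN OWNED BY jason, mike TO joe;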
Database object ownership changes only for the currently connected database; objects in other databases are not affected. Ownership of the database itself is not affected. You must be a superuser to run this command.
You can use policies to provide row-level security (RLS) in HEAVY.AI.
Create an RLS policy for a user or role (<name>
); admin rights are required. All queries on the table for the user or role are automatically filtered to include only rows where the column contains any one of the values from the VALUES clause.
RLS filtering works the same way that a WHERE column = value clause appended to every query or subquery on the table would work. If policies on multiple columns in the same table are defined for a user or role, a row is visible to that user or role if any one or more of those policies matches the row.
Drop an RLS policy for a user or role (<name>
); admin rights are required. All values specified for the column by the policy are dropped. Effective values from another policy on an inherited role are not dropped.
Displays a list of all RLS policies that exist for a user or role. If EFFECTIVE is used, the list also includes any policies that exist for all roles that apply to the requested user or role.
Datatypes and Fixed Encoding
This topic describes standard datatypes and space-saving variations for values stored in HEAVY.AI.
Each HEAVY.AI datatype uses space in memory and on disk. For certain datatypes, you can use fixed encoding for a more compact representation of these values. You can set a default value for a column by using the DEFAULT
constraint; for more information, see .
Datatypes, variations, and sizes are described in the following table.
[1] - In OmniSci release 4.4.0 and higher, you can use existing 8-byte DATE
columns, but you can create only 4-byte DATE
columns (default) and 2-byte DATE
columns (see DATE ENCODING DAYS(16)
).
HEAVY.AI does not support geometry arrays.
Timestamp values are always stored in 8 bytes. The greater the precision, the narrower the range of representable dates.
HEAVY.AI supports the LINESTRING, MULTILINESTRING
, POLYGON, MULTIPOLYGON
, POINT
, and MULTIPOINT
geospatial datatypes.
In the following example:
p0, p1, ls0, and poly0 are simple (planar) geometries.
p4 is a point geometry with Web Mercator coordinates.
p2, p3, mp, ls1, ls2, mls1, mls2, poly1, and mpoly0 are geometries using WGS84 SRID=4326 longitude/latitude coordinates.
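A sketch of the kind of table definition being described; the column names come from the text above, while the exact type and encoding clauses are assumptions:
CREATE TABLE geo_example (
  p0 POINT,
  p1 POINT,
  ls0 LINESTRING,
  poly0 POLYGON,
  p4 GEOMETRY(POINT, 900913),
  p2 GEOMETRY(POINT, 4326),
  p3 GEOMETRY(POINT, 4326) ENCODING NONE,
  mp GEOMETRY(MULTIPOINT, 4326),
  ls1 GEOMETRY(LINESTRING, 4326),
  ls2 GEOMETRY(LINESTRING, 4326) ENCODING NONE,
  mls1 GEOMETRY(MULTILINESTRING, 4326),
  mls2 GEOMETRY(MULTILINESTRING, 4326) ENCODING NONE,
  poly1 GEOMETRY(POLYGON, 4326),
  mpoly0 GEOMETRY(MULTIPOLYGON, 4326));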
Geometry storage requirements are largely dependent on coordinate data. Coordinates are normally stored as 8-byte doubles, two coordinates per point, for all points that form a geometry. Each POINT geometry in the p1 column, for example, requires 16 bytes.
WGS84 (SRID 4326) coordinates are compressed to 32 bits by default. This sacrifices some precision but reduces storage requirements by half.
For example, columns p2, mp, ls1, mls1, poly1, and mpoly0 in the table defined above are compressed. Each geometry in the p2 column requires 8 bytes, compared to 16 bytes for p0.
You can explicitly disable compression. WGS84 columns p3, ls2, and mls2 are not compressed and continue using doubles. Simple (planar) columns p0, p1, ls0, and poly0, and the non-4326 column p4, are not compressed.
Define datatype arrays by appending square brackets, as shown in the arrayexamples
DDL sample.
You can also define fixed-length arrays. For example:
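A minimal sketch (table and column names hypothetical):
CREATE TABLE fixed_array_example (
  pt_coords DOUBLE[2],   -- fixed-length array of exactly two doubles
  flags INTEGER[3]);     -- fixed-length array of exactly three integers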
Fixed-length arrays require less storage space than variable-length arrays.
To use fixed-length fields, the range of the data must fit into the constraints as described. Understanding your schema and the scope of potential values in each field helps you to apply fixed encoding types and save significant storage space.
These encodings are most effective on low-cardinality TEXT
fields, where you can achieve large savings of storage space and improved processing speed, and on TIMESTAMP
fields where the timestamps range between 1901-12-13 20:45:53
and 2038-01-19 03:14:07
. If a TEXT ENCODING
field does not match the defined cardinality, HEAVY.AI substitutes a NULL
value and logs the change.
For DATE
types, you can use the terms FIXED
and DAYS
interchangeably. Both are synonymous for the DATE
type in HEAVY.AI.
Some of the INTEGER options overlap. For example, INTEGER ENCODING FIXED(8) and SMALLINT ENCODING FIXED(8) are essentially identical.
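A hedged sketch that combines several of the encodings discussed here (table and column names hypothetical):
CREATE TABLE encoding_example (
  carrier TEXT ENCODING DICT(8),           -- low-cardinality text
  dep_time TIMESTAMP ENCODING FIXED(32),   -- timestamps between 1901-12-13 and 2038-01-19
  flight_date DATE ENCODING DAYS(16),      -- dates within roughly +/-90 years of the epoch
  distance INTEGER ENCODING FIXED(16));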
You can improve performance of string operations and optimize storage using shared dictionaries. You can share dictionaries within a table or between different tables in the same database. The table with which you want to share dictionaries must exist when you create the table that references the TEXT ENCODING DICT
field, and the column that you are referencing in that table must also exist. The following small DDL shows the basic structure:
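A minimal sketch of that structure (table and column names hypothetical):
CREATE TABLE lookup_table (city TEXT ENCODING DICT(32));
CREATE TABLE fact_table (
  id BIGINT,
  city TEXT,
  SHARED DICTIONARY (city) REFERENCES lookup_table(city));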
In the table definition, make sure that referenced columns appear before the referencing columns.
For example, this DDL is a portion of the schema for the flights database. Because airports are both origin and destination locations, it makes sense to reuse the same dictionaries for name, city, state, and country values.
To share a dictionary in a different existing table, replace the table name in the REFERENCES
instruction. For example, if you have an existing table called us_geography
, you can share the dictionary by following the pattern in the DDL fragment below.
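A hedged sketch of that pattern; the column names state and country in us_geography are assumptions:
CREATE TABLE site_visits (
  visit_id BIGINT,
  state TEXT,
  country TEXT,
  SHARED DICTIONARY (state) REFERENCES us_geography(state),
  SHARED DICTIONARY (country) REFERENCES us_geography(country));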
The referencing column cannot specify the encoding of the dictionary, because it uses the encoding from the referenced column.
Change a parameter value for the current session.
Switch to another database without needing to log in again.
The session silently switches to the requested database.
The database exists, but the user does not have access to it:
The database does not exist:
Force the session to run the subsequent SQL commands in CPU mode:
Switch the session back to GPU mode:
DDL - Views
A view is a virtual table based on the result set of a SQL statement. It derives its fields from a SELECT
statement. You can do anything with a HEAVY.AI view query that you can do in a non-view HEAVY.AI query.
View object names must use the NAME format, described in notation as:
Creates a view based on a SQL statement.
You can describe the view as you would a table.
You can query the view as you would a table.
Removes a view created by the CREATE VIEW statement. The view definition is removed from the database schema, but no actual data in the underlying base tables is modified.
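A minimal sketch of the view lifecycle (view, table, and column names hypothetical):
CREATE VIEW delayed_flights AS SELECT * FROM flights WHERE dep_delay > 60;
SELECT carrier_name, COUNT(*) FROM delayed_flights GROUP BY carrier_name;
DROP VIEW delayed_flights;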
HeavyDB system tables provide a way to access information about database objects, database object permissions, and system resource (storage, CPU, and GPU memory) utilization. These system tables can be found in the information_schema
database that is available by default on server startup. You can query system tables in the same way as regular tables, and you can use the SHOW CREATE TABLE
command to view the table schemas.
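For example, after connecting to the information_schema database, sketches of the kinds of queries you can run (column names as documented below):
SELECT user_name, is_super_user, default_db_name FROM users;
SHOW CREATE TABLE tables;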
The users
system table provides information about all database users and contains the following columns:
The databases
system table provides information about all created databases on the server and contains the following columns:
The permissions
system table provides information about all user/role permissions for all database objects and contains the following columns:
The roles
system table lists all created database roles and contains the following columns:
The tables
system table provides information about all database tables and contains the following columns:
The dashboards
system table provides information about created dashboards (enterprise edition only) and contains the following columns:
The role_assignments
system table provides information about database roles that have been assigned to users and contains the following columns:
The memory_summary
system table provides high level information about utilized memory across CPU and GPU devices and contains the following columns:
The memory_details
system table provides detailed information about allocated memory segments across CPU and GPU devices and contains the following columns:
The storage_details
system table provides detailed information about utilized storage per table and contains the following columns:
Log-based system tables are considered beta functionality in Release 6.1.0 and are disabled by default.
The request_logs
system table provides information about HeavyDB Thrift API requests and contains the following columns:
The server_logs
system table provides HeavyDB server logs in tabular form and contains the following columns:
The web_server_logs
system table provides HEAVY.AI Web Server logs in tabular form and contains the following columns (Enterprise Edition only):
The logs system tables must be refreshed manually to view new log entries. You can run the REFRESH FOREIGN TABLES
SQL command (for example, REFRESH FOREIGN TABLES server_logs, request_logs;
), or click the Refresh Data Now button on the table’s Data Manager page in Heavy Immerse.
The Request Logs and Monitoring system dashboard is built on the log-based system tables and provides visualization of request counts, performance, and errors over time, along with the server logs.
Access to system dashboards is controlled using Heavy Immerse privileges; only users with Admin privileges or users/roles with access to the information_schema
database can access the system dashboards.
Cross-linking must be enabled to allow cross-filtering across charts that use different system tables. Enable cross-linking by adding "ui/enable_crosslink_panel": true
to the feature_flags
section of the servers.json file.
Allows specification of one or more band names to selectively import; useful in the context of large raster files where not all the bands are relevant.
Bands are imported in the order provided, regardless of order in the file.
You can rename bands using <bandname>=<newname>[,<bandname>=<newname>,...]
Names must be those discovered by the , including any suffixes for de-duplication.
Fixed length encoding of integer or timestamp columns. See .
See in for a database security example.
[2] - See and below for information about geospatial datatype sizes.
For more information about geospatial datatypes and functions, see .
The web_server_access_logs
system table provides information about requests made to the Web Server. The table contains the following columns:
Preconfigured system dashboards are built on various system tables. Specifically, two dashboards named System Resources and User Roles and Permissions are available by default. The Request Logs and Monitoring system dashboard is considered beta functionality and is disabled by default. These dashboards can be found in the information_schema
database, along with the system tables that they use.
Property
Value
password
User's password.
is_super
Set to true if user is a superuser. Default is false.
default_db
User's default database on login.
can_login
Set to true (default/implicit) to activate a user.
When false, the user still retains all defined privileges and configuration settings, but cannot log in to HEAVY.AI. Deactivated users who try to log in receive the error message "Unauthorized Access: User is deactivated."
Property
Value
password
User's password.
is_super
Set to true if user is a superuser. Default is false.
default_db
User's default database on login.
can_login
Set to true (default/implicit) to activate a user.
When false, the user still retains all defined privileges and configuration settings, but cannot log in to HEAVY.AI. Deactivated users who try to log in receive the error message "Unauthorized Access: User is deactivated."
Property
Value
owner
User name of the database owner.
Column Name
Column Type
Description
user_id
INTEGER
ID of database user.
user_name
TEXT
Username of database user.
is_super_user
BOOLEAN
Indicates whether or not the database user is a super user.
default_db_id
INTEGER
ID of user’s default database on login.
default_db_name
TEXT
Name of user’s default database on login.
can_login
BOOLEAN
Indicates whether or not the database user account is activated and can log in.
Column Name
Column Type
Description
database_id
INTEGER
ID of database.
database_name
TEXT
Name of database.
owner_id
INTEGER
User ID of database owner.
owner_user_name
TEXT
Username of database owner.
Column Name
Column Type
Description
role_name
TEXT
Username or role name associated with permission.
is_user_role
BOOLEAN
Boolean indicating whether or not the role_name
column identifies a user or a role.
database_id
INTEGER
ID of database that contains the database object for which permission was granted.
database_name
TEXT
Name of database that contains the database object on which permission was granted.
object_name
TEXT
Name of database object on which permission was granted.
object_id
INTEGER
ID of database object on which permission was granted.
object_owner_id
INTEGER
User id of the owner of the database object on which permission was granted.
object_owner_user_name
TEXT
Username of the owner of the database object on which permission was granted.
object_permission_type
TEXT
Type of database object on which permission was granted.
object_permissions
TEXT[]
List of permissions that were granted on database object.
Column Name
Column Type
Description
role_name
TEXT
Role name.
Column Name
Column Type
Description
database_id
INTEGER
ID of database that contains the table.
database_name
TEXT
Name of database that contains the table.
table_id
INTEGER
Table ID.
table_name
TEXT
Table name.
owner_id
INTEGER
User ID of table owner.
owner_user_name
TEXT
Username of table owner.
column_count
INTEGER
Number of table columns. Note that internal system columns are included in this count.
table_type
TEXT
Type of table. Possible values are DEFAULT
, VIEW
, TEMPORARY
, and FOREIGN
.
view_sql
TEXT
For views, SQL statement used in the view.
max_fragment_size
INTEGER
Number of rows per fragment used by the table.
max_chunk_size
BIGINT
Maximum size (in bytes) of table chunks.
fragment_page_size
INTEGER
Size (in bytes) of table data pages.
max_rows
BIGINT
Maximum number of rows allowed by table.
max_rollback_epochs
INTEGER
Maximum number of epochs a table can be rolled back to.
shard_count
INTEGER
Number of shards that exists for table.
ddl_statement
TEXT
CREATE TABLE
DDL statement for table.
Column Name
Column Type
Description
database_id
INTEGER
ID of database that contains the dashboard.
database_name
TEXT
Name of database that contains the dashboard.
dashboard_id
INTEGER
Dashboard ID.
dashboard_name
TEXT
Dashboard name.
owner_id
INTEGER
User ID of dashboard owner.
owner_user_name
TEXT
Username of dashboard owner.
last_updated_at
TIMESTAMP
Timestamp of last dashboard update.
data_sources
TEXT[]
List of data sources/tables used by the dashboard.
Column Name
Column Type
Description
role_name
TEXT
Name of assigned role.
user_name
TEXT
Username of user that was assigned the role.
Column Name
Column Type
Description
node
TEXT
Node from which memory information is fetched.
device_id
INTEGER
Device ID.
device_type
TEXT
Type of device. Possible values are CPU
and GPU
.
max_page_count
BIGINT
Maximum number of memory pages that can be allocated on the device.
page_size
BIGINT
Size (in bytes) of a memory page on the device.
allocated_page_count
BIGINT
Number of allocated memory pages on the device.
used_page_count
BIGINT
Number of used allocated memory pages on the device.
free_page_count
BIGINT
Number of free allocated memory pages on the device.
Column Name
Column Type
Description
node
TEXT
Node from which memory information is fetched.
database_id
INTEGER
ID of database that contains the table that memory was allocated for.
database_name
TEXT
Name of database that contains the table that memory was allocated for.
table_id
INTEGER
ID of table that memory was allocated for.
table_name
TEXT
Name of table that memory was allocated for.
column_id
INTEGER
ID of column that memory was allocated for.
column_name
TEXT
Name of column that memory was allocated for.
chunk_key
INTEGER[]
ID of cached table chunk.
device_id
INTEGER
Device ID.
device_type
TEXT
Type of device. Possible values are CPU
and GPU
.
memory_status
TEXT
Memory segment use status. Possible values are FREE
and USED
.
page_count
BIGINT
Number of pages in the segment.
page_size
BIGINT
Size (in bytes) of a memory page on the device.
slab_id
INTEGER
ID of slab containing memory segment.
start_page
BIGINT
Page number of the first memory page in the segment.
last_touched_epoch
BIGINT
Epoch at which the segment was last accessed.
Column Name
Column Type
Description
node
TEXT
Node from which storage information is fetched.
database_id
INTEGER
ID of database that contains the table.
database_name
TEXT
Name of database that contains the table.
table_id
INTEGER
Table ID.
table_name
TEXT
Table Name.
epoch
INTEGER
Current table epoch.
epoch_floor
INTEGER
Minimum epoch table can be rolled back to.
fragment_count
INTEGER
Number of table fragments.
shard_id
INTEGER
Table shard ID. This value is only set for sharded tables.
data_file_count
INTEGER
Number of data files created for table.
metadata_file_count
INTEGER
Number of metadata files created for table.
total_data_file_size
BIGINT
Total size (in bytes) of data files.
total_data_page_count
BIGINT
Total number of pages across all data files.
total_free_data_page_count
BIGINT
Total number of free pages across all data files.
total_metadata_file_size
BIGINT
Total size (in bytes) of metadata files.
total_metadata_page_count
BIGINT
Total number of pages across all metadata files.
total_free_metadata_page_count
BIGINT
Total number of free pages across all metadata files.
total_dictionary_data_file_size
BIGINT
Total size (in bytes) of string dictionary files.
Column Name
Column Type
Description
log_timestamp
TIMESTAMP
Timestamp of log entry.
severity
TEXT
Severity level of log entry. Possible values are F (fatal), E (error), W (warning), and I (info).
process_id
INTEGER
Process ID of the HeavyDB instance that generated the log entry.
query_id
INTEGER
ID associated with a SQL query. A value of 0 indicates that either the log entry is unrelated to a SQL query or no query ID has been set for the log entry.
thread_id
INTEGER
ID of thread that generated the log entry.
file_location
TEXT
Source file name and line number where the log entry was generated.
api_name
TEXT
Name of Thrift API that the request was sent to.
request_duration_ms
BIGINT
Thrift API request duration in milliseconds.
database_name
TEXT
Request session database name.
user_name
TEXT
Request session username.
public_session_id
TEXT
Request session ID.
query_string
TEXT
Query string for SQL query requests.
client
TEXT
Protocol and IP address of client making the request.
dashboard_id
INTEGER
Dashboard ID for SQL query requests coming from Immerse dashboards.
dashboard_name
TEXT
Dashboard name for SQL query requests coming from Immerse dashboards.
chart_id
INTEGER
Chart ID for SQL query requests coming from Immerse dashboards.
execution_time_ms
BIGINT
Execution time in milliseconds for SQL query requests.
total_time_ms
BIGINT
Total execution time (execution_time_ms + serialization time) in milliseconds for SQL query requests.
Column Name
Column Type
Description
node
TEXT
Node containing logs.
log_timestamp
TIMESTAMP
Timestamp of log entry.
severity
TEXT
Severity level of log entry. Possible values are F (fatal), E (error), W (warning), and I (info).
process_id
INTEGER
Process ID of the HeavyDB instance that generated the log entry.
query_id
INTEGER
ID associated with a SQL query. A value of 0 indicates that either the log entry is unrelated to a SQL query or no query ID has been set for the log entry.
thread_id
INTEGER
ID of thread that generated the log entry.
file_location
TEXT
Source file name and line number where the log entry was generated.
message
TEXT
Log message.
Column Name
Column Type
Description
log_timestamp
TIMESTAMP
Timestamp of log entry.
severity
TEXT
Severity level of log entry. Possible values are fatal, error, warning, and info.
message
TEXT
Log message.
Column Name
Column Type
Description
ip_address
TEXT
IP address of client making the web server request.
log_timestamp
TIMESTAMP
Timestamp of log entry.
http_method
TEXT
HTTP request method.
endpoint
TEXT
Web server request endpoint.
http_status
SMALLINT
HTTP response status code.
response_size
BIGINT
Response payload size in bytes.
Datatype
Size (bytes)
Notes
BIGINT
8
Minimum value: -9,223,372,036,854,775,807
; maximum value: 9,223,372,036,854,775,807
.
BIGINT ENCODING FIXED(8)
1
Minimum value: -127
; maximum value: 127
BIGINT ENCODING FIXED(16)
2
Same as SMALLINT
.
BIGINT ENCODING FIXED(32)
4
Same as INTEGER
.
BOOLEAN
1
TRUE: 'true'
, '1'
, 't'
. FALSE: 'false'
, '0'
, 'f'
. Text values are not case-sensitive.
DATE
[1]
4
Same as DATE ENCODING DAYS(32)
.
DATE ENCODING DAYS(16)
2
Range in days: -32,768
- 32,767
Range in years: +/-90
around epoch, April 14, 1880 - September 9, 2059.
Minimum value: -2,831,155,200
; maximum value: 2,831,068,800
.
Supported formats when using COPY FROM
: mm/dd/yyyy
, dd-mmm-yy
, yyyy-mm-dd
, dd/mmm/yyyy
.
DATE ENCODING DAYS(32)
4
Range in years: +/-5,883,517
around epoch. Maximum date January 1, 5885487 (approximately). Minimum value: -2,147,483,648
; maximum value: 2,147,483,647
. Supported formats when using COPY FROM
: mm/dd/yyyy
, dd-mmm-yy
, yyyy-mm-dd
, dd/mmm/yyyy
.
DATE ENCODING FIXED(16)
2
In DDL statements defaults to DATE ENCODING DAYS(16)
. Deprecated.
DATE ENCODING FIXED(32)
4
In DDL statements defaults to DATE ENCODING DAYS(32)
. Deprecated.
DECIMAL
2, 4, or 8
Takes precision and scale parameters: DECIMAL(precision,scale)
Size depends on precision:
Up to 4
: 2 bytes
5
to 9
: 4 bytes
10
to 18
(maximum): 8 bytes
Scale must be less than precision.
DOUBLE
8
Variable precision. Minimum value: -1.79e308
; maximum value: 1.79e308
EPOCH
8
Seconds ranging from -30610224000
(1/1/1000 00:00:00
) through 185542587100800
(1/1/5885487 23:59:59
).
FLOAT
4
Variable precision. Minimum value: -3.4e38
; maximum value: 3.4e38
.
INTEGER
4
Minimum value: -2,147,483,647
; maximum value: 2,147,483,647
.
INTEGER ENCODING FIXED(8)
1
Minimum value: -127
; maximum value: 127
.
INTEGER ENCODING FIXED(16)
2
Same as SMALLINT
.
LINESTRING
Variable[2]
Geospatial datatype. A sequence of 2 or more points and the lines that connect them. For example: LINESTRING(0 0,1 1,1 2)
MULTILINESTRING
Variable[2]
Geospatial datatype. A set of associated lines. For example: MULTILINESTRING((0 0, 1 0, 2 0), (0 1, 1 1, 2 1))
MULTIPOINT
Variable[2]
Geospatial datatype. A set of points. For example: MULTIPOINT((0 0), (1 0), (2 0))
MULTIPOLYGON
Variable[2]
Geospatial datatype. A set of one or more polygons. For example:MULTIPOLYGON(((0 0,4 0,4 4,0 4,0 0),(1 1,2 1,2 2,1 2,1 1)), ((-1 -1,-1 -2,-2 -2,-2 -1,-1 -1)))
POINT
Variable[2]
Geospatial datatype. A point described by two coordinates. When the coordinates are longitude and latitude, HEAVY.AI stores longitude first, and then latitude. For example: POINT(0 0)
POLYGON
Variable[2]
Geospatial datatype. A set of one or more rings (closed line strings), with the first representing the shape (external ring) and the rest representing holes in that shape (internal rings). For example: POLYGON((0 0,4 0,4 4,0 4,0 0),(1 1, 2 1, 2 2, 1 2,1 1))
SMALLINT
2
Minimum value: -32,767
; maximum value: 32,767
.
SMALLINT ENCODING FIXED(8)
1
Minimum value: -127
; maximum value: 127
.
TEXT ENCODING DICT
4
Max cardinality 2 billion distinct string values. Maximum string length is 32,767.
TEXT ENCODING DICT(8)
1
Max cardinality 255 distinct string values.
TEXT ENCODING DICT(16)
2
Max cardinality 64 K distinct string values.
TEXT ENCODING NONE
Variable
Size of the string + 6 bytes. Maximum string length is 32,767.
Note: Importing TEXT ENCODING NONE
fields using the Data Manager has limitations for Immerse. When you use string instead of string [dict. encode] for a column when importing, you cannot use that column in Immerse dashboards.
TIME
8
Minimum value: 00:00:00
; maximum value: 23:59:59
.
TIME ENCODING FIXED(32)
4
Minimum value: 00:00:00
; maximum value: 23:59:59
.
TIMESTAMP(0)
8
Linux timestamp from -30610224000
(1/1/1000 00:00:00
) through 29379542399
(12/31/2900 23:59:59
). Can also be inserted and stored in human-readable format: YYYY-MM-DD HH:MM:SS
or YYYY-MM-DDTHH:MM:SS
(the T
is dropped when the field is populated).
TIMESTAMP(3) (milliseconds)
8
Linux timestamp from -30610224000000
(1/1/1000 00:00:00.000
) through 29379542399999
(12/31/2900 23:59:59.999
). Can also be inserted and stored in human-readable format: YYYY-MM-DD HH:MM:SS.fff
or YYYY-MM-DDTHH:MM:SS.fff
(the T
is dropped when the field is populated).
TIMESTAMP(6) (microseconds)
8
Linux timestamp from -30610224000000000
(1/1/1000 00:00:00.000000
) through 29379542399999999
(12/31/2900 23:59:59.999999
). Can also be inserted and stored in human-readable format: YYYY-MM-DD HH:MM:SS.ffffff
or YYYY-MM-DDTHH:MM:SS.ffffff
(the T
is dropped when the field is populated).
TIMESTAMP(9) (nanoseconds)
8
Linux timestamp from -9223372036854775807
(09/21/1677 00:12:43.145224193
) through 9223372036854775807
(11/04/2262 23:47:16.854775807
). Can also be inserted and stored in human-readable format: YYYY-MM-DD HH:MM:SS.fffffffff
or YYYY-MM-DDTHH:MM:SS.fffffffff
(the T
is dropped when the field is populated).
TIMESTAMP ENCODING FIXED(32)
4
Range: 1901-12-13 20:45:53
- 2038-01-19 03:14:07
. Can also be inserted and stored in human-readable format: YYYY-MM-DD HH:MM:SS
or YYYY-MM-DDTHH:MM:SS
(the T
is dropped when the field is populated).
TINYINT
1
Minimum value: -127
; maximum value: 127
.
EXECUTOR_DEVICE
CPU - Set the session to CPU execution mode:
ALTER SESSION SET EXECUTOR_DEVICE='CPU';
GPU - Set the session to GPU execution mode:
ALTER SESSION SET EXECUTOR_DEVICE='GPU';
NOTE: These parameter values have the same effect as the \cpu
and \gpu
commands in heavysql, but can be used with any tool capable of running sql commands.
CURRENT_DATABASE
Can be set to any string value.
If the value is a valid database name, and the current user has access to it, the session switches to the new database. If the user does not have access or the database does not exist, an error is returned and the session will fall back to the starting database.
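For example, to switch the session to a database named my_other_db (name hypothetical):
ALTER SESSION SET CURRENT_DATABASE='my_other_db';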
Deletes rows that satisfy the WHERE
clause from the specified table. If the WHERE clause is absent, all rows in the table are deleted, resulting in a valid but empty table.
In Release 6.4 and higher, you can run DELETE queries across tables in different databases on the same HEAVY.AI cluster without having to first connect to those databases.
To execute queries against another database, you must have ACCESS privilege on that database, as well as DELETE privilege.
Delete rows from a table in the my_other_db
database:
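A sketch of the form (table and filter hypothetical):
DELETE FROM my_other_db.customers WHERE status = 'inactive';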
Interrupt a queued query. Specify the query by using its session ID.
To see the queries in the queue, use the SHOW QUERIES command:
To interrupt the last query in the list (ID 946-ooNP
):
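In sketch form, using the session ID shown above:
KILL QUERY '946-ooNP';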
Showing the queries again indicates that 946-ooNP
has been deleted:
KILL QUERY is only available if the runtime query interrupt parameter (enable-runtime-query-interrupt
) is set.
Interrupting a query in ‘PENDING_QUEUE’ status is supported in both distributed and single-server mode.
To enable query interrupt for tables imported from data files in local storage, set enable_non_kernel_time_query_interrupt
to TRUE. (It is enabled by default.)
Use INSERT for both single- and multi-row ad hoc inserts. (When inserting many rows, use the more efficient COPY command.)
You can also insert into a table as SELECT, as shown in the following examples:
You can insert array literals into array columns. The inserts in the following example each have three array values, and demonstrate how you can:
Create a table with variable-length and fixed-length array columns.
Insert NULL
arrays into these colums.
Specify and insert array literals using {...}
or ARRAY[...]
syntax.
Insert empty variable-length arrays using {}
and ARRAY[]
syntax.
Insert array values that contain NULL
elements.
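A hedged sketch that exercises each of the points above (table and column names hypothetical):
CREATE TABLE array_example (
  vals INTEGER[],       -- variable-length array
  tags TEXT[],          -- variable-length dictionary-encoded text array
  triplet INTEGER[3]);  -- fixed-length array
-- NULL arrays:
INSERT INTO array_example VALUES (NULL, NULL, NULL);
-- Array literals using {...} and ARRAY[...] syntax:
INSERT INTO array_example VALUES ({1, 2, 3}, ARRAY['a', 'b'], ARRAY[10, 20, 30]);
-- Empty variable-length arrays:
INSERT INTO array_example VALUES ({}, ARRAY[], {4, 5, 6});
-- Array values that contain NULL elements:
INSERT INTO array_example VALUES ({7, NULL, 9}, {'x', NULL}, {NULL, 1, 2});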
If you create a table with column that has a default value, or alter a table to add a column with a default value, using the INSERT command creates a record that includes the default value if it is omitted from the INSERT. For example, assume a table created as follows:
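A sketch of such a table (column types are assumptions; the column names and default value come from the example that follows):
CREATE TABLE tbl (id INTEGER, name TEXT DEFAULT 'John Doe', age SMALLINT);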
If you omit the name column from an INSERT or INSERT FROM SELECT statement, the missing value for column name
is set to 'John Doe'
.
INSERT INTO tbl (id, age) VALUES (1, 36);
creates the record 1|'John Doe'|36
.
INSERT INTO tbl (id, age) SELECT id, age FROM old_tbl;
also sets all the name values to John Doe
.
Expression
Description
LIKELY(X)
Provides a hint to the query planner that argument X
is a Boolean value that is usually true. The planner can prioritize filters on the value X
earlier in the execution cycle and return results more efficiently.
UNLIKELY(X)
Provides a hint to the query planner that argument X
is a Boolean value that is usually not true. The planner can prioritize filters on the value X
later in the execution cycle and return results more efficiently.
Usage Notes
SQL normally assumes that terms in the WHERE
clause that cannot be used by indices are usually true. If this assumption is incorrect, it could lead to a suboptimal query plan. Use the LIKELY(X)
and UNLIKELY(X)
SQL functions to provide hints to the query planner about clause terms that are probably not true, which helps the query planner to select the best possible plan.
Use LIKELY
/UNLIKELY
to optimize evaluation of OR
/AND
logical expressions. LIKELY
/UNLIKELY
causes the left side of an expression to be evaluated first. This allows the right side of the query to be skipped when possible. For example, in the clause UNLIKELY(A) AND B
, if A
evaluates to FALSE
, B
does not need to be evaluated.
Consider the following:
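A sketch matching the description that follows (table and column names hypothetical):
SELECT COUNT(*) FROM test WHERE UNLIKELY(x IN (7, 8, 9, 10)) AND y > 42;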
If x
is one of the values 7
, 8
, 9
, or 10
, the filter y > 42
is applied. If x
is not one of those values, the filter y > 42
is not applied.
If a join column name or alias is not unique, it must be prefixed by its table name.
You can use BIGINT, INTEGER, SMALLINT, TINYINT, DATE, TIME, TIMESTAMP, or TEXT ENCODING DICT data types. TEXT ENCODING DICT is the most efficient because corresponding dictionary IDs are sequential and span a smaller range than, for example, the 65,535 values supported in a SMALLINT field. Depending on the number of values in your field, you can use TEXT ENCODING DICT(32) (up to approximately 2,150,000,000 distinct values), TEXT ENCODING DICT(16) (up to 64,000 distinct values), or TEXT ENCODING DICT(8) (up to 255 distinct values). For more information, see Data Types and Fixed Encoding.
When possible, joins involving a geospatial operator (such as ST_Contains
) build a binned spatial hash table (overlaps hash join), falling back to a Cartesian loop join if a spatial hash join cannot be constructed.
The enable-overlaps-hashjoin
flag controls whether the system attempts to use the overlaps spatial join strategy (true
by default). If enable-overlaps-hashjoin
is set to false, or if the system cannot build an overlaps hash join table for a geospatial join operator, the system attempts to fall back to a loop join. Loop joins can be performant in situations where one or both join tables have a small number of rows. When both tables grow large, loop join performance decreases.
Two flags control whether or not the system allows loop joins for a query (geospatial or not): allow-loop-joins
and trivial-loop-join-threshold
. By default, allow-loop-joins
is set to false
and trivial-loop-join-threshold
to 1,000 (rows). If allow-loop-joins
is set to true
, the system allows any query with a loop join, regardless of table cardinalities (measured in number of rows). If left to the implicit default of false
or set explicitly to false
, the system allows loop join queries as long as the inner table (right-side table) has fewer rows than the threshold specified by trivial-loop-join-threshold
.
For optimal performance, the system should utilize overlaps hash joins whenever possible. Use the following guidelines to maximize the use of the overlaps hash join framework and minimize fallback to loop joins when conducting geospatial joins:
The inner (right-side) table should always be the more complicated primitive. For example, for ST_Contains(polygon, point)
, the point table should be the outer (left) table and the polygon table should be the inner (right) table.
Currently, ST_CONTAINS
and ST_INTERSECTS
joins between point and polygons/multi-polygon tables, and ST_DISTANCE < {distance}
between two point tables are supported for accelerated overlaps hash join queries.
For pointwise-distance joins, only the pattern WHERE ST_DISTANCE(table_a.point_col, table_b.point_col) < distance_in_degrees
supports overlaps hash joins. Patterns like the following fall back to loop joins:
WHERE ST_DWITHIN(table_a.point_col, table_b.point_col, distance_in_degrees)
WHERE ST_DISTANCE(ST_TRANSFORM(table_a.point_col, 900913), ST_TRANSFORM(table_b.point_col, 900913)) < 100
You can create joins in a distributed environment in two ways:
Replicate small dimension tables that are used in the join.
Create a shard key on the column used in the join (note that there is a limit of one shard key per table). If the column involved in the join is a TEXT ENCODED field, you must create a SHARED DICTIONARY that references the FACT table key you are using to make the join.
The join order for one small table and one large table matters. If you swap the sales and customer tables on the join, it throws an exception stating that table "sales" must be replicated.
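A hedged sketch of the shard-key approach using the sales and customer tables mentioned above (column definitions are assumptions):
CREATE TABLE customer (
  customer_id BIGINT,
  customer_name TEXT ENCODING DICT(32),
  SHARD KEY (customer_id))
WITH (SHARD_COUNT = 4);
CREATE TABLE sales (
  sale_id BIGINT,
  customer_id BIGINT,
  amount DOUBLE,
  SHARD KEY (customer_id))
WITH (SHARD_COUNT = 4);
-- List the large fact table (sales) first; swapping the join order can raise
-- an exception stating that table "sales" must be replicated:
SELECT c.customer_name, SUM(s.amount)
FROM sales s JOIN customer c ON s.customer_id = c.customer_id
GROUP BY c.customer_name;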
Operator
Description
AND
Logical AND
NOT
Negates value
OR
Logical OR
Expression
Description
CASE WHEN condition THEN result ELSE default END
Case operator
COALESCE(val1, val2, ..)
Returns the first non-null value in the list
Geospatial and array column projections are not supported in the COALESCE
function and CASE expressions.
Expression
Description
expr IN (subquery or list of values)
Evaluates whether expr equals any value of the IN list.
expr NOT IN (subquery or list of values)
Evaluates whether expr does not equal any value of the IN list.
You can use a subquery anywhere an expression can be used, subject to any runtime constraints of that expression. For example, a subquery in a CASE statement must return exactly one row, but a subquery can return multiple values to an IN expression.
You can use a subquery anywhere a table is allowed (for example, FROM
subquery), using aliases to name any reference to the table and columns returned by the subquery.
The SELECT command returns a set of records from one or more tables.
For more information, see SELECT.
Sort order defaults to ascending (ASC).
Sorts null values after non-null values by default in an ascending sort, before non-null values in a descending sort. For any query, you can use NULLS FIRST to sort null values to the top of the results or NULLS LAST to sort null values to the bottom of the results.
Allows you to use a positional reference to choose the sort column. For example, the command SELECT colA,colB FROM table1 ORDER BY 2
sorts the results on colB
because it is in position 2.
HEAVY.AI provides various query hints for controlling the behavior of the query execution engine.
SELECT hints must appear first, immediately after the SELECT statement; otherwise, the query fails.
By default, a hint is applied to the query step in which it is defined. If you have multiple SELECT clauses and define a query hint in one of those clauses, the hint is applied only to that specific query step; the rest of the query steps are unaffected. For example, applying the /*+ cpu_mode */
hint affects only the SELECT clause in which it exists.
You can define a hint to apply to all query steps by prepending g_
to the query hint. For example, if you define /*+ g_cpu_mode */
, CPU execution is applied to all query steps.
HEAVY.AI supports the following query hints.
The marker hint type represents a Boolean flag.
allow_loop_join
Enable loop joins.
SELECT /*+ allow_loop_join */ ...
cpu_mode
Force CPU execution mode.
SELECT /*+ cpu_mode */ ...
columnar_output
Enable columnar output for the input query.
SELECT /*+ columnar_output */ ...
disable_loop_join
Disable loop joins.
SELECT /*+ disable_loop_join */ ...
dynamic_watchdog
Enable dynamic watchdog.
SELECT /*+ dynamic_watchdog */ ...
dynamic_watchdog_off
Disable dynamic watchdog.
SELECT /*+ dynamic_watchdog_off */ ...
keep_result
Add result set of the input query to the result set cache.
SELECT /*+ keep_result */ ...
keep_table_function_result
Add result set of the table function query to the result set cache.
SELECT /*+ keep_table_function_result */ ...
overlaps_allow_gpu_build
Use GPU (if available) to build an overlaps join hash table. (CPU is used by default.)
SELECT /*+ overlaps_allow_gpu_build */ ...
overlaps_no_cache
Skip adding an overlaps join hash table to the hash table cache.
SELECT /*+ overlaps_no_cache */ ...
rowwise_output
Enable row-wise output for the input query.
SELECT /*+ rowwise_output */ ...
watchdog
Enable watchdog.
SELECT /*+ watchdog */ ...
watchdog_off
Disable watchdog.
SELECT /*+ watchdog_off */ ...
The key-value pair type is a hint name and its value.
aggregate_tree_fanout
Defines the fanout of the tree used to compute window aggregations over a frame. Depending on the frame size, the tree fanout affects the performance of both the aggregation and the tree construction for each window function with a frame clause.
Value type: INT
Range: 0-1024
SELECT /*+ aggregate_tree_fanout(32) */ SUM(y) OVER (ORDER BY x ROWS BETWEEN ...) ...
loop_join_inner_table_max_num_rows
Set the maximum number of rows available for a loop join.
Value type: INT
Range: 0 < x
Set the maximum number of rows to 100:
SELECT /*+ loop_join_inner_table_max_num_rows(100) */ ...
max_join_hash_table_size
Set the maximum size of the hash table.
Value type: INT
Range: 0 < x
Set the maximum size of the join hash table to 100:
SELECT /*+ max_join_hash_table_size(100) */ ...
overlaps_bucket_threshold
Set the overlaps bucket threshold.
Value type: DOUBLE
Range: 0-90
Set the overlaps threshold to 10:
SELECT /*+ overlaps_bucket_threshold(10.0) */ ...
overlaps_max_size
Set the maximum overlaps size.
Value type: INTEGER
Range: >=0
Set the maximum overlap to 10:
SELECT /*+ overlaps_max_size(10) */ ...
overlaps_keys_per_bin
Set the number of overlaps keys per bin.
Value type: DOUBLE
Range: 0.0 < x < double::max
SELECT /*+ overlaps_keys_per_bin(0.1) */ ...
query_time_limit
Set the maximum time for the query to run.
Value type: INTEGER
Range: >=0
SELECT /*+ query_time_limit(1000) */ ...
In Release 6.4 and higher, you can run SELECT queries across tables in different databases on the same HEAVY.AI cluster without having to first connect to those databases. This enables more efficient storage and memory utilization by eliminating the need for table duplication across databases, and simplifies access to shared data and tables.
To execute queries against another database, you must have ACCESS privilege on that database, as well as SELECT privilege.
Execute a join query involving a table in the current database and another table in the my_other_db
database:
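A sketch of the form (table and column names hypothetical):
SELECT t1.order_id, t2.status
FROM local_orders t1
JOIN my_other_db.remote_customers t2 ON t1.customer_id = t2.customer_id;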
Use SHOW
commands to get information about databases, tables, and user sessions.
Shows the CREATE SERVER statement that could have been used to create the server.
Shows the CREATE TABLE statement that could have been used to create the table.
Retrieve the databases accessible for the current user, showing the database name and owner.
Show registered compile-time UDFs and extension functions in the system and their arguments.
Displays a list of all row-level security (RLS) policies that exist for a user or role; admin rights are required. If EFFECTIVE is used, the list also includes any policies that exist for all roles that apply to the requested user or role.
Returns a list of queued queries in the system; information includes session ID, status, query string, account login name, client address, database name, and device type (CPU or GPU).
Admin users can see and interrupt all queries; non-admin users can see and interrupt only their own queries.
NOTE: SHOW QUERIES is only available if the runtime query interrupt parameter (enable-runtime-query-interrupt
) is set.
To interrupt a query in the queue, see KILL QUERY.
If included with a name, lists the role granted directly to a user or role. SHOW EFFECTIVE ROLES with a name lists the roles directly granted to a user or role, and also lists the roles indirectly inherited through the directly granted roles.
If the user name or role name is omitted, then a regular user sees their own roles, and a superuser sees a list of all roles existing in the system.
Show user-defined runtime functions and table functions.
Show data connectors.
Displays storage-related information for a table, such as the table ID/name, number of data/metadata files used by the table, total size of data/metadata files, and table epoch values.
You can see table details for all tables that you have access to in the current database, or for only those tables you specify.
Show details for all tables you have access to:
Show details for table omnisci_states
:
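In sketch form:
SHOW TABLE DETAILS;
SHOW TABLE DETAILS omnisci_states;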
The number of columns returned includes system columns. As a result, the number of columns in column_count
can be up to two greater than the number of columns created by the user.
Displays the list of available system (built-in) table functions.
For more information, see System Table Functions.
Show detailed output information for the specified table function. Output details vary depending on the table function specified.
View SHOW output for the generate_series
table function:
name
generate_series
signature
(i64 series_start, i64 series_stop, i64 series_step)
(i64 series_start, i64 series_stop) -> Column
input_names
series_start, series_stop, series_step
series_start, series_stop
input_types
i64
output_names
generate_series
output_types
Column i64
CPU
true
GPU
true
runtime
false
filter_table_transpose
false
Retrieve the servers accessible for the current user.
Retrieve the tables accessible for the current user.
Lists name, ID, and default database for all or specified users for the current database. If the command is issued by a superuser, login permission status is also shown. Only superusers see users who do not have permission to log in.
SHOW [ALL] USER DETAILS lists name, ID, superuser status, default database, and login permission status for all users across the HeavyDB instance. This variant of the command is available only to superusers. Regular users who run the SHOW ALL USER DETAILS command receive an error message.
Show all user details for all users:
Show all user details for specified users ue, ud, ua, and uf:
If a specified user is not found, the superuser sees an error message:
Show user details for specified users ue, ud, and uf:
Show user details for all users:
Running SHOW ALL USER DETAILS results in an error message:
Show user details for all users:
If a specified user is not found, the user sees an error message:
Show user details for user ua:
Retrieve all persisted user sessions, showing the session ID, user login name, client address, and database name. Admin or superuser privileges required.
Interrupt a queued query. Specify the query by using its session ID.
To see the queries in the queue, use the SHOW QUERIES command:
To interrupt the last query in the list (ID 946-ooNP
):
Showing the queries again indicates that 946-ooNP
has been deleted:
KILL QUERY is only available if the runtime query interrupt parameter (enable-runtime-query-interrupt
) is set.
Interrupting a query in ‘PENDING_QUEUE’ status is supported in both distributed and single-server mode.
To enable query interrupt for tables imported from data files in local storage, set enable_non_kernel_time_query_interrupt
to TRUE. (It is enabled by default.)
HEAVY.AI supports arrays in dictionary-encoded text and number fields (TINYINT, SMALLINT, INTEGER, BIGINT, FLOAT, and DOUBLE). Data stored in arrays are not normalized. For example, {green,yellow} is not the same as {yellow,green}. As with many SQL-based services, HEAVY.AI array indexes are 1-based.
HEAVY.AI supports NULL variable-length arrays for all integer and floating-point data types, including dictionary-encoded string arrays. For example, you can insert NULL
into BIGINT[ ], DOUBLE[ ], or TEXT[ ] columns. HEAVY.AI supports NULL fixed-length arrays for all integer and floating-point data types, but not for dictionary-encoded string arrays. For example, you can insert NULL
into BIGINT[2] or DOUBLE[3] columns, but not into TEXT[2] columns.
ArrayCol[n] ...
Returns value(s) from specific location n
in the array.
UNNEST(ArrayCol)
Extract the values in the array to a set of rows. Requires GROUP BY
; projecting UNNEST
is not currently supported.
test = ANY ArrayCol
ANY
compares a scalar value with a single row or set of values in an array, returning results in which at least one item in the array matches. ANY
must be preceded by a comparison operator.
test = ALL ArrayCol
ALL
compares a scalar value with a single row or set of values in an array, returning results in which all records in the array field are compared to the scalar value. ALL
must be preceded by a comparison operator.
CARDINALITY()
Returns the number of elements in an array. For example:
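A minimal sketch (the column name arr_int is an assumption):
SELECT CARDINALITY(arr_int) FROM test_array;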
The following examples show query results based on the table test_array
created with the following statement:
The following queries use arrays in an INTEGER field:
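The original statements are not reproduced here; a comparable sketch, with assumed column names and values, might look like the following:
CREATE TABLE test_array (name TEXT ENCODING DICT(32), arr_int INTEGER[]);
INSERT INTO test_array VALUES ('a', {1, 2, 3});
INSERT INTO test_array VALUES ('b', {4, 5, 6});
SELECT name FROM test_array WHERE arr_int[1] = 1;
SELECT name FROM test_array WHERE 5 = ANY arr_int;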
Functions and Operators (DML)
Parenthesization
Multiplication and division
Addition and subtraction
Usage Notes
The following wildcard characters are supported by LIKE
and ILIKE
:
%
matches any number of characters, including zero characters.
_
matches exactly one character.
Supported date_part types:
Supported interval types:
For two-digit years, years 69-99 are assumed to be previous century (for example, 1969), and 0-68 are assumed to be current century (for example, 2016).
For four-digit years, negative years (BC) are not supported.
Hours are expressed in 24-hour format.
When time components are separated by colons, you can write them as one or two digits.
Months are case insensitive. You can spell them out or abbreviate to three characters.
For timestamps, decimal seconds are ignored. Time zone offsets are written as +/-HHMM.
For timestamps, a numeric string is converted to +/- seconds since January 1, 1970. Supported timestamps range from -30610224000 (January 1, 1000) through 29379456000 (December 31, 2900).
On output, dates are formatted as YYYY-MM-DD. Times are formatted as HH:MM:SS.
Linux EPOCH values range from -30610224000 (1/1/1000) through 185542587100800 (1/1/5885487). Complete range in years: +/-5,883,517 around epoch.
Both double-precision (standard) and single-precision floating point statistical functions are provided. Single-precision functions run faster on GPUs but might cause overflow errors.
COUNT(DISTINCT
x
)
, especially when used in conjunction with GROUP BY, can require a very large amount of memory to keep track of all distinct values in large tables with large cardinalities. To avoid this large overhead, use APPROX_COUNT_DISTINCT.
APPROX_COUNT_DISTINCT(
x
,
e
)
gives an approximate count of the value x, based on an expected error rate defined in e. The error rate is an integer value from 1 to 100. The lower the value of e, the higher the precision, and the higher the memory cost. Select a value for e based on the level of precision required. On large tables with large cardinalities, consider using APPROX_COUNT_DISTINCT
when possible to preserve memory. When data cardinalities permit, OmniSci uses the precise implementation of COUNT(DISTINCT
x
)
for APPROX_COUNT_DISTINCT
. Set the default error rate using the -hll-precision-bits
configuration parameter.
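For example, a sketch using the flights table referenced elsewhere in this documentation (the column name is an assumption):
SELECT APPROX_COUNT_DISTINCT(tail_num, 10) FROM flights_2008_10k;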
The accuracy of APPROX_MEDIAN(x) depends upon the distribution of data. For example:
For 100,000,000 integers (1, 2, 3, ... 100M) in random order, APPROX_MEDIAN can provide a highly accurate answer to 5+ significant digits.
For 100,000,001 integers, where 50,000,000 have value of 0 and 50,000,001 have value of 1, APPROX_MEDIAN returns a value close to 0.5, even though the median is 1.
Currently, OmniSci does not support grouping by non-dictionary-encoded strings. However, with the SAMPLE
aggregate function, you can select non-dictionary-encoded strings that are presumed to be unique in a group. For example:
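A sketch of such a query (the table name tweets and the grouping column user_name are assumptions; user_description is the none-encoded string column discussed below):
SELECT user_name, SAMPLE(user_description) FROM tweets GROUP BY user_name;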
If the aggregated column (user_description in the example above) is not unique within a group, SAMPLE
selects a value that might be nondeterministic because of the parallel nature of OmniSci query execution.
You can create your own C++ functions and use them in your SQL queries.
User-defined Functions (UDFs) require clang++ version 9. You can verify the version installed using the command clang++ --version
.
UDFs currently allow any authenticated user to register and execute a runtime function. By default, runtime UDFs are globally disabled but can be enabled with the runtime flag enable-runtime-udf
.
Create your function and save it in a .cpp file; for example, /var/lib/omnisci/udf_myFunction.cpp.
Add the UDF configuration flag to omnisci.conf. For example:
Use your function in a SQL query. For example:
This function, udf_diff.cpp, returns the difference of two values from a table.
Include the standard integer library, which supports the following datatypes:
bool
int8_t (cstdint), char
int16_t (cstdint), short
int32_t (cstdint), int
int64_t (cstdint), size_t
float
double
void
The next four lines are boilerplate code that allows OmniSci to determine whether the server is running with GPUs. OmniSci chooses whether it should compile the function inline to achieve the best possible performance.
The next line is the actual user-defined function, which returns the difference between INTEGER values x and y.
To run the udf_diff
function, add this line to your /var/lib/omnisci/omnisci.conf file (in this example, the .cpp file is stored at /var/lib/omnisci/udf_diff.cpp):
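Assuming the configuration parameter for registering a UDF source file is named udf (an assumption; verify against the configuration reference for your server version), the line would look like:
udf = "/var/lib/omnisci/udf_diff.cpp"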
Restart the OmniSci server.
Use your command from an OmniSci SQL client to query, for example, a table named myTable that contains the INTEGER columns myInt1
and myInt2
.
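For example:
SELECT udf_diff(myInt1, myInt2) FROM myTable;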
OmniSci returns the difference as an INTEGER value.
HEAVY.AI provides access to a set of system-provided table functions, also known as table-valued functions. System table functions, like user-defined table functions, support execution of queries on both CPU and GPU over one or more SQL result-set inputs. Table function support in HEAVY.AI can be split into two broad categories: system table functions and user-defined table functions (UDTFs). System table functions are built into the HEAVY.AI server, while UDTFs can be declared dynamically at runtime by specifying them in a subset of the Python language. For more information, see the documentation on user-defined table functions.
To improve performance, table functions can be declared to enable filter pushdown optimization, which allows the Calcite optimizer to "push down" filters on the output(s) of a table function to its input(s) when the inputs and outputs are declared to be semantically equivalent (for example, a longitude variable that is input to and output from a table function). This can significantly increase performance in cases where only a small portion of one or more input tables is required to compute the filtered output of a table function.
Whether system- or user-provided, table functions can execute over one or more result sets specified by subqueries, and can also take any number of additional constant literal arguments specified in the function definition. SQL subquery inputs can consist of any SQL expression (including multiple subqueries, joins, and so on) allowed by HeavyDB, and the output can be filtered, grouped by, joined, and so on like a normal SQL subquery, including being input into additional table functions by wrapping it in a CURSOR
argument. The number and types of input arguments, as well as the number and types of output arguments, are specified in the table function definition itself.
Table functions allow for the efficient execution of advanced algorithms that may be difficult or impossible to express in canonical SQL. By allowing execution of code directly over SQL result sets, leveraging the same hardware parallelism used for fast SQL execution and visualization rendering, HEAVY.AI provides orders-of-magnitude speed increases over the alternative of transporting large result sets to other systems for post-processing and then returning to HEAVY.AI for storage or downstream manipulation. You can easily invoke system-provided or user-defined algorithms directly inline with SQL and rendering calls, making prototyping and deployment of advanced analytics capabilities easier and more streamlined.
Table functions can take as input arguments both constant literals (including scalar results of subqueries) as well as results of other SQL queries (consisting of one or more rows). The latter (SQL query inputs), per the SQL standard, must be wrapped in the keyword CURSOR
. Depending on the table function, there can be 0, 1, or multiple CURSOR inputs. For example:
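For example, a call with one CURSOR input and one literal argument might look like the following (the function name and arguments here are hypothetical):
SELECT * FROM TABLE(my_table_function(CURSOR(SELECT x, y FROM input_table), 10));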
Certain table functions can take 1 or more columns of a specified type or types as inputs, denoted as ColumnList<TYPE1 | Type2... TypeN>
. Even if a function allows a ColumnList
input of multiple types, the arguments must be all of one type; types cannot be mixed. For example, if a function allows ColumnList<INT | TEXT ENCODING DICT>
, one or more columns of either INTEGER or TEXT ENCODING DICT can be used as inputs, but all must be either INT columns or TEXT ENCODING DICT columns.
All HEAVY.AI system table functions allow you to specify arguments either in conventional comma-separated form in the order specified by the table function signature, or alternatively via a key-value map where input argument names are mapped to argument values using the =>
token. For example, the following two calls are equivalent:
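For example, using the generate_series signature shown earlier:
SELECT * FROM TABLE(generate_series(1, 10, 2));
SELECT * FROM TABLE(generate_series(series_start => 1, series_stop => 10, series_step => 2));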
For performance reasons, particularly when table functions are used as actual tables in a client like Heavy Immerse, many system table functions in HEAVY.AI automatically "push down" filters on certain output columns in the query onto the inputs. For example, if a table does some computation over an x
and y
range such that x
and y
are in both the input and output for the table function, filter push-down would likely be enabled so that a query like the following would automatically push down the filter on the x and y outputs to the x and y inputs. This potentially increases query performance significantly.
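A sketch of such a query, using a hypothetical table function my_spatial_tf whose x and y columns appear in both its input and output:
SELECT * FROM TABLE(my_spatial_tf(CURSOR(SELECT x, y, val FROM points_table))) WHERE x BETWEEN -120.0 AND -119.0 AND y BETWEEN 38.0 AND 39.0;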
To determine whether filter push-down is used, you can check the Boolean value of the filter_table_transpose
column from the query:
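For example:
SHOW TABLE FUNCTIONS DETAILS generate_series;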
Currently for system table functions, you cannot change push-down behavior.
You can query which table functions are available using SHOW TABLE FUNCTIONS
:
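SHOW TABLE FUNCTIONS;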
Information about the expected input and output argument names and types, as well as other information such as whether the function can run on CPU, GPU, or both, and whether filter push-down is enabled, can be queried via SHOW TABLE FUNCTIONS DETAILS <table_function_name>;
The following system table functions are available in HEAVY.AI. The table provides a summary and links to more information about each function.
The TABLE
command is required to wrap a table function clause; for example:
select * from TABLE(generate_series(1, 10));
The CURSOR
command is required to wrap any subquery inputs.
HEAVY.AI supports a subset of object types and functions for storing and writing queries for geospatial definitions.
For information about geospatial datatype sizes, see the datatypes documentation.
For more information on WKT primitives, see the Well-Known Text (WKT) documentation.
HEAVY.AI supports SRID 4326 (WGS 84), 900913 (Google Web Mercator), and 32601-32660/32701-32760 (Universal Transverse Mercator (UTM) zones). When using geospatial fields, you set the SRID to determine which reference system to use. HEAVY.AI does not assign a default SRID.
If you do not set the SRID of the geo field in the table, you can set it in a SQL query using ST_SETSRID(column_name, SRID)
. For example, ST_SETSRID(a.pt,4326)
.
When representing longitude and latitude, the first coordinate is assumed to be longitude in HEAVY.AI geospatial primitives.
You create geospatial objects as geometries (planar spatial data types), which are supported by the planar geometry engine at run time. When you call ST_DISTANCE
on two geometry objects, the engine returns the shortest straight-line planar distance, in degrees, between those points. For example, the following query returns the shortest distance between the point(s) in p1
and the polygon(s) in poly1
:
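A sketch of the query (the table name geo1 is an assumption; p1 and poly1 are the columns referenced above):
SELECT ST_DISTANCE(p1, poly1) FROM geo1;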
Geospatial functions that expect geospatial object arguments accept geospatial columns, geospatial objects returned by other functions, or string literals containing WKT representations of geospatial objects. Supplying a WKT string is equivalent to calling a geometry constructor. For example, these two queries are identical:
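For example, the following two queries are identical (table and column names are assumptions):
SELECT ST_DISTANCE('POINT(0 0)', p1) FROM geo1;
SELECT ST_DISTANCE(ST_GeomFromText('POINT(0 0)'), p1) FROM geo1;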
You can create geospatial literals with a specific SRID. For example:
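For example, a point literal with SRID 4326 (the coordinates are illustrative):
ST_GeomFromText('POINT(-71.064544 42.28787)', 4326)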
HEAVY.AI provides support for geography objects and geodesic distance calculations, with some limitations.
HeavyDB supports import from any coordinate system supported by the Geospatial Data Abstraction Library (GDAL). On import, HeavyDB will convert to and store in WGS84 encoding, and rendering is accurate in Immerse.
However, no built-in way to reference the original coordinates currently exists in Immerse, and coordinates exported from Immerse will be WGS84 coordinates. You can work around this limitation by adding to the dataset a column or columns in non-geo format that could be included for display in Immerse (for example, in a popup) or on export.
Currently, HEAVY.AI supports spheroidal distance calculation between:
Two points using either SRID 4326 or 900913.
A point and a polygon/multipolygon using SRID 900913.
Using SRID 900913 results in variance compared to SRID 4326 as polygons approach the North and South Poles.
The following query returns the points and polygons within 1,000 meters of each other:
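A sketch of such a query, modeled on the ST_DISTANCE/ST_TRANSFORM example that appears later in this section (table and column names are assumptions):
SELECT a.pt, b.poly FROM points a, polys b WHERE ST_DISTANCE(ST_TRANSFORM(a.pt, 900913), ST_TRANSFORM(b.poly, 900913)) < 1000;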
HEAVY.AI supports the functions listed.
You can use SQL code similar to the examples in this topic as global filters in Immerse.
CREATE TABLE AS SELECT
is not currently supported for geo data types in distributed mode.
GROUP BY
is not supported for geo types (POINT
, MULTIPOINT
, LINESTRING
, MULTILINESTRING
, POLYGON
, or MULTIPOLYGON
).
You can use \d table_name
to determine if the SRID is set for the geo field:
If no SRID is returned, you can set the SRID using ST_SETSRID(column_name, SRID)
. For example, ST_SETSRID(myPoint, 4326)
.
HEAVY.AI Free is a full-featured version of the HEAVY.AI platform available at no cost for non-hosted commercial use.
To get started with HEAVY.AI Free:
Go to the HEAVY.AI website, and in the HEAVY.AI Free section, click Get Free License.
On the Get HEAVY.AI Free page, enter your email address and click I Agree.
Open the HEAVY.AI Free Edition Activation Link email that you receive from HEAVY.AI, and click Click Here to view and download the free edition license. You will need this license to run HEAVY.AI after you install it. A copy of the license is also sent to your email.
In the What's Next section, click to select the best version of HEAVY.AI for your hardware and software configuration. Follow the instructions for the download or cloud version you choose.
Install HEAVY.AI, using the instructions for your platform.
Verify that OmniSci is working correctly by following the instructions in the Checkpoint section at the end of the installation instructions for your platform (for example, the Checkpoint section for the CentOS CPU with Tarball installation).
You can create additional HEAVY.AI users to collaborate with.
Connect to Immerse using a web browser connected to your host machine on port 6273. For example, http://heavyai.mycompany.com:6273
.
Open the Immerse SQL Editor.
Use the CREATE USER command to create a new user. For information on syntax and options, see the CREATE USER documentation.
You can download HEAVY.AI for your preferred platform from the HEAVY.AI downloads page.
The CPU (no GPUs) install does not support backend rendering. For example, Pointmap and Scatterplot charts are not available. The GPU install supports all chart types.
The Open Source options do not require a license, and do not include Heavy Immerse.
For information about the HeavyRF radio frequency propagation simulation and HeavyRF table functions, see the HeavyRF documentation.
For information about importing data, see the data import documentation.
See the tables below for examples.
Operator
Description
+ numeric
Returns numeric
- numeric
Returns negative value of numeric
numeric1 + numeric2
Sum of numeric1 and numeric2
numeric1 - numeric2
Difference of numeric1 and numeric2
numeric1 * numeric2
Product of numeric1 and numeric2
numeric1 / numeric2
Quotient (numeric1 divided by numeric2)
Operator
Description
=
Equals
<>
Not equals
>
Greater than
>=
Greater than or equal to
<
Less than
<=
Less than or equal to
BETWEEN x AND y
Is a value within a range
NOT BETWEEN x AND y
Is a value not within a range
IS NULL
Is a value that is null
IS NOT NULL
Is a value that is not null
NULLIF(x, y)
Compares expressions x and y. If they differ, returns x; if they are the same, returns null. For example, if a dataset uses 'NA' for null values, you can return null using SELECT NULLIF(field_name, 'NA').
IS TRUE
True if a value resolves to TRUE.
IS NOT TRUE
True if a value resolves to FALSE.
Function
Description
ABS(x)
Returns the absolute value of x
CEIL(x)
Returns the smallest integer not less than the argument
DEGREES(x)
Converts radians to degrees
EXP(x)
Returns the value of e to the power of x
FLOOR(x)
Returns the largest integer not greater than the argument
LN(x)
Returns the natural logarithm of x
LOG(x)
Returns the natural logarithm of x
LOG10(x)
Returns the base-10 logarithm of the specified float expression x
MOD(x, y)
Returns the remainder of int x divided by int y
PI()
Returns the value of pi
POWER(x, y)
Returns the value of x raised to the power of y
RADIANS(x)
Converts degrees to radians
ROUND(x)
Rounds x to the nearest integer value, but does not change the data type. For example, the double value 4.1 rounds to the double value 4.
ROUND_TO_DIGIT(x, y)
Rounds x to y decimal places
SIGN(x)
Returns the sign of x as -1, 0, or 1 if x is negative, zero, or positive
SQRT(x)
Returns the square root of x.
TRUNCATE(x, y)
Truncates x to y decimal places
WIDTH_BUCKET(target, lower-boundary, upper-boundary, bucket-count)
Defines equal-width intervals (buckets) in a range between the lower boundary and the upper boundary, and returns the bucket number to which the target expression is assigned.
target
- A constant, column variable, or general expression for which a bucket number is returned.
lower-boundary
- Lower boundary for the range of values to be partitioned equally.
upper-boundary
- Upper boundary for the range of values to be partitioned equally.
bucket-count
- Number of equal-width buckets in the range defined by the lower and upper boundaries.
Expressions can be constants, column variables, or general expressions.
Example: Create 10 age buckets of equal size, with lower bound 0 and upper bound 100 ([0,10], [10,20] ... [90,100]), and classify the age of a customer accordingly:
SELECT WIDTH_BUCKET(age, 0, 100, 10) FROM customer;
For example, a customer of age 34 is assigned to bucket 3 ([30,40]) and the function returns the value 3.
Function
Description
ACOS(x)
Returns the arc cosine of x
ASIN(x)
Returns the arc sine of x
ATAN(x)
Returns the arc tangent of x
ATAN2(y, x)
Returns the arc tangent of (x, y) in the range (-π, π]. Equal to ATAN(y/x) for x > 0.
COS(x)
Returns the cosine of x
COT(x)
Returns the cotangent of x
SIN(x)
Returns the sine of x
TAN(x)
Returns the tangent of x
Function
Description
DISTANCE_IN_METERS(fromLon, fromLat, toLon, toLat)
Calculates distance in meters between two WGS84 positions.
CONV_4326_900913_X(x)
Converts a WGS84 longitude to a WGS84 Web Mercator x coordinate.
CONV_4326_900913_Y(y)
Converts a WGS84 latitude to a WGS84 Web Mercator y coordinate.
Function
Description
BASE64_DECODE(
str
)
Decodes a BASE64-encoded string.
BASE64_ENCODE(
str
)
Encodes a string to a BASE64-encoded string.
CHAR_LENGTH(
str
)
Returns the number of characters in a string. Only works with unencoded fields (ENCODING set to none
).
str1
|| str2
[ || str3
... ]
Returns the string that results from concatenating the strings specified. Note that numeric, date, timestamp, and time types will be implicitly casted to strings as necessary, so explicit casts of non-string types to string types is not required for inputs to the concatenation operator.
Note that concatenating a variable string with a string literal (for example, county_name || ' County') is significantly more performant than concatenating two or more variable strings (for example, county_name || ', ' || state_name). Hence, for multi-variable string concatenation, it is recommended to use an UPDATE statement to materialize the concatenated output rather than performing the concatenation inline when such operations are expected to be routinely repeated.
ENCODE_TEXT(
none_encoded_str
)
Converts a none-encoded string to a transient dictionary-encoded string to allow for operations like group-by on top. When the watchdog is enabled, the number of strings that can be casted using this operator is capped by the value set with the watchdog-none-encoded-string-translation-limit
flag (1,000,000 by default).
INITCAP(
str
)
Returns the string with initial caps after any of the defined delimiter characters, with the remainder of the characters lowercased. Valid delimiter characters are: ! ? @ " ^ # $ & ~ _ , . : ; + - * % / | \ [ ] ( ) { } < >
JSON_VALUE(
json_str, path
)
Returns the string of the field given by path in json_str. Paths start with the $
character, with sub-fields split by .
and array members indexed by []
, with array indices starting at 0. For example, JSON_VALUE('{"name": "Brenda", "scores": [89, 98, 94]}', '$.scores[1]')
would yield a TEXT return field of '98'
.
Note that currently LAX
parsing mode (any unmatched path returns null rather than errors) is the default, and STRICT
parsing mode is not supported.
KEY_FOR_STRING(
str
)
Returns the dictionary key of a dictionary-encoded string column.
LCASE(
str
)
Returns the string in all lower case. Only ASCII character set is currently supported. Same as LOWER
.
LEFT(
str, num
)
Returns the left-most number (num
) of characters in the string (str
).
LENGTH(
str
)
Returns the length of a string in bytes. Only works with unencoded fields (ENCODING set to none
).
LOWER(
str
)
Returns the string in all lower case. Only ASCII character set is currently supported. Same as LCASE
.
LPAD(
str
,
len
, [
lpad_str
])
Left-pads the string with the string defined in lpad_str
to a total length of len
. If the optional lpad_str
is not specified, the space character is used to pad.
If the length of str
is greater than len
, then characters from the end of str
are truncated to the length of len
.
Characters are added from lpad_str
successively until the target length len
is met. If lpad_str
concatenated with str
is not long enough to equal the target len
, lpad_str
is repeated, partially if necessary, until the target length is met.
LTRIM(
str
,
chars
)
Removes any leading characters specified in chars
from the string. Alias for TRIM
.
OVERLAY(
str
PLACING
replacement_str
FROM
start
[FOR
len
])
Replaces in str
the number of characters defined in len
with characters defined in replacement_str
at the location start
.
Regardless of the length of replacement_str
, len
characters are removed from str
unless start
+ replacement_str
is greater than the length of str
, in which case all characters from start
to the end of str
are replaced.
If start
is negative, it specifies the number of characters from the end of str
.
POSITION (
search_str
IN
str
[FROM
start_position
])
Returns the position of the first character in search_str
if found in str
, optionally starting the search at start_position
.
If search_str
is not found, 0 is returned. If search_str
or str
are null, null is returned.
REGEXP_REPLACE(
str
,
pattern
[,
new_str
,
position
,
occurrence
, [
flags
]])
Replace one or all matches of a substring in string str
that matches pattern
, which is a regular expression in POSIX regex syntax.
new_str
(optional) is the string that replaces the string matching the pattern. If new_str
is empty or not supplied, all found matches are removed.
The occurrence
integer argument (optional) specifies the single match occurrence of the pattern to replace, starting from the beginning of str
; 0 (replace all) is the default. Use a negative occurrence
argument to signify the nth-to-last occurrence to be replaced.
pattern
uses POSIX regular expression syntax.
Use a positive position
argument to indicate the number of characters from the beginning of str
. Use a negative position
argument to indicate the number of characters from the end of str
.
Back-references/capture groups can be used to capture and replace specific sub-expressions.
Use the following optional flags
to control the matching behavior:
c
- Case-sensitive matching.
i
- Case-insensitive matching.
If not specified, REGEXP_REPLACE defaults to case sensitive search.
REGEXP_SUBSTR(
str
,
pattern
[,
position
,
occurrence
,
flags
, group_num
])
Search string str
for pattern
, which is a regular expression in POSIX syntax, and return the matching substring.
Use position
to set the character position to begin searching. Use occurrence
to specify the occurrence of the pattern to match.
Use a positive position
argument to indicate the number of characters from the beginning of str
. Use a negative position
argument to indicate the number of characters from the end of str
.
The occurrence
integer argument (optional) specifies the single match occurrence of the pattern to replace, with 0 being mapped to the first (1) occurrence. Use a negative occurrence
argument to signify the nth-to-last group in pattern
is returned.
Use optional flags
to control the matching behavior:
c
- Case-sensitive matching.
e
- Extract submatches.
i
- Case-insensitive matching.
The c
and i
flags cannot be used together; e
can be used with either. If neither c
nor i
are specified, or if pattern
is not provided, REGEXP_SUBSTR defaults to case-sensitive search.
If the e
flag is used, REGEXP_SUBSTR returns the capture group group_num
of pattern
matched in str
. If the e
flag is used, but no capture groups are provided in pattern
, REGEXP_SUBSTR returns the entire matching pattern
, regardless of group_num
. If the e flag is used but no group_num
is provided, a value of 1 for group_num
is assumed, so the first capture group is returned.
REPEAT(
str
,
num
)
Repeats the string the number of times defined in num
.
REPLACE(
str
,
from_str
,
new_str
)
Replaces all occurrences of substring from_str
within a string, with a new substring new_str
.
REVERSE(
str
)
Reverses the string.
RIGHT(
str, num
)
Returns the right-most number (num
) of characters in the string (str
).
RPAD(
str
,
len
,
rpad_str
)
Right-pads the string with the string defined in rpad_str
to a total length of len
. If the optional rpad_str
is not specified, the space character is used to pad.
If the length of str
is greater than len
, then characters from the beginning of str
are truncated to the length of len
.
Characters are added from rpad_str
successively until the target length len
is met. If rpad_str
concatenated with str
is not long enough to equal the target len
, rpad_str
is repeated, partially if necessary, until the target length is met.
RTRIM(
str
)
Removes any trailing spaces from the string.
SPLIT_PART(
str
,
delim
,
field_num
)
Split the string based on a delimiter delim
and return the field identified by field_num
. Fields are numbered from left to right.
STRTOK_TO_ARRAY(
str
, [
delim
])
Tokenizes the string str
using optional delimiter(s) delim
and returns an array of tokens.
An empty array is returned if no tokens are produced in tokenization. NULL is returned if either parameter is a NULL.
SUBSTR(
str
,
start
, [
len
])
Alias for SUBSTRING
.
SUBSTRING(
str FROM
start [ FOR
len
])
Returns a substring of str
starting at index start
for len
characters.
The start position is 1-based (that is, the first character of str
is at index 1, not 0). However, start
0 aliases to start
1.
If start
is negative, it is considered to be |start|
characters from the end of the string.
If len
is not specified, then the substring from start
to the end of str
is returned.
If start
+ len
is greater than the length of str
, then the characters in str
from start
to the end of the string are returned.
TRIM([BOTH | LEADING | TRAILING] [
trim_str
FROM
str
])
Removes characters defined in trim_str
from the beginning, end, or both of str
. If trim_str
is not specified, the space character is the default.
If the trim location is not specified, defined characters are trimmed from both the beginning and end of str
.
TRY_CAST( str AS type)
Attempts to cast/convert a string type to any valid numeric, timestamp, date, or time type. If the conversion cannot be performed, null is returned.
Note that TRY_CAST
is not valid for non-string input types.
UCASE(
str
)
Returns the string in uppercase format. Only ASCII character set is currently supported. Same as UPPER
.
UPPER(
str
)
Returns the string in uppercase format. Only ASCII character set is currently supported. Same as UCASE
.
Name
Example
Description
str
LIKE
pattern
'ab' LIKE 'ab'
Returns true if the string matches the pattern (case-sensitive)
str
NOT LIKE
pattern
'ab' NOT LIKE 'cd'
Returns true if the string does not match the pattern
str
ILIKE
pattern
'AB' ILIKE 'ab'
Returns true if the string matches the pattern (case-insensitive). Supported only when the right side is a string literal; for example, colors.name ILIKE 'b%'
str
REGEXP
POSIX pattern
'^[a-z]+r$'
Lowercase string ending with r
REGEXP_LIKE (
str
,
POSIX pattern
)
'^[hc]at'
cat or hat
Function
Description
CURRENT_DATE
CURRENT_DATE()
Returns the current date in the GMT time zone.
Example:
SELECT CURRENT_DATE();
CURRENT_TIME
CURRENT_TIME()
Returns the current time of day in the GMT time zone.
Example:
SELECT CURRENT_TIME();
CURRENT_TIMESTAMP
CURRENT_TIMESTAMP()
Return the current timestamp in the GMT time zone. Same as NOW()
.
Example:
SELECT CURRENT_TIMESTAMP();
DATEADD(
'date_part'
,
interval
,
date
|
timestamp
)
Returns a date after a specified time/date interval has been added.
Example:
SELECT DATEADD('MINUTE', 6000, dep_timestamp) Arrival_Estimate FROM flights_2008_10k LIMIT 10;
DATEDIFF(
'date_part'
,
date
,
date
)
Returns the difference between two dates, calculated to the lowest level of the date_part you specify. For example, if you set the date_part as DAY, only the year, month, and day are used to calculate the result. Other fields, such as hour and minute, are ignored.
Example:
SELECT DATEDIFF('YEAR', plane_issue_date, now()) Years_In_Service FROM flights_2008_10k LIMIT 10;
DATEPART(
'interval'
,
date
|
timestamp
)
Returns a specified part of a given date or timestamp as an integer value. Note that 'interval' must be enclosed in single quotes.
Example:
SELECT DATEPART('YEAR', plane_issue_date) Year_Issued FROM flights_2008_10k LIMIT 10;
DATE_TRUNC(
date_part
,
timestamp
)
Truncates the timestamp to the specified date_part. DATE_TRUNC(week,...)
starts on Monday (ISO), which is different than EXTRACT(dow,...)
, which starts on Sunday.
Example:
SELECT DATE_TRUNC(MINUTE, arr_timestamp) Arrival FROM flights_2008_10k LIMIT 10;
EXTRACT(
date_part
FROM
timestamp
)
Returns the specified date_part from timestamp.
Example:
SELECT EXTRACT(HOUR FROM arr_timestamp) Arrival_Hour FROM flights_2008_10k LIMIT 10;
INTERVAL
'count'
date_part
Adds or subtracts count date_part units from a timestamp. Note that 'count' is enclosed in single quotes.
Example:
SELECT arr_timestamp + INTERVAL '10' YEAR FROM flights_2008_10k LIMIT 10;
NOW()
Return the current timestamp in the GMT time zone. Same as CURRENT_TIMESTAMP().
Example:
NOW();
TIMESTAMPADD(
date_part
,
count
,
timestamp
|
date
)
Adds an interval of count date_part units to the provided timestamp or date and returns the result as a timestamp or date.
Example:
SELECT TIMESTAMPADD(DAY, 14, arr_timestamp) Fortnight FROM flights_2008_10k LIMIT 10;
TIMESTAMPDIFF(
date_part
,
timestamp1
,
timestamp2
)
Subtracts timestamp1 from timestamp2 and returns the result in signed date_part units.
Example:
SELECT TIMESTAMPDIFF(MINUTE, arr_timestamp, dep_timestamp) Flight_Time FROM flights_2008_10k LIMIT 10;
Datatype
Formats
Examples
DATE
YYYY-MM-DD
2013-10-31
DATE
MM/DD/YYYY
10/31/2013
DATE
DD-MON-YY
31-Oct-13
DATE
DD/Mon/YYYY
31/Oct/2013
EPOCH
1383262225
TIME
HH:MM
23:49
TIME
HHMMSS
234901
TIME
HH:MM:SS
23:49:01
TIMESTAMP
DATE TIME
31-Oct-13 23:49:01
TIMESTAMP
DATETTIME
31-Oct-13T23:49:01
TIMESTAMP
DATE:TIME
11/31/2013:234901
TIMESTAMP
DATE TIME ZONE
31-Oct-13 11:30:25 -0800
TIMESTAMP
DATE HH.MM.SS PM
31-Oct-13 11.30.25pm
TIMESTAMP
DATE HH:MM:SS PM
31-Oct-13 11:30:25pm
TIMESTAMP
1383262225
Double-precision FP Function
Single-precision FP Function
Description
AVG(
x
)
Returns the average value of x
COUNT()
Returns the count of the number of rows returned
COUNT(DISTINCT
x
)
Returns the count of distinct values of x
APPROX_COUNT_DISTINCT(
x
,
e
)
Returns the approximate count of distinct values of x with defined expected error rate e, where e is an integer from 1 to 100. If no value is set for e, the approximate count is calculated using the system-wide hll-precision-bits
configuration parameter.
APPROX_MEDIAN(
x
)
Returns the approximate median of x. Two server configuration parameters affect memory usage:
approx_quantile_centroids and approx_quantile_buffer
Accuracy of APPROX_MEDIAN depends on the distribution of data; see Usage Notes.
APPROX_PERCENTILE(
x
,
y
)
Returns the approximate quantile of x
, where y
is the value between 0 and 1.
For example, y=0
returns MIN(x)
, y=1
returns MAX(x)
, and y=0.5
returns APPROX_MEDIAN(x)
.
MAX(
x
)
Returns the maximum value of x
MIN(
x
)
Returns the minimum value of x
SINGLE_VALUE
Returns the input value if there is only one distinct value in the input; otherwise, the query fails.
SUM(
x
)
Returns the sum of the values of x
SAMPLE(
x
)
Returns one sample value from aggregated column x. For example, the following query returns population grouped by city, along with one value from the state column for each group:
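A minimal sketch (table and column names are assumptions):
SELECT city, SAMPLE(state), SUM(population) AS total_pop FROM census GROUP BY city;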
Note: This was previously LAST_SAMPLE
, which is now deprecated.
CORRELATION(x, y)
CORRELATION_FLOAT(x, y)
Alias of CORR. Returns the coefficient of correlation of a set of number pairs.
CORR(x, y)
CORR_FLOAT(x, y)
Returns the coefficient of correlation of a set of number pairs.
COUNT_IF(conditional_expr)
Returns the number of rows satisfying the given conditional_expr
.
COVAR_POP(x, y)
COVAR_POP_FLOAT(x, y)
Returns the population covariance of a set of number pairs.
COVAR_SAMP(x, y)
COVAR_SAMP_FLOAT(x, y)
Returns the sample covariance of a set of number pairs.
STDDEV(x)
STDDEV_FLOAT(x)
Alias of STDDEV_SAMP. Returns sample standard deviation of the value.
STDDEV_POP(x)
STDDEV_POP_FLOAT(x)
Returns the population standard deviation of the value.
STDDEV_SAMP(x)
STDDEV_SAMP_FLOAT(x)
Returns the sample standard deviation of the value.
SUM_IF(conditional_expr)
Returns the sum of all expression values satisfying the given conditional_expr
.
VARIANCE(x)
VARIANCE_FLOAT(x)
Alias of VAR_SAMP. Returns the sample variance of the value.
VAR_POP(x)
VAR_POP_FLOAT(x)
Returns the population variance of the value.
VAR_SAMP(x)
VAR_SAMP_FLOAT(x)
Returns the sample variance of the value.
Function
Description
SAMPLE_RATIO(
x
)
Returns a Boolean value, with the probability of True
being returned for a row equal to the input argument. The input argument is a numeric value between 0.0 and 1.0. Negative input values return False, input values greater than 1.0 return True, and null input values return False.
The result of the function is deterministic per row; that is, all calls of the operator for a given row return the same result. The sample ratio is probabilistic, but is generally within a thousandth of a percentile of the actual range when the underlying dataset is millions of records or larger.
The following example filters approximately 50% of the rows from t
and returns a count that is approximately half the number of rows in t
:
SELECT COUNT(*) FROM t WHERE SAMPLE_RATIO(0.5)
Expression
Example
Description
CAST(expr AS type
)
CAST(1.25 AS FLOAT)
Converts an expression to another data type. For conversions from a TEXT type, use TRY_CAST.
TRY_CAST(text_expr AS type
)
TRY_CAST('1.25' AS FLOAT)
Converts a text to a non-text type, returning null if the conversion could not be successfully performed.
ENCODE_TEXT(none_encoded_str
)
ENCODE_TEXT(long_str)
Converts a none-encoded text type to a dictionary-encoded text type.
FROM/TO: | TINYINT | SMALLINT | INTEGER | BIGINT | FLOAT | DOUBLE | DECIMAL | TEXT | BOOLEAN | DATE | TIME | TIMESTAMP
TINYINT | - | Yes | Yes | Yes | Yes | Yes | Yes | Yes | No | No | No | n/a
SMALLINT | Yes | - | Yes | Yes | Yes | Yes | Yes | Yes | No | No | No | n/a
INTEGER | Yes | Yes | - | Yes | Yes | Yes | Yes | Yes | Yes | No | No | No
BIGINT | Yes | Yes | Yes | - | Yes | Yes | Yes | Yes | No | No | No | No
FLOAT | Yes | Yes | Yes | Yes | - | Yes | No | Yes | No | No | No | No
DOUBLE | Yes | Yes | Yes | Yes | Yes | - | No | Yes | No | No | No | n/a
DECIMAL | Yes | Yes | Yes | Yes | Yes | Yes | - | Yes | No | No | No | n/a
TEXT | Yes (Use TRY_CAST) | Yes (Use TRY_CAST) | Yes (Use TRY_CAST) | Yes (Use TRY_CAST) | Yes (Use TRY_CAST) | Yes (Use TRY_CAST) | Yes (Use TRY_CAST) | - | Yes (Use TRY_CAST) | Yes (Use TRY_CAST) | Yes (Use TRY_CAST) | Yes (Use TRY_CAST)
BOOLEAN | No | No | Yes | No | No | No | No | Yes | - | n/a | n/a | n/a
DATE | No | No | No | No | No | No | No | Yes | n/a | - | No | Yes
TIME | No | No | No | No | No | No | No | Yes | n/a | No | - | n/a
TIMESTAMP | No | No | No | No | No | No | No | Yes | n/a | Yes | No | -
Generates random string data.
Generates a series of integer values.
Generates a series of timestamp values from start_timestamp
to end_timestamp
.
Given a query input with entity keys and timestamps, and parameters specifying the minimum session time, the minimum number of session records, and the max inactive seconds, outputs all unique sessions found in the data with the duration of the session.
Given a query input of entity keys/IDs, a set of feature columns, and a metric column, scores each pair of entities based on their similarity. The score is computed as the cosine similarity of the feature column(s) between each entity pair, which can optionally be TF/IDF weighted.
Given a query input of entity keys, feature columns, and a metric column, and a second query input specifying a search vector of feature columns and metric, computes the similarity of each entity in the first input to the search vector based on their similarity. The score is computed as the cosine similarity of the feature column(s) for each entity with the feature column(s) for the search vector, which can optionally be TF/IDF weighted.
Aggregates point data into x/y bins of a given size in meters to form a dense spatial grid, with the value for each bin computed by applying the aggregate specified by agg_type to the z values of all points in the bin. Allowed aggregate types are AVG, COUNT, SUM, MIN, and MAX.
Similar to tf_geo_rasterize
, but also computes the slope and aspect per output bin. Aggregates point data into x/y bins of a given size in meters to form a dense spatial grid, computing the specified aggregate (using agg_type
) across all points in each bin as the output value for the bin.
Given a distance-weighted directed graph, specified as a query CURSOR input consisting of the starting and ending node for each edge and a distance, and a specified origin and destination node, computes the shortest distance-weighted path through the graph between origin_node and destination_node.
Given a distance-weighted directed graph, specified as a query CURSOR input consisting of the starting and ending node for each edge and a distance, and a specified origin node, computes the shortest distance-weighted path distance between the origin_node and every other node in the graph.
Loads one or more las
or laz
point cloud/LiDAR files from a local file or directory source, optionally transforming the output SRID to out_srs. If not specified, output points are automatically transformed to EPSG:4326 lon/lat pairs.
Computes the Mandelbrot set over the complex domain [x_min
, x_max
), [y_min
, y_max
), discretizing the xy-space into an output of dimensions x_pixels
X y_pixels
.
Returns metadata for one or more las
or laz
point cloud/LiDAR files from a local file or directory source, optionally constraining the bounding box for metadata retrieved to the lon/lat bounding box specified by the x_min
, x_max
, y_min
, y_max
arguments.
Process a raster input to derive contour lines or regions and output as LINESTRING or POLYGON for rendering or further processing.
Aggregate point data into x/y bins of a given size in meters to form a dense spatial grid, computing the specified aggregate (using agg_type
) across all points in each bin as the output value for the bin.
Used for generating top-k signals where 'k' represents the maximum number of antennas to consider at each geographic location. The full relevant parameter name is strongest_k_sources_per_terrain_bin.
Taking a set of point elevations and a set of signal source locations as input, tf_rf_prop_max_signal
executes line-of-sight 2.5D RF signal propagation from the provided sources over a binned 2.5D elevation grid derived from the provided point locations, calculating the max signal in dBm at each grid cell, using the formula for free-space power loss.
Function
Description
ST_Centroid
Computes the geometric center of a geometry as a POINT.
ST_GeomFromText(WKT)
Return a specified geometry value from Well-known Text representation.
ST_GeomFromText(WKT, SRID)
Return a specified geometry value from Well-known Text representation and an SRID.
ST_GeogFromText(WKT)
Return a specified geography value from Well-known Text representation.
ST_GeogFromText(WKT, SRID)
Return a specified geography value from Well-known Text representation and an SRID.
ST_Point(double lon, double lat)
Return a point constructed on the fly from the provided coordinate values. Constant coordinates result in construction of a POINT literal.
Example: ST_Contains(poly4326, ST_SetSRID(ST_Point(lon, lat), 4326))
ST_Buffer
Returns a geometry covering all points within a specified distance from the input geometry. Performed by the GEOS module. The output is currently limited to the MULTIPOLYGON type.
Calculations are in the units of the input geometry’s SRID. Buffer distance is expressed in the same units. Example:
SELECT ST_Buffer('LINESTRING(0 0, 10 0, 10 10)', 1.0);
Special processing is automatically applied to WGS84 input geometries (SRID=4326) to limit buffer distortion:
Implementation first determines the best planar SRID to which to project the 4326 input geometry.
Preferred SRIDs are UTM and Lambert (LAEA) North/South zones, with Mercator used as a fallback.
Buffer distance is interpreted as distance in meters (units of all planar SRIDs being considered).
The input geometry is transformed to the best planar SRID and handed to GEOS, along with buffer distance.
The buffer geometry built by GEOS is then transformed back to SRID=4326 and returned.
Example: Build 10-meter buffer geometries (SRID=4326) with limited distortion:
SELECT ST_Buffer(poly4326, 10.0) FROM tbl;
Function
Description
ST_TRANSFORM
Returns a geometry with its coordinates transformed to a different spatial reference. Currently, WGS84 to Web Mercator transform is supported. For example:
ST_DISTANCE(
ST_TRANSFORM(ST_GeomFromText('POINT(-71.064544 42.28787)', 4326), 900913),
ST_GeomFromText('POINT(-13189665.9329505 3960189.38265416)', 900913)
)
ST_TRANSFORM
is not currently supported in projections. It can be used only to transform geo inputs to other functions, such as ST_DISTANCE.
ST_SETSRID
Set the SRID to a specific integer value. For example:
ST_TRANSFORM(
ST_SETSRID(ST_GeomFromText('POINT(-71.064544 42.28787)'), 4326), 900913 )
Function
Description
ST_X
Returns the X value from a POINT column.
ST_Y
Returns the Y value from a POINT column.
ST_XMIN
Returns X minima of a geometry.
ST_XMAX
Returns X maxima of a geometry.
ST_YMIN
Returns Y minima of a geometry.
ST_YMAX
Returns Y maxima of a geometry.
ST_STARTPOINT
Returns the first point of a LINESTRING as a POINT.
ST_ENDPOINT
Returns the last point of a LINESTRING as a POINT.
ST_POINTN
Return the Nth point of a LINESTRING as a POINT.
ST_NPOINTS
Returns the number of points in a geometry.
ST_NRINGS
Returns the number of rings in a POLYGON or a MULTIPOLYGON.
ST_SRID
Returns the spatial reference identifier for the underlying object.
ST_NUMGEOMETRIES
Returns the MULTI count of MULTIPOINT, MULTILINESTRING or MULTIPOLYGON. Returns 1 for non-MULTI geometry.
Function
Description
ST_INTERSECTION
Returns a geometry representing an intersection of two geometries; that is, the section that is shared between the two input geometries. Performed by the GEOS module.
The output is currently limited to MULTIPOLYGON type, because HEAVY.AI does not support mixed geometry types within a geometry column, and ST_INTERSECTION
can potentially return points, lines, and polygons from a single intersection operation.
Lower-dimension intersecting features such as points and line strings are returned as very small buffers around those features. If needed, true points can be recovered by applying the ST_CENTROID method to point intersection results. In addition, ST_PERIMETER/2 of resulting line intersection polygons can be used to approximate line length.
Empty/NULL geometry outputs are not currently supported.
Examples:
SELECT ST_Intersection('POLYGON((0 0,3 0,3 3,0 3))', 'POLYGON((1 1,4 1,4 4,1 4))');
SELECT ST_Area(ST_Intersection(poly, 'POLYGON((1 1,3 1,3 3,1 3,1 1))')) FROM tbl;
ST_DIFFERENCE
Returns a geometry representing the portion of the first input geometry that does not intersect with the second input geometry. Performed by the GEOS module. Input order is important; the return geometry is always a section of the first input geometry.
The output is currently limited to MULTIPOLYGON type, for the same reasons described in ST_INTERSECTION
. Similar post-processing methods can be applied if needed.
Empty/NULL geometry outputs are not currently supported.
Examples:
SELECT ST_Difference('POLYGON((0 0,3 0,3 3,0 3))', 'POLYGON((1 1,4 1,4 4,1 4))');
SELECT ST_Area(ST_Difference(poly, 'POLYGON((1 1,3 1,3 3,1 3,1 1))')) FROM tbl;
ST_UNION
Returns a geometry representing the union (or combination) of the two input geometries. Performed by the GEOS module.
The output is currently limited to MULTIPOLYGON type for the same reasons described in ST_INTERSECTION
. Similar post-processing methods can be applied if needed.
Empty/NULL geometry outputs are not currently supported.
Examples:
SELECT ST_UNION('POLYGON((0 0,3 0,3 3,0 3))', 'POLYGON((1 1,4 1,4 4,1 4))');
SELECT ST_AREA(ST_UNION(poly, 'POLYGON((1 1,3 1,3 3,1 3,1 1))')) FROM tbl;
Function
Description
ST_DISTANCE
Returns shortest planar distance between geometries. For example:
ST_DISTANCE(poly1, ST_GeomFromText('POINT(0 0)'))
Returns shortest geodesic distance between two points, in meters, if given two point geographies. Point geographies can be specified through casts from point geometries or as literals. For example:
ST_DISTANCE(
CastToGeography(p2),
ST_GeogFromText('POINT(2.5559 49.0083)', 4326)
)
SELECT a.name,
ST_DISTANCE(
CAST(a.pt AS GEOGRAPHY),
CAST(b.pt AS GEOGRAPHY)
) AS dist_meters
FROM starting_point a, destination_points b;
You can also calculate the distance between a POLYGON and a POINT. If both fields use SRID 4326, then the calculated distance is in 4326 units (degrees). If both fields use SRID 4326, and both are transformed into 900913, then the results are in 900913 units (meters).
The following SQL code returns the names of polygons where the distance between the point and polygon is less than 1,000 meters.
SELECT a.poly_name FROM poly a, point b WHERE ST_DISTANCE(
ST_TRANSFORM(b.location,900913),
ST_TRANSFORM(a.heavyai_geo,900913)
) < 1000;
ST_EQUALS
Returns TRUE if the first input geometry and the second input geometry are spatially equal; that is, they occupy the same space. Different orderings of points can be accepted as equal if they represent the same geometry structure.
POINTs comparison is performed natively. All other geometry comparisons are performed by GEOS.
If input geometries are both uncompressed or compressed, all comparisons to identify equality are precise. For mixed combinations, the comparisons are performed with a compression-specific tolerance that allows recognition of equality despite subtle precision losses that the compression may introduce. Note: Geo columns and literals with SRID=4326
are compressed by default.
Examples:
SELECT COUNT(*) FROM tbl WHERE ST_EQUALS('POINT(2 2)', pt);
SELECT ST_EQUALS('POLYGON ((0 0,1 0,0 1))', 'POLYGON ((0 0,0 0.5,0 1,1 0,0 0))');
ST_MAXDISTANCE
Returns the longest planar distance between geometries. In effect, this is the diameter of a circle that encloses both geometries. For example:
Currently supported variants:
ST_CONTAINS
Returns true if the first geometry object contains the second object. For example:
You can also use ST_CONTAINS
to:
Return the count of polys that contain the point (here as WKT):
SELECT count(*) FROM geo1 WHERE ST_CONTAINS(poly1, 'POINT(0 0)');
Return names from a polys table that contain points in a points table:
SELECT a.name FROM polys a, points b WHERE ST_CONTAINS(a.heavyai_geo, b.location);
Return names from a polys table that contain points in a points table, using a single point in WKT instead of a field in another table:
SELECT name FROM poly WHERE ST_CONTAINS(
heavyai_geo, ST_GeomFromText('POINT(-98.4886935 29.4260508)', 4326)
);
ST_INTERSECTS
Returns true if two geometries intersect spatially, false if they do not share space. For example:
SELECT ST_INTERSECTS(
'POLYGON((0 0, 2 0, 2 2, 0 2, 0 0))',
'POINT(1 1)'
) FROM tbl;
ST_AREA
Returns the area of planar areas covered by POLYGON and MULTIPOLYGON geometries. For example:
SELECT ST_AREA(
'POLYGON((1 0, 0 1, -1 0, 0 -1, 1 0),(0.1 0, 0 0.1, -0.1 0, 0 -0.1, 0.1 0))'
) FROM tbl;
ST_AREA
does not support calculation of geographic areas, but rather uses planar coordinates. Geographies must first be projected in order to use ST_AREA
. You can do this ahead of time before import or at runtime, ideally using an equal area projection (for example, a national equal-area Lambert projection). The area is calculated in the projection's units. For example, you might use Web Mercator runtime projection to get the area of a polygon in square meters:
ST_AREA(
ST_TRANSFORM(
ST_GeomFromText(
'POLYGON((-76.6168198439371 39.9703199555959,
-80.5189990254673 40.6493554919257,
-82.5189990254673 42.6493554919257,
-76.6168198439371 39.9703199555959)
)', 4326
),
900913)
)
Web Mercator is not an equal area projection, however. Unless compensated by a scaling factor, Web Mercator areas can vary considerably by latitude.
ST_PERIMETER
Returns the cartesian perimeter of POLYGON and MULTIPOLYGON geometries. For example:
SELECT ST_PERIMETER('POLYGON(
(1 0, 0 1, -1 0, 0 -1, 1 0),
(0.1 0, 0 0.1, -0.1 0, 0 -0.1, 0.1 0)
)'
)
from tbl;
It also returns the geodesic perimeter of POLYGON and MULTIPOLYGON geographies. For example:
SELECT ST_PERIMETER(
ST_GeogFromText(
'POLYGON(
(-76.6168198439371 39.9703199555959,
-80.5189990254673 40.6493554919257,
-82.5189990254673 42.6493554919257,
-76.6168198439371 39.9703199555959)
)',
4326)
)
from tbl;
ST_LENGTH
Returns the cartesian length of LINESTRING geometries. For example:
SELECT ST_LENGTH('LINESTRING(1 0, 0 1, -1 0, 0 -1, 1 0)') FROM tbl;
It also returns the geodesic length of LINESTRING geographies. For example:
SELECT ST_LENGTH(
ST_GeogFromText('LINESTRING(
-76.6168198439371 39.9703199555959,
-80.5189990254673 40.6493554919257,
-82.5189990254673 42.6493554919257)',
4326)
) FROM tbl;
ST_WITHIN
Returns true if geometry A is completely within geometry B. For example the following SELECT
statement returns true:
SELECT ST_WITHIN(
'POLYGON ((1 1, 1 2, 2 2, 2 1))',
'POLYGON ((0 0, 0 3, 3 3, 3 0))'
) FROM tbl;
ST_DWITHIN
Returns true if the geometries are within the specified distance of each one another. Distance is specified in units defined by the spatial reference system of the geometries. For example:
SELECT ST_DWITHIN(
'POINT(1 1)',
'LINESTRING (1 2,10 10,3 3)', 2.0
) FROM tbl;
ST_DWITHIN
supports geodesic distances between geographies, currently limited to geographic points. For example, you can check whether Los Angeles and Paris, specified as WGS84 geographic point literals, are within 10,000km of one another.
SELECT ST_DWITHIN(
ST_GeogFromText(
'POINT(-118.4079 33.9434)', 4326),
ST_GeogFromText('POINT(2.5559 49.0083)',
4326 ),
10000000.0) FROM tbl;
ST_DFULLYWITHIN
Returns true if the geometries are fully within the specified distance of one another. Distance is specified in units defined by the spatial reference system of the geometries. For example:
SELECT ST_DFULLYWITHIN(
'POINT(1 1)',
'LINESTRING (1 2,10 10,3 3)',
10.0) FROM tbl;
This function supports:
ST_DFULLYWITHIN(POINT, LINESTRING, distance)
ST_DFULLYWITHIN(LINESTRING, POINT, distance)
ST_DISJOINT
Returns true if the geometries are spatially disjoint (that is, the geometries do not overlap or touch). For example:
SELECT ST_DISJOINT(
'POINT(1 1)',
'LINESTRING (0 0,3 3)'
) FROM tbl;
<num_strings>
The number of strings to randomly generate.
BIGINT
<string_length>
Length of the generated strings.
BIGINT
id
Integer id of output, starting at 0 and increasing monotonically
Column<BIGINT>
rand_str
Random String
Column<TEXT ENCODING DICT>
Type
Size
Example
LINESTRING
Variable
A sequence of 2 or more points and the lines that connect them. For example: LINESTRING(0 0,1 1,1 2)
MULTIPOLYGON
Variable
A set of one or more polygons. For example:MULTIPOLYGON(((0 0,4 0,4 4,0 4,0 0),(1 1,2 1,2 2,1 2,1 1)), ((-1 -1,-1 -2,-2 -2,-2 -1,-1 -1)))
POINT
Variable
A point described by two coordinates. When the coordinates are longitude and latitude, HEAVY.AI stores longitude first, and then latitude. For example: POINT(0 0)
POLYGON
Variable
A set of one or more rings (closed line strings), with the first representing the shape (external ring) and the rest representing holes in that shape (internal rings). For example: POLYGON((0 0,4 0,4 4,0 4,0 0),(1 1, 2 1, 2 2, 1 2,1 1))
MULTIPOINT
Variable
A set of one or more points. For example: MULTIPOINT((0 0), (1 1), (2 2))
MULTILINESTRING
Variable
A set of one or more associated lines, each of two or more points. For example: MULTILINESTRING((0 0, 1 0, 2 0), (0 1, 1 1, 2 1))
Release notes for currently supported releases
Use of HEAVY.AI is subject to the terms of the HEAVY.AI End User License Agreement (EULA).
The latest release of HEAVY.AI is 6.4.3.
6.4.3 | 6.4.2 | 6.4.1 | 6.4.0 | 6.2.7 | 6.2.5 | 6.2.4 | 6.2.1 | 6.2.0 | 6.1.1 | 6.1.0 | 6.0.0
For release notes for releases that are no longer supported, as well as links to documentation for those releases, see Archived Release Notes.
As with any software upgrade, it is important that you back up your data before you upgrade HEAVY.AI. Each release introduces efficiencies that are not necessarily compatible with earlier releases of the platform. HEAVY.AI is never expected to be backward compatible.
For assistance during the upgrade process, contact HEAVY.AI support at support@heavy.ai before you upgrade your system.
Added feature flag ui/session_create_timeout
with a default value of 10000 (10 seconds) for modifying login request timeout.
Adds the HeavyDB server configuration parameter enable-foreign-table-scheduled-refresh
for enabling or disabling automated foreign table scheduled refreshes.
Fixes a crash that could occur when S3 CSV-backed foreign tables with append refreshes are refreshed multiple times.
Fixes a crash that could occur when foreign tables with geospatial columns are refreshed after cache evictions.
Fixes a crash that could occur when querying foreign tables backed by Parquet files with empty row groups.
Fixes an error that could occur when select queries used in ODBC foreign tables reference case-sensitive column names.
Fixes a crash that could occur when CSV backed foreign tables with geospatial columns are refreshed without updates to the underlying CSV files.
Fixes a crash that could occur in heavysql when executing the \detect command with geospatial files.
Fixes a casting error that could occur when executing left join queries.
Fixes a crash that could occur when accessing the disk cache on HeavyDB servers with the read-only configuration parameter enabled.
Fixes an error that could occur when executing queries that project geospatial columns.
Fixes a crash that could occur when executing the EXTRACT function with the ISODOW date_part
parameter on GPUs.
Fixes an error that could occur when importing CSV or Parquet files with text columns containing more than 32,767 characters into HeavyDB NONE ENCODED text columns.
Fixes a Vulkan Device Lost error that could occur when rendering complex polygon data with thousands of polygons in a single pixel.
Optimizes result set buffer allocations for CPU group by queries.
Enables trimming of white spaces in quoted fields during CSV file imports, when both the trim_spaces
and quoted
options are set.
Fixes an error that could occur when importing CSV files with quoted fields that are surrounded by white spaces.
Fixes a crash that could occur when tables are reordered for range join queries.
Fixes a crash that could occur for join queries with intermediate projections.
Fixes a crash that could occur for queries with geospatial join predicate functions that use literal parameters.
Fixes an issue where queries could intermittently and incorrectly return error responses.
Fixes an issue where queries could return incorrect results when filter push-down through joins is enabled.
Fixes a crash that could occur for queries with join predicates that compare string dictionary encoded and nonencoded text columns.
Fixes an issue where hash table optimizations could ignore the max-cacheable-hashtable-size-bytes
and hashtable-cache-total-bytes
server configuration parameters.
Fixes an issue where sharded table join queries that are executed on multiple GPUs could return incorrect results.
Fixes a crash that could occur when sharded table join queries are executed on multiple GPUs with the from-table-reordering
server configuration parameter enabled.
Multilayer support for Contour and Windbarb charts.
Enable Contour charts by default (feature flag: ui/enable_contour_chart
).
Support custom SQL measures in Contour charts.
Restrict export from Heavy Immerse by enabling trial mode (feature flag: ui/enable_trial_mode
). Trial mode enables a super user to restrict export capabilities for users who have the immerse_trial_mode
role.
Allow MULTILINESTRING to be used in selectors for Linemap charts.
Allow MULTILINESTRING to be used in Immerse SQL Editor.
This release features general availability of data connectors for PostgreSQL, beta Immerse connectors for Snowflake and Redshift, and SQL support for Google BigQuery and Hive (beta). These managed data connections let you use HEAVY.AI as an acceleration platform wherever your source data may live. Scheduling and automated caching ensure that fast analytics are always running on the latest available data.
Immerse features four new chart types: Contour, Cross-section, Wind barb, and Skew-t. While especially useful for atmospheric and geotechnical data visualization, Contour and Cross-section also have more general application.
Major improvements for time series analysis have been added. This includes an Immerse user interface for time series, and a large number of SQL window function additions and performance enhancements.
The release also includes two major architectural improvements:
The ability to perform cross-database queries, both in SQL and in Immerse, increasing flexibility across the board. For example, you can now easily build an Immerse dashboard showing system usage combined with business data. You might also make a read-only database of data shared across a set of users.
Render queries no longer block other GPU queries. In many use cases, renders can be significantly slower than other common queries. This should result in significant performance gains, particularly in map-heavy dashboards.
Adds support for cross-database SELECT, UPDATE, and DELETE queries (see the example following this list).
Support for MODE SQL aggregate.
Add support for strtok_to_array.
Support for ST_NumGeometries().
Support ST_TRANSFORM applied to literal geo types.
Enhanced query tracing ensures all child operations for a query_id are properly logged with that ID.
Adds support for BigQuery and Hive HeavyConnect and import.
Adds support for table restore from S3 archive files.
Improves integer column type detection in Snowflake import/HeavyConnect data preview.
Adds HeavyConnect and import support for Parquet required scalar fields.
Improves import status error message when an invalid request is made.
Support POINT, LINESTRING, and POLYGON input and output types in table functions.
Support default values for scalar table function arguments.
Add tf_raster_contour table function to generate contours given x, y, and z arguments. This function is exposed in Immerse, but has additional capabilities available in SQL, such as supporting floating point contour intervals.
Return file path and file name from tf_point_cloud_metadata table function.
The previous length limit of 32K characters per value for none-encoded text columns has been lifted; none-encoded text values can now be up to 2^31 - 1 characters (approximately 2.1 billion characters).
Support array column outputs from table functions.
Add TEXT ENCODING DICT and Array<TEXT ENCODING DICT> type support for runtime functions/UDFs.
Allow transient TEXT ENCODING DICT column inputs into table functions.
Support COUNT_IF function.
Support SUM_IF function.
Support NTH_VALUE window function.
Support NTH_VALUE_IN_FRAME window function.
Support FIRST_VALUE_IN_FRAME and LAST_VALUE_IN_FRAME window functions.
Support CONDITIONAL_TRUE_EVENT.
Support ForwardFill and BackwardFill window functions to fill in missing (null) values based on previous non-null values in window.
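A minimal sketch of the cross-database support referenced above: a table in another database is addressed by qualifying it with the database name. The sales_db database and orders table here are hypothetical:
SELECT region, SUM(revenue) AS total_revenue
FROM sales_db.orders
GROUP BY region;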
Fixes an issue where databases with duplicate names but different capitalization could be created.
Fixes an issue where raster imports could fail due to inconsistent band names.
Fixes an issue that could occur when DUMP/RESTORE commands were executed concurrently.
Fixes an issue where certain session updates do not occur when licenses are updated.
Fixes an issue where import/HeavyConnect data preview could return unsupported decimal types.
Fixes an issue where import/HeavyConnect data preview for PostgreSQL queries involving variable length columns could result in an error.
Fixes an issue where NULL elements in array columns with the NOT NULL constraint were not projected correctly.
Fixes a crash that could occur in certain scenarios where UPDATE and DELETE queries contain subqueries.
Fixes an issue where ingesting ODBC unsigned SQL_BIGINT into HeavyDB BIGINT columns using HeavyConnect or import could result in storage of incorrect data.
Fixes a crash that could occur in distributed configurations, when switching databases and accessing log based system tables with rolled off logs.
Fixes an error that occurred when importing Parquet files that did not contain statistics metadata.
Ensure query hint is propagated to subqueries.
Fix crash that could occur when LAG_IN_FRAME or LEAD_IN_FRAME were missing order-by or frame clause.
Fix bug where LAST_VALUE window function could return wrong results.
Fix issue where “Cannot use fast path for COUNT DISTINCT” could be reported from a count distinct operation.
Various bug fixes for support of VALUES() clause.
Improve handling of generic input expressions for window aggregate functions.
Fix bug where COUNT(*) and COUNT(1) over window frame could cause crash.
Fix wrong coordinate used for origin_y_bin in tf_raster_graph_shortest_slope_weighted_path.
Speed up table function binding in cases with no ColumnList arguments.
Support arrays of transient encoded strings into table functions.
Render queries no longer block parallel execution queue for other queries.
The Immerse PostgreSQL connector is now generally available, and is joined by public betas of Redshift and Snowflake.
New chart types:
Contour chart. Contours can be applied to any geo point data, but are especially useful when applied to smoothly-varying pressure and elevation data. They can help reveal general patterns even in noisy primary data. Contours can be based on any point data, including that from regular raster grids like a temperature surface, or from sparse points like LiDAR data.
Cross-section chart. As the name suggests, this allows a new view on 2.5D or 3D datasets, where a selected data dimension is plotted on the vertical axis for a slice of geographic data. In addition to looking in profile at parts of the atmosphere in weather modeling, this can also be used to look at geological sections below terrain.
Representing vector force fields takes a step forward with the Wind barb plot. Wind barbs are multidimensional symbols which convey at a glance both strength and direction.
Skew-T is a highly specialized multidimensional chart used primarily by meteorologists. Skew-Ts are heavily used in weather modeling and can help predict, for example, where thunderstorms or dry lightning are likely to occur.
Initial support for window functions in Immerse, enabling time lag analysis in charts. For example, you can now plot month-over-month or quarter-over-quarter sales or web traffic volume.
For categorical data, in addition to supporting aggregations based on the number of unique values, MODE is now supported. This supports the creation of groups based on the most-common value.
Fixed an issue where a restarted server can potentially deadlock if the first two queries are executed at the same time and use different executors.
Fixed an issue where COUNT DISTINCT or APPROX_COUNT_DISTINCT, when run on a CASE statement that outputs literal strings, could cause a crash.
Fixes a crash when using COUNT(*) or COUNT(1) as a window function, for example, COUNT(*) OVER (PARTITION BY x).
Fixes an incorrect result when using a date column as a partition key, like SUM(x) OVER (PARTITION BY DATE_COL).
Improves the performance of window functions when a literal expression is used as one of the input expressions of window functions like LAG(x, 1).
Improves query execution preparation phase by preventing redundant processing of the same nodes, especially when a complex input query is evaluated.
Fixes geometry type checking for range join operator that could cause a crash in some cases.
Resolves an issue where a query with many projection expressions (for example, more than 50 8-byte output expressions) could return an incorrect result when using a window function expression.
Fixes an issue where the Resultset recycler ignores the server configuration size metrics.
Fixes a race condition where multiple catalogs could be created on initialization, resulting in possible deadlocks, server hangs, increased memory pressure, and slow performance.
Fixes a crash encountered during some SQL queries when the read-only setting was enabled.
Fixes an issue in tf_raster_graph_shortest_slope_weighted_path
table function that would lead some inputs to be incorrectly rejected.
In Release 6.2.0, Heavy Immerse adds animation and a control panel system. HeavyConnect now includes connectors for Redshift, Snowflake, and PostGIS. The SQL system is extended with support for casting and time-based window functions. GeoSQL gets direct LiDAR import, multipoints, and multilinestrings, as well as graph network algorithms. Other enhancements include performance improvements and reduced memory requirements across the product.
TRY_CAST support for string to numeric, timestamp, date, and time casts (see the example at the end of this list).
Implicit and explicit CAST support for numeric, timestamp, date, and time to TEXT type.
CAST support from Timestamp(0|3|6|9) types to Time(0) type.
Concat (||) operator now supports multiple nonliteral inputs.
JSON_VALUE operator to extract fields from JSON string columns.
BASE64_ENCODE and BASE64_DECODE operators for BASE64 encoding/decoding of string columns.
POSITION operator to extract index of search string from strings.
Add hash-based count distinct operator to better handle case of sparse columns.
Support MULTILINESTRING OGC geospatial type.
Support MULTIPOINT OGC geospatial type.
Support ST_NumGeometries.
Support ST_ConvexHull and ST_ConcaveHull.
Improved table reordering to maximize invocation of accelerated geo joins.
Support ST_POINT, ST_TRANSFORM and ST_SETSRID as expressions for probing columns in point-to-point distance joins.
Support accelerated overlaps hash join for ST_DWITHIN clause comparing two POINT columns.
Support for POLYGON to MULTIPOLYGON promotion in SQLImporter.
RANGE window function FRAME support for Time, Date, and Timestamp types.
Support LEAD_IN_FRAME / LAG_IN_FRAME window functions that compute LEAD / LAG in reference to a window frame.
Add TextEncodingNone support for scalar UDF and extension functions.
Support array inputs and outputs to table functions.
Support literal interval types for UDTFs.
Add support for range annotations on literal inputs to table functions.
Make max CPU threads configurable via a startup flag.
Support array types for Arrow/select_ipc endpoints.
Add support for query hint to control dynamic watchdog.
Add query hint to control Cuda block and grid size for query.
Adds an echo all
option to heavysql that prints all executed commands and queries.
Improved decimal precision error messages during table creation.
Add support for file roll offs to HeavyConnect local and S3 file use cases.
Add HeavyConnect support for non-AWS S3-compatible endpoints.
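As referenced in the TRY_CAST item above, a short sketch of fault-tolerant casting against a hypothetical raw_orders table; TRY_CAST returns NULL instead of raising an error when a value cannot be converted:
SELECT
  TRY_CAST(price_str AS DOUBLE) AS price,
  TRY_CAST(order_date_str AS DATE) AS order_date
FROM raw_orders;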
LiDAR
Add tf_point_cloud_metadata
table function to read metadata from one or more LiDAR/point cloud files, optionally filtered by a bounding box.
Add tf_load_point_cloud
table function to load data from one or more LiDAR/point cloud files, optionally filtered by bounding box and optionally cached in memory for subsequent queries.
Graph and Path Functions
Add tf_graph_shortest_path
table function to compute the shortest edge-weighted path between two points in a graph constructed from an input edge list.
Add tf_graph_shortest_paths_distances
table function to compute the shortest edge-weighted distances between a starting point and all other points in a graph constructed from an input edge list.
Add tf_grid_graph_shortest_slope_weighted_path
table function to compute the shortest slope-weighted path between two points along rasterized data.
Enhanced Spatial Aggregations
Support configurable aggregation types for tf_geo_rasterize
and tf_geo_rasterize_slope
table functions, allowing for AVG, MIN, MAX, SUM, and COUNT aggregations.
Support two-pass gaussian blur aggregation post-processing for tf_geo_rasterize
and tf_geo_rasterize_slope
table functions.
RF Propagation Extension Improvements
Add dynamic ray splitting to tf_rf_prop_max_signal
table function for improved performance and terrain coverage.
Add variant of tf_rf_prop_max_signal
table function that takes per-RF source/tower transmission power (watts) and frequency (MHz).
Add variant of generate_series
table function that generates series of timestamps between a start and end timestamp at specified time intervals.
ST_Centroid now automatically picks up SRID of underlying geometry.
Fixed a crash that occurred when ST_DISTANCE had an ST_POINT input for its hash table probe column.
Fixed an issue where a query hint would not propagate to a subquery.
Improved overloaded table function type deduction eliminates type mismatches when table function outputs are used downstream.
Properly handle cases of RF sources outside of terrain bounding box for tf_rf_prop_max_signal
.
Fixed an issue where specification of unsupported GEOMETRY column type during table creation could lead to a crash.
Fixed a crash that could occur due to execution of concurrent create and drop table commands.
Fixed a crash that could occur when accessing the Dashboards system table.
Fixed a crash that could occur as a result of type mismatches in ITAS queries.
Fixed an issue that could occur due to band name sanitization during raster imports.
Fixed a memory leak that could occur when dropping temporary tables.
Fixed a crash that could occur due to concurrent execution of a select query and long-running write query on the same table.
Disables render group assignment by default.
Supports rendering of MULTILINESTRING geometries.
Memory footprint required for compositing renders on multi-GPU systems is significantly reduced. Any multi-GPU system will see improvements, but the change is most noticeable on systems with 4 or more GPUs. For example, rendering a 1400 x 1400 image saves approximately 450 MB of memory when using 8 GPUs for a query. Multi-GPU system configurations should be able to set the res-gpu-mem
configuration flag value lower as a result, freeing memory for other subsystems.
Adds INFO logging of peak render memory usage for the lifetime of the server process. The render memory logged is peak render query output buffer size (controlled with the render-mem-bytes
configuration flag) and peak render buffer usage (controlled with the res-gpu-mem
configuration flag). These peaks are logged in the INFO log on server shutdown, when GPU memory is cleared via clear_gpu_memory
endpoint, or when a new peak is reached. These logged peaks can be useful to adjust the render-mem-bytes
and res-gpu-mem
configuration flags to improve memory utilization by avoiding reserving memory that might go unused. Examples of the log messages:
When a new peak render-mem-bytes
is reached: New peak render buffer usage (render-mem-bytes):37206200 of 1000000000
When a new peak res-gpu-mem
is reached: New peak render memory usage (res-gpu-mem): 166033024
Peaks logged on server shutdown or on clear_gpu_memory
:
Render memory peak utilization:
Query result buffer (render-mem-bytes): 37206200 of 1000000000
Images and buffers (res-gpu_mem): 660330240
Total allocated: 1660330240
Fixed an issue that occurred when trying to hit-test a multiline SQL expression.
Dashboard and chart image export
Crossfilter replay
Improved popup support in the base 3D chart
New Multilayer CPU rendered Geo charts: Pointmap, Linemap, and Choropleth (Beta)
Control Panel (Beta)
Redshift, Snowflake, and PostGIS HeavyConnect support (Beta)
Skew-T chart (Beta)
Support for limiting the number of charts in a dashboard through the ui/limit_charts_per_dashboard
feature flag. The default value is 0 (no limit).
Fixed an importer error related to duplicate column names.
Various bug fixes and user-interface improvements.
Adds support for POLYGON to MULTIPOLYGON promotion in the load table Thrift APIs and SQLImporter.
Fixes an issue that caused an intermittent KafkaImporter crash on CentOS 7.9.
Fixes an issue that caused incorrect results in multiple aggregations of date columns that include COUNT DISTINCT.
Adds support for limiting the number of charts in a dashboard through the ui/limit_charts_per_dashboard
feature flag. The default value is 0 (no limit).
Adds a new set of log-based (request_logs, server_logs, web_server_logs, and web_server_access_logs) system tables.
Adds a new Request Logs and Monitoring dashboard.
Adds a new SHOW CREATE SERVER command, which displays the create server DDL for a specified foreign server.
Adds support for non-super-user execution of SHOW CREATE TABLE on views.
Adds a new ALTER SESSION SET EXECUTOR_DEVICE command, which updates the type of executor device (CPU or GPU) for the current session.
Adds a new ALTER SESSION SET CURRENT_DATABASE command, which updates the connected database for the current session.
Adds a new ALTER DATABASE OWNER TO command, which allows super users to change the owner of a database.
Extends the INSERT command to support inserting multiple rows at once/batch insert.
Add support for default values on shard key columns.
Add initial support for window function framing, including support for the BETWEEN ROWS clause for all numeric and date/time types, and the BETWEEN RANGE clause for numeric types (see the example below).
Enable group-by push down for UNION ALL such that group-by and aggregate operations applied to the output of UNION ALL are evaluated on the UNION ALL inputs, improving performance.
Add support for LCASE (alias for LOWER), UCASE (alias for UPPER), LEFT, and RIGHT string functions
Adds a new trim_spaces option for delimited file import.
(BETA) Adds data import/COPY FROM support from Relational Database Management Systems and Data Warehouses using the Open Database Connectivity (ODBC) interface.
Initial support for CUDA streams to parallelize GPU computation and memory transfers.
Increase per-GPU projection limit with watchdog enabled from 32M to 128M rows to take advantage of improvements in large projection support in recent releases.
Add new SHOW FUNCTIONS and SHOW FUNCTIONS DETAILS commands to show registered compile-time UDFs and extension functions in the system and their arguments, and SHOW RUNTIME FUNCTIONS [DETAILS] and SHOW RUNTIME TABLE FUNCTIONS [DETAILS] to show user-defined runtime functions/table functions.
Support timestamp inputs and outputs for table functions.
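As referenced in the window function framing item above, a sketch of a ROWS-based frame computing a trailing four-row moving average; the events table and its columns are hypothetical:
SELECT
  entity_id,
  ts,
  AVG(val) OVER (
    PARTITION BY entity_id
    ORDER BY ts
    ROWS BETWEEN 3 PRECEDING AND CURRENT ROW
  ) AS moving_avg
FROM events;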
Advanced Analytics
Add tf_compute_dwell_times table function, which, given a query input with entity keys and timestamps, and parameters specifying the minimum session time, minimum number of session records, and max inactive seconds, outputs all unique sessions found in the data with the duration of each session (dwell time).
Add tf_feature_self_similarity table function, that given a query input of entity keys/IDs, a set of feature columns, and a metric column, scores each pair of entities based on their similarity, computed as the cosine similarity of the feature column(s) between each entity pair, which can optionally be TF/IDF weighted.
Add tf_feature_similarity table function, which, given a query input of entity keys, feature columns, and a metric column, and a second query input specifying a search vector of feature columns and a metric, scores the similarity of each entity in the first input to the search vector, computed as the cosine similarity of the feature column(s) for each entity with those of the search vector, optionally TF/IDF weighted.
Fixed an issue where some join queries on ODBC-backed foreign tables can return empty result sets for the first query.
Fixed an issue where append refreshes on foreign tables backed by delimited or regex-parsed files ignore file-path filter and sort options.
Fixed a crash that can occur when very large dates are specified for the refresh_start_date_time foreign table option.
Fixed a crash that can occur when a foreign table’s data source is updated within a refresh window.
Fixed an issue where databases owned by deleted user accounts are not visible, and adds a restriction that prevents dropping users who own databases.
Fixed an issue where joins on string dictionary-encoded columns would trigger spurious none-encoded string translation.
Fixed issue with certain UNION ALL query patterns, such as UNION ALL containing logical values.
Disabled KEY_FOR_STRING for UNNEST operations on string dictionary-encoded columns, to prevent a crash.
Fixed an issue where logged stats for raster imports could overflow.
Fixed an issue where joins on synthetic tables (for example, created with a VALUES statement or table function without an underlying table) could crash.
Fixed an issue where require checks used on string dictionary inputs to a table function could crash.
Fixed a crash and/or wrong query results that can occur when a decimal literal is used in a nested query.
Fixed a potential crash when attempting to auto-retry a render immediately after an OutOfGpuMemory exception is thrown. This crash can occur only if the render-oom-retry-threshold
configuration option is set.
Fixed a regression where polygons with transparent colors are rendered opaque.
Corrects an issue with point/symbol rendering with explicit Vega projections where the projection was not being updated when panned/zoomed if the query did not change.
Significant improvements in hit-testing consistency and stability when rendering queries with subqueries, window functions, or table functions.
Font size controls.
Borders and Zebra Striping in Table charts.
Justify content in Table charts.
Customizable polygon border control.
Allow measure date formatting for table charts.
Extend y-axis on Vega combo charts to end at the next whole value past the highest data point.
Add layer visibility toggle to kebab dropdown on multi-layer raster charts.
Made unsaved changes modal less aggressive.
Custom Source Table Functions Browser
Don’t show unsaved warning modal after adding default filter set.
(BETA) PostgreSQL connector.
Allows maxBounds to be set in servers.json.
Toggle dashboard unsaved when updating annotations.
Dashboard save state behavior fixes.
Table Chart order by group keys when present.
Use key_for_string when ordering by known dictionary measures/dimensions.
Add default formatting for date/time on table chart.
Add admin feature flag to hide key manager.
Customizable polygon border color and existing border bug fixes.
Cannot append to table using PostgreSQL connector.
Building a raster chart with the layer visibility toggle feature flag enabled causes a crash.
Support for fast string functions on dictionary-encoded text columns (the default), including LOWER, UPPER, INITCAP, TRIM/LTRIM/RTRIM, LPAD/RPAD, REVERSE, REPEAT, SUBSTRING/SUBSTR, REPLACE, OVERLAY, SPLIT_PART, REGEXP_REPLACE, REGEXP_SUBSTR, and CONCAT (||). The output of these expressions can be chained, grouped-by, and used in both the left and right side of join predicates.
Support for fast string equality/inequality operations without the previous requirement of watchdog disablement when the two columns do not share dictionaries.
Support for fast case statements with multiple text column inputs that do not share dictionary-encoded strings.
Support for ENCODE_TEXT to encode none-encoded strings, which can then be grouped on and manipulated like dictionary-encoded strings. This operator is not intended for interactive use at scale but instead for ELT-like scenarios. Use the new server flag watchdog-none-encoded-string-translation-limit
to set the upper cardinality allowed for such operations (1,000,000 by default).
Support for UNION ALL is enabled by default, and now works across string columns that do not share dictionaries with significantly better performance than in the previous release.
Window functions now support expressions in the PARTITION BY and ORDER BY clauses.
Support for subqueries in CASE statement clauses.
SHOW USER DETAILS is changed to only list those users with access to the currently-selected database. Previously, all users on the HeavyDB instance would be listed; this is still available to superusers with SHOW ALL USER DETAILS.
10X improvements in initial join performance (including geo joins) through faster, parallelized hash table construction, removing redundant inter-thread hash table computation.
Improved join ordering to avoid loop joins in certain scenarios.
Parallel compilation of queries as well as inter-executor generated code increases concurrency and throughput in common, Immerse-driven scenarios by up to 20%. Also decreases latency for a single user interacting with dashboards or issuing SQL queries in a way that required new plans to be code-generated.
New result set recycler allows query substeps (expensive in subqueries) to be cached using SQL hints (/*+ keep_result */), dramatically improving performance where the subquery is reused across multiple queries (for example, in Immerse) and only outer steps of the query vary.
The default for the header option of COPY TO
to a CSV/TSV file has been changed from 'false'
to 'true'
.
Faster dictionary map in StringDictionaryProxy, accelerating various string operations involving transient entries.
Arrow execution endpoints now use multiple executors and can run concurrently like queries issued to the Thrift endpoints.
Addition of sparse dictionary output capability for Arrow queries, which automatically creates a subset of a string dictionary to send via Arrow when it detects that it is faster than sending the full, unfiltered dictionary. This provides orders-of-magnitude better server- and client-side performance and scalability for common cases where large dictionary-encoded text columns are filtered or top-k sorted such that only a small subset of dictionary entries are needed in the result set.
ST_INTERSECTS now can operate directly on top of compressed (the default) coordinates, leading to 2-3X increase in speed.
New table function framework allows for both system and user-defined table functions. Table functions can run on both CPU and GPU and are designed for efficient, scalable execution of custom algorithms in-situ on data that might be hard or impossible to implement in SQL.
Support for generate_series table function (similar to Postgres) for easy and fast integer series generation, particularly useful for left joins against binned tables to fill in gaps, whether for visualization or downstream operations like window functions, and generate_random_strings for generation of string columns of a user-defined size and cardinality.
Support for geo_rasterize and geo_rasterize_slope table functions to efficiently bin vector data into gap-free bins, with the optional ability to fill in null values, apply box blur, and compute slope and aspect ratios
Initial support for HeavyRF, a module that allows for real-time, ray-based computation of signal propagation, taking inputs of both terrain data and real or hypothetical signal sources.
Beta support for Python-defined scalar (row-level) and tabular User Defined Functions (UDFs and UDTFs), using the RBC library to translate Numba python code into LLVM IR that is JITed into query execution code for fast, scalable, custom user-defined capabilities.
Complete redesign and rewrite of Parquet import to one that is more robust, efficient, and performant.
Adds support for import from regex parsed files on either the server file system or S3 using the COPY FROM command.
The geo
and parquet
WITH
options of COPY FROM
have been deprecated and replaced by source_type
. Using the deprecated syntax generates the following:
Deprecation Warning: COPY FROM WITH (geo='true') is deprecated. Use WITH (source_type='geo_file') instead.
Update any scripts you have to replace the deprecated syntax with the new syntax. For more information, see CSV/TSV Import.
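For example, a script using the deprecated option could be updated as follows (the table name and file path are placeholders):
-- Deprecated syntax:
COPY my_geo_table FROM '/path/to/data.geojson' WITH (geo='true');

-- Replacement syntax:
COPY my_geo_table FROM '/path/to/data.geojson' WITH (source_type='geo_file');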
(BETA) Adds support for import from RDMS/data warehouses using the COPY FROM command.
Adds system table support.
A new default information_schema database contains 10 new system tables that provide information regarding CPU/GPU memory utilization, storage space utilization, database objects, and database object permissions.
New system dashboards that enable intuitive visualization of system resource utilization and user roles and permissions.
Support for Zarr and NetCDF raster file import.
You can now import raster files with ground control points geospatial references.
Support for file path filtering, globbing, and sorting when importing geo and raster files.
Improved error messaging when attempting to save a dashboard that uses a duplicate dashboard name.
Support for connections to delimited files on either the server file system or S3. S3 support includes an option to use the S3 Select API, which provides better performance but with limitations on supported column types.
Support for connections to Parquet files on either the server file system or S3. HeavyConnect leverages Parquet metadata to provide efficient data access and row group-level filter push down.
Parquet column type coercion. Convert Parquet column types to more memory-efficient HeavyDB column types for use cases that guarantee no loss of information.
Connections to regex parsed files on either the server file system or S3. This enables you to query unstructured text files, such as logs, by specifying regular expression patterns that extract components of the text files into table columns.
(BETA) Support for connections to Relational Database Management Systems and Data Warehouses, leveraging the Open Database Connectivity (ODBC) interface to provide seamless access to data.
(BETA) ODBC column type coercion. Use HeavyConnect to convert ODBC column types to more memory-efficient HeavyDB column types for use cases that guarantee no loss of information.
Support for scheduled data refreshes. Specify a start date time and interval at which connected data gets refreshed.
Adds support for disk-level caching. By default, data fetched by HeavyConnect is cached at the disk level in addition to normal CPU/GPU-level caching. This provides better overall query performance for network-based connections, such as S3, and for systems with limited CPU/GPU memory capacity. Disk cache size and level can be set through the HeavyDB server configuration.
Adds support for file path filtering, globbing, and sorting for Parquet, delimited, and regex parsed file use cases.
Complete redesign and rewrite of the Parquet detect_column_types Thrift API. The Parquet detect/data preview feature is now more robust, efficient, and performant.
Change to query interrupt mechanism allowing certain classes of queries such as loop joins to be easily and quickly interrupted.
Fixed a crash that could occur with joins on predicates that had functions on the left-hand side expression (for example, geoToH3).
Fix crash that could occur with Arrow queries that did not return results.
Avoid building metadata for empty result sets.
Fixes a crash that can occur when executing queries on GPU that involve a baseline group by and variable length column projections.
Fixes some table query concurrency bottlenecks. Previously, queries such as INSERT, TRUNCATE, and DROP TABLE required system wide locks to execute and would therefore block execution of other unrelated queries. These kinds of queries can now be executed concurrently.
Fixes a crash that can occur on server restart when the disk cache is enabled and tables with cached data are deleted.
Fixes a crash that can occur when the max_rows table option is altered for an empty table.
Fixes an issue in the JDBC driver where tables from multiple databases are listed even when a single database is specified.
Fixes an issue where raster POINT column type import would incorrectly throw an exception.
Fixes a crash that can occur when restoring a dump for a table with previously deleted columns.
Updates the export COPY TO command to include headers by default.
Removes the file_type parameter from the create_table Thrift API. This parameter was not used.
Fixes a crash that can occur when executing SQL commands containing comments.
Fixed the setting for default database (DEFAULT_DB) being ignored in a SAML login for a user who already exists.
The OpenGL renderer driver has been fully removed as of this release. Vulkan is the only available driver and enables a more modern, flexible API. As a result, the renderer-use-vulkan-driver
program option has been removed. Remove any references to that program option from your configuration files. For more on the move to the Vulkan driver, see Vulkan Renderer.
A novel polygon rendering algorithm is now used as the default when rendering polygons. This algorithm does no triangulation nor does it require “render groups” (a hidden column to assist the old polygon rendering algorithm). However, the render groups column is still added on import as a fallback. See Importing Geospatial Data for more on render group deprecation.
You can now hit-test certain render queries with subqueries more effectively. For example, if the subquery is only used for filter predicates, renders should now be sped up and hit-testing more flexible.
Render times are now logged correctly (“render_vega-COMPLETED nonce:2 Total Execution: (ms), Total Render: (ms)”). The execution time and render time were incorrectly logged as 0 in Releases 5.9 and 5.10.
Fixes a regression introduced in Release 5.10.0 when hit-testing an Immerse cohort-generated query. The hit-test would result in an error such as the following: “Cannot find column in hit-test cache for query …”
Resolves a crash when trying to hit-test render queries with window functions or cursorless table functions.
Fixes an issue where a multi-layer, multi-GPU render with a poly or line mark as the first layer can result in ghosting artifacts if the query associated with that layer resulted in 0 rows.
Fixes an issue when switching from a density accumulation scale with an auto-computed range (via min/max/+-1stStdDev/+-2ndStdDev) to a scale with an explicitly defined range. In this case, the explicitly defined range was not reflected.
Removes a legacy constraint that prevented you from rendering a query that referenced one or more tables with more than one polygon/multipolygon column.
Improved speed of server interface using the Thrift binary protocol.
Data Manager has been redesigned to support HeavyConnect via S3, server file uploads, and expanded raster file support.
Introduced the new Gauge chart type.
Introduced a Welcome Panel and Help Center menu.
Rebranded interface for HEAVY.AI. Updated styles for the default dark and light themes.
Added option to toggle the legend on the New Combo chart.
Added configuration option for setting the default chart type.
Added configuration option for hiding specified chart types.
Added auto-selection of geo columns and measures on geo chart types.
Adjusted maximum bins for larger Top-N groups.
Added support for cross-domain configuration without SSL.
BETA: Added filter support for global custom expressions.
BETA: Introduced the new iframe chart type.
BETA: Introduced Arrow transport protocol for a limited number of chart types.
Fixed various minor UI and performance issues.
Fixed parameter creation from dashboard title in Safari browser.
Fixed displaying of the Jupyter logo when integration is unavailable.
Given a query input with entity keys (for example, user IP addresses) and timestamps (for example, page visit timestamps), and parameters specifying the minimum session time, the minimum number of session records, and the max inactive seconds, outputs all unique sessions found in the data with the duration of the session (dwell time).
entity_id
Column containing keys/IDs used to identify the entities for which dwell/session times are to be computed. Examples include IP addresses of clients visiting a website, login IDs of database users, MMSIs of ships, and call signs of airplanes.
Column<TEXT ENCODING DICT | BIGINT>
site_id
Column containing keys/IDs of dwell “sites” or locations that entities visit. Examples include website pages, database session IDs, ports, airport names, or binned h3 hex IDs for geographic location.
Column<TEXT ENCODING DICT | BIGINT>
ts
Column denoting the time at which an event occurred.
Column<TIMESTAMP(0|3|6|9)>
min_dwell_seconds
Constant integer value specifying the minimum number of seconds required between the first and last timestamp-ordered record for an entity_id at a site_id to constitute a valid session and compute and return an entity’s dwell time at a site. For example, if this variable is set to 3600 (one hour), but only 1800 seconds elapses between an entity’s first and last ordered timestamp records at a site, these records are not considered a valid session and a dwell time for that session is not calculated.
BIGINT (other integer types are automatically casted to BIGINT)
min_dwell_points
A constant integer value specifying the minimum number of successive observations (in ts
timestamp order) required to constitute a valid session and compute and return an entity’s dwell time at a site. For example, if this variable is set to 3, but only two consecutive records exist for a user at a site before they move to a new site, no dwell time is calculated for the user.
BIGINT (other integer types are automatically casted to BIGINT)
max_inactive_seconds
A constant integer value specifying the maximum time in seconds between two successive observations for an entity at a given site before the current session/dwell time is considered finished and a new session/dwell time is started. For example, if this variable is set to 86400 seconds (one day), and the time gap between two successive records for an entity id at a given site id is 86500 seconds, the session is considered ended at the first timestamp-ordered record, and a new session is started at the timestamp of the second record.
BIGINT (other integer types are automatically casted to BIGINT)
entity_id
The ID of the entity for the output dwell time, identical to the corresponding entity_id
column in the input.
Column<TEXT ENCODING DICT> | Column<BIGINT> (type is the same as the entity_id
input column type)
site_id
The site ID for the output dwell time, identical to the corresponding site_id
column in the input.
Column<TEXT ENCODING DICT> | Column<BIGINT> (type is the same as the site_id
input column type)
prev_site_id
The site ID for the session preceding the current session, which might be a different site_id
, the same site_id
(if successive records for an entity at the same site were split into multiple sessions because the max_inactive_seconds
threshold was exceeded), or null
if the last site_id
visited was null
.
Column<TEXT ENCODING DICT> | Column<BIGINT> (type is the same as the site_id
input column type)
next_site_id
The site id for the session after the current session, which might be a different site_id
, the same site_id
(if successive records for an entity at the same site were split into multiple sessions due to exceeding the max_inactive_seconds
threshold), or null
if the next site_id
visited was null
.
Column<TEXT ENCODING DICT> | Column<BIGINT> (type will be the same as the site_id
input column type)
session_id
An auto-incrementing session ID specific/relative to the current entity_id
, starting from 1 (first session) up to the total number of valid sessions for an entity_id
, such that each valid session dwell time increments the session_id
for an entity by 1.
Column<INT>
start_seq_id
The index of the nth timestamp (ts
-ordered) record for a given entity denoting the start of the current output row's session.
Column<INT>
dwell_time_sec
The duration in seconds for the session.
Column<INT>
num_dwell_points
The number of records/observations constituting the current output row's session.
Column<INT>
Example
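A minimal sketch of calling tf_compute_dwell_times against a hypothetical web_events table. The three column arguments are supplied through a single CURSOR subquery in the order listed above, followed by the scalar parameters:
SELECT * FROM TABLE(
  tf_compute_dwell_times(
    CURSOR(SELECT user_ip, page_id, visit_ts FROM web_events),
    3600,   -- min_dwell_seconds: sessions must span at least one hour
    3,      -- min_dwell_points: sessions must contain at least 3 records
    86400   -- max_inactive_seconds: split sessions after a day of inactivity
  )
)
ORDER BY entity_id, session_id;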
Installing OmniSci on Docker
In this section, you will find the recipes to install the HEAVY.AI platform using Docker.
In this section, you will find a recipe to install the HEAVY.AI platform on Red Hat and derivatives like Rocky Linux.
Returns metadata for one or more las
or laz
point cloud/LiDAR files from a local file or directory source, optionally constraining the bounding box for metadata retrieved to the lon/lat bounding box specified by the x_min
, x_max
, y_min
, y_max
arguments.
Note: The specified path must be contained in the global allowed-import-paths setting; otherwise, an error is returned.
Input Arguments
path
The path of the file or directory containing the las/laz file or files. Can contain globs. Path must be in allowed-import-paths
.
TEXT ENCODING NONE
x_min
(optional)
Min x-coordinate value for point cloud files to retrieve metadata from.
DOUBLE
x_max
(optional)
Max x-coordinate value for point cloud files to retrieve metadata from.
DOUBLE
y_min
(optional)
Min y-coordinate value for point cloud files to retrieve metadata from.
DOUBLE
y_max
(optional)
Max y-coordinate value for point cloud files to retrieve metadata from.
DOUBLE
Output Columns
file_path
Full path for the las or laz file
Column<TEXT ENCODING DICT>
file_name
Filename for the las or laz file
Column<TEXT ENCODING DICT>
file_source_id
File source id per file metadata
Column<SMALLINT>
version_major
LAS version major number
Column<SMALLINT>
version_minor
LAS version minor number
Column<SMALLINT>
creation_year
Data creation year
Column<SMALLINT>
is_compressed
Whether data is compressed, i.e. LAZ format
Column<BOOLEAN>
num_points
Number of points in this file
Column<BIGINT>
num_dims
Number of data dimensions for this file
Column<SMALLINT>
point_len
Not currently used
Column<SMALLINT>
has_time
Whether data has time value
COLUMN<BOOLEAN>
has_color
Whether data contains rgb color value
COLUMN<BOOLEAN>
has_wave
Whether data contains wave info
COLUMN<BOOLEAN>
has_infrared
Whether data contains infrared value
COLUMN<BOOLEAN>
has_14_point_format
Data adheres to 14-attribute standard
COLUMN<BOOLEAN>
specified_utm_zone
UTM zone of data
Column<INT>
x_min_source
Minimum x-coordinate in source projection
Column<DOUBLE>
x_max_source
Maximum x-coordinate in source projection
Column<DOUBLE>
y_min_source
Minimum y-coordinate in source projection
Column<DOUBLE>
y_max_source
Maximum y-coordinate in source projection
Column<DOUBLE>
z_min_source
Minimum z-coordinate in source projection
Column<DOUBLE>
z_max_source
Maximum z-coordinate in source projection
Column<DOUBLE>
x_min_4326
Minimum x-coordinate in lon/lat degrees
Column<DOUBLE>
x_max_4326
Maximum x-coordinate in lon/lat degrees
Column<DOUBLE>
y_min_4326
Minimum y-coordinate in lon/lat degrees
Column<DOUBLE>
y_max_4326
Maximum y-coordinate in lon/lat degrees
Column<DOUBLE>
z_min_4326
Minimum z-coordinate in meters above sea level (AMSL)
Column<DOUBLE>
z_max_4326
Maximum z-coordinate in meters above sea level (AMSL)
Column<DOUBLE>
Example
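A minimal sketch, using a placeholder directory (which must fall under allowed-import-paths) and an optional lon/lat bounding box supplied positionally in the order listed above:
SELECT file_name, num_points, x_min_4326, x_max_4326, y_min_4326, y_max_4326
FROM TABLE(
  tf_point_cloud_metadata(
    '/var/lib/heavyai/import/lidar/*.laz',  -- path (placeholder)
    -122.6, -122.3,                         -- x_min, x_max (longitude)
    37.6, 37.9                              -- y_min, y_max (latitude)
  )
);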
Given a query input of entity keys, feature columns, and a metric column, and a second query input specifying a search vector of feature columns and metric, computes the similarity of each entity in the first input to the search vector based on their similarity. The score is computed as the cosine similarity of the feature column(s) for each entity with the feature column(s) for the search vector, which can optionally be TF/IDF weighted.
primary_key
Column containing keys/entity IDs that can be used to uniquely identify the entities for which the function will compute the similarity to the search vector specified by the comparison_features
cursor. Examples include countries, census block groups, user IDs of website visitors, and aircraft call signs.
Column<TEXT ENCODING DICT | INT | BIGINT>
pivot_features
One or more columns constituting a compound feature. For example, two columns of visit hour and census block group would compare entities specified by primary_key
based on whether they visited the same census block group in the same hour. If a single census block group feature column is used, the primary_key
entities are compared only by the census block groups visited, regardless of time overlap.
Column<TEXT ENCODING DICT | INT | BIGINT>
metric
Column denoting the values used as input for the cosine similarity metric computation. In many cases, this is simply COUNT(*)
such that feature overlaps are weighted by the number of co-occurrences.
Column<INT | BIGINT | FLOAT | DOUBLE>
comparison_pivot_features
One or more columns constituting a compound feature for the search vector. These should match the pivot_features columns in number of sub-features, types, and semantics.
Column<TEXT ENCODING DICT | INT | BIGINT>
comparison_metric
Column denoting the values used as input for the cosine similarity metric computation from the search vector. In many cases, this is simply COUNT(*)
such that feature overlaps are weighted by the number of co-occurrences.
Column<TEXT ENCODING DICT | INT | BIGINT>
use_tf_idf
Boolean constant specifying whether TF/IDF weighting should be used in the cosine similarity computation.
BOOLEAN
class
ID of the primary key
being compared against the search vector.
Column<TEXT ENCODING DICT | INT | BIGINT> (type is the same as the primary_key
input column)
similarity_score
Computed cosine similarity score between each primary_key
pair, with values falling between 0 (completely dissimilar) and 1 (completely similar).
Column<FLOAT>
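A sketch of a tf_feature_similarity call against a hypothetical flights table, comparing every aircraft to the airport-visit profile of one aircraft. The first CURSOR supplies primary_key, pivot_features, and metric; the second supplies the comparison pivot features and metric:
SELECT * FROM TABLE(
  tf_feature_similarity(
    CURSOR(SELECT tail_number, dest_airport, COUNT(*)
           FROM flights
           GROUP BY tail_number, dest_airport),
    CURSOR(SELECT dest_airport, COUNT(*)
           FROM flights
           WHERE tail_number = 'N12345'
           GROUP BY dest_airport),
    false   -- use_tf_idf
  )
)
ORDER BY similarity_score DESC;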
Given a query input of entity keys/IDs (for example, airplane tail numbers), a set of feature columns (for example, airports visited), and a metric column (for example number of times each airport was visited), scores each pair of entities based on their similarity. The score is computed as the cosine similarity of the feature column(s) between each entity pair, which can optionally be TF/IDF weighted.
primary_key
Column containing keys/entity IDs that can be used to uniquely identify the entities for which the function computes co-similarity. Examples include countries, census block groups, user IDs of website visitors, and aircraft callsigns.
Column<TEXT ENCODING DICT | INT | BIGINT>
pivot_features
One or more columns constituting a compound feature. For example, two columns of visit hour and census block group would compare entities specified by primary_key
based on whether they visited the same census block group in the same hour. If a single census block group feature column is used, the primary_key
entities would be compared only by the census block groups visited, regardless of time overlap.
Column<TEXT ENCODING DICT | INT | BIGINT>
metric
Column denoting the values used as input for the cosine similarity metric computation. In many cases, this is COUNT(*)
such that feature overlaps are weighted by the number of co-occurrences.
Column<INT | BIGINT | FLOAT | DOUBLE>
use_tf_idf
Boolean constant specifying whether TF/IDF weighting should be used in the cosine similarity computation.
BOOLEAN
class1
ID of the first primary key
in the pair-wise comparison.
Column<TEXT ENCODING DICT | INT | BIGINT> (type is the same as the primary_key
input column)
class2
ID of the second primary key
in the pair-wise comparison. Because the computed similarity score for a pair of primary keys
is order-invariant, results are output only for ordering such that class1
<= class2
. For primary keys of type TextEncodingDict
, the order is based on the internal integer IDs for each string value and not lexicographic ordering.
Column<TEXT ENCODING DICT | INT | BIGINT> (type is the same as the primary_key
input column)
similarity_score
Computed cosine similarity score between each primary_key
pair, with values falling between 0 (completely dissimilar) and 1 (completely similar).
Column<Float>
Example
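A sketch of a tf_feature_self_similarity call against the same hypothetical flights table, scoring every pair of aircraft by the similarity of the airports they visit:
SELECT * FROM TABLE(
  tf_feature_self_similarity(
    CURSOR(SELECT tail_number, dest_airport, COUNT(*)
           FROM flights
           GROUP BY tail_number, dest_airport),
    true    -- use_tf_idf: down-weight very common airports
  )
)
ORDER BY similarity_score DESC
LIMIT 100;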
Given a distance-weighted directed graph, specified as a query CURSOR
input containing the starting and ending node for each edge and a distance, and a specified origin and destination node, tf_graph_shortest_path
computes the shortest distance-weighted path through the graph between origin_node
and destination_node
, returning a row for each node along the computed shortest path, with the traversal-ordered index of that node and the cumulative distance from the origin_node
to that node. If either origin_node
or destination_node
do not exist, an error is returned.
Input Arguments
node1
Origin node column in directed edge list CURSOR
Column< INT | BIGINT | TEXT ENCODED DICT>
node2
Destination node column in directed edge list CURSOR
Column< INT | BIGINT | TEXT ENCODED DICT> (must be the same type as node1
)
distance
Distance between origin and destination node in directed edge list CURSOR
Column< INT | BIGINT | FLOAT | DOUBLE >
origin_node
The origin node to start graph traversal from. If not a value present in edge_list.node1
, an empty result set is returned.
BIGINT | TEXT ENCODED DICT
destination_node
The destination node to finish graph traversal at. If not a value present in edge_list.node1
, an empty result set is returned.
BIGINT | TEXT ENCODED DICT
Output Columns
path_step
The index of this node along the path traversal from origin_node
to destination_node
, with the first node (the origin_node)
indexed as 1.
Column< INT >
node
The current node along the path traversal from origin_node
to destination_node
. The first node (as denoted by path_step
= 1) will always be the input origin_node
, and the final node (as denoted by MAX(path_step)
) will always be the input destination_node
.
Column < INT | BIGINT | TEXT ENCODED DICT> (same type as the node1
and node2
input columns)
cume_distance
The cumulative distance adding all input distance
values from the origin_node
to the current node.
Column < INT | BIGINT | FLOAT | DOUBLE> (same type as the distance
input column)
Example A
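A sketch using a hypothetical routes table with airport codes as node IDs. The edge list is supplied as a CURSOR of node1, node2, and distance, followed by the origin and destination nodes:
SELECT * FROM TABLE(
  tf_graph_shortest_path(
    CURSOR(SELECT origin_airport, dest_airport, distance_km FROM routes),
    'LAX',   -- origin_node
    'JFK'    -- destination_node
  )
)
ORDER BY path_step;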
Example B
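A variant of the same sketch with BIGINT node IDs from a hypothetical road_edges table:
SELECT * FROM TABLE(
  tf_graph_shortest_path(
    CURSOR(SELECT from_node_id, to_node_id, length_m FROM road_edges),
    1001,    -- origin_node
    2002     -- destination_node
  )
)
ORDER BY path_step;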
Aggregates point data into x/y bins of a given size in meters to form a dense spatial grid. The aggregate performed to compute the output value for each bin is specified by agg_type
, with allowed aggregate types of AVG
, COUNT
, SUM
, MIN
, and MAX
. If neighborhood_fill_radius
is set greater than 0, a blur pass/kernel will be computed on top of the results according to the optionally-specified fill_agg_type
, with allowed types of GAUSS_AVG, BOX_AVG
, COUNT
, SUM
, MIN
, and MAX
(if not specified, defaults to GAUSS_AVG
, or a Gaussian-average kernel). If fill_only_nulls
is set to true, only null bins from the first aggregate step have final output values computed from the blur pass; otherwise, all values are affected by the blur pass.
Note that the arguments to bound the spatial output grid (x_min, x_max, y_min, y_max) are optional; however, either all or none of these arguments must be supplied. If the arguments are not supplied, the spatial output grid is bounded by the x/y range of the input query, and if SQL filters are applied on the output of the tf_geo_rasterize
table function, these filters will also constrain the output range.
x
X-coordinate column or expression
Column<FLOAT | DOUBLE>
y
Y-coordinate column or expression
Column<FLOAT | DOUBLE>
z
Z-coordinate column or expression. The output value for each bin is computed by applying the specified aggregate (agg_type) to the z-values of all points falling in that bin.
Column<FLOAT | DOUBLE>
agg_type
The aggregate to be performed to compute the output z-column. Should be one of 'AVG'
, 'COUNT'
, 'SUM',
'MIN'
, or 'MAX'.
TEXT ENCODING NONE
fill_agg_type
(optional)
The aggregate to be performed when computing the blur pass on the output bins. Should be one of 'AVG'
, 'COUNT'
, 'SUM'
, 'MIN'
, 'MAX'
,
'GAUSS_AVG'
, or 'BOX_AVG'
. Note that AVG
is synonymous with GAUSS_AVG
in this context, and the default fill_agg_type
if not specified is GAUSS_AVG
.
TEXT ENCODING NONE
bin_dim_meters
The width and height of each x/y bin in meters. If geographic_coords
is not set to true, the input x/y units are already assumed to be in meters.
DOUBLE
geographic_coords
If true, specifies that the input x/y coordinates are in lon/lat degrees. The function will then compute a mapping of degrees to meters based on the center coordinate between x_min/x_max and y_min/y_max.
BOOLEAN
neighborhood_fill_radius
The radius in bins to compute the box blur/filter over, such that each output bin will be the average value of all bins within neighborhood_fill_radius
bins.
DOUBLE
fill_only_nulls
Specifies that the box blur should only be used to provide output values for null output bins (i.e. bins that contained no data points or had only data points with null Z-values).
BOOLEAN
x_min
(optional)
Min x-coordinate value (in input units) for the spatial output grid.
DOUBLE
x_max
(optional)
Max x-coordinate value (in input units) for the spatial output grid.
DOUBLE
y_min
(optional)
Min y-coordinate value (in input units) for the spatial output grid.
DOUBLE
y_max
(optional)
Max y-coordinate value (in input units) for the spatial output grid.
DOUBLE
x
The x-coordinates for the centroids of the output spatial bins.
Column<FLOAT | DOUBLE> (same as input x-coordinate column/expression)
y
The y-coordinates for the centroids of the output spatial bins.
Column<FLOAT | DOUBLE> (same as input y-coordinate column/expression)
z
The aggregated z-value (per agg_type) of all input data assigned to a given spatial bin.
Column<FLOAT | DOUBLE> (same as input z-coordinate column/expression)
Example
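A sketch of a tf_geo_rasterize call over a hypothetical lidar_points table. The x, y, and z columns are supplied through a single CURSOR, followed by the scalar arguments in the order listed above (optional arguments omitted):
SELECT * FROM TABLE(
  tf_geo_rasterize(
    CURSOR(SELECT lon, lat, elevation FROM lidar_points),
    'MAX',   -- agg_type
    10.0,    -- bin_dim_meters
    true,    -- geographic_coords: inputs are lon/lat degrees
    0,       -- neighborhood_fill_radius: no blur pass
    false    -- fill_only_nulls
  )
);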
Similar to tf_geo_rasterize
, but also computes the slope and aspect per output bin.
Aggregates point data into x/y bins of a given size in meters to form a dense spatial grid, computing the specified aggregate (using agg_type) across all points in each bin as the output value for the bin. A Gaussian average is then taken over the neighboring bins, with the number of bins specified by neighborhood_fill_radius, optionally only filling in null-valued bins if fill_only_nulls is set to true. The slope and aspect are then computed for every bin, based on the z-values of that bin and its neighboring bins. The slope can be returned in degrees or as a fraction between 0 and 1, depending on the boolean argument to compute_slope_in_degrees.
Note that the bounds of the spatial output grid will be bounded by the x/y range of the input query, and if SQL filters are applied on the output of the tf_geo_rasterize_slope
table function, these filters will also constrain the output range.
x
Input x-coordinate column or expression.
Column<FLOAT | DOUBLE>
y
Input y-coordinate column or expression.
Column<FLOAT | DOUBLE>
z
Input z-coordinate column or expression. The output bin is computed as the maximum z-value for all points falling in each bin.
Column<FLOAT | DOUBLE>
agg_type
The aggregate to be performed to compute the output z-column. Should be one of 'AVG', 'COUNT', 'SUM', 'MIN', or 'MAX'.
TEXT ENCODING NONE
bin_dim_meters
The width and height of each x/y bin in meters. If geographic_coords
is not set to true, the input x/y units are already assumed to be in meters.
DOUBLE
geographic_coords
If true, specifies that the input x/y coordinates are in lon/lat degrees. The function will then compute a mapping of degrees to meters based on the center coordinate between x_min/x_max and y_min/y_max.
BOOLEAN
neighborhood_fill_radius
The radius in bins to compute the box blur/filter over, such that each output bin will be the average value of all bins within neighborhood_fill_radius
bins.
BIGINT
fill_only_nulls
Specifies that the box blur should only be used to provide output values for null output bins (i.e. bins that contained no data points or had only data points with null Z-values).
BOOLEAN
compute_slope_in_degrees
If true, specifies the slope should be computed in degrees (with 0 degrees perfectly flat and 90 degrees perfectly vertical). If false, specifies the slope should be computed as a fraction from 0 (flat) to 1 (vertical). In a future release, we are planning to move the default output to percentage slope.
BOOLEAN
x
The x-coordinates for the centroids of the output spatial bins.
Column<FLOAT | DOUBLE> (same as input x column/expression)
y
The y-coordinates for the centroids of the output spatial bins.
Column<FLOAT | DOUBLE> (same as input y column/expression)
z
The maximum z-coordinate of all input data assigned to a given spatial bin.
Column<FLOAT | DOUBLE> (same as input z column/expression)
slope
The average slope of an output grid cell (in degrees or a fraction between 0 and 1, depending on the argument to compute_slope_in_degrees
).
Column<FLOAT | DOUBLE> (same as input z column/expression)
aspect
The direction from 0 to 360 degrees pointing towards the maximum downhill gradient, with 0 degrees being due north and moving clockwise from N (0°) -> NE (45°) -> E (90°) -> SE (135°) -> S (180°) -> SW (225°) -> W (270°) -> NW (315°).
Column<FLOAT | DOUBLE> (same as input z column/expression)
Example
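A minimal sketch of a tf_geo_rasterize_slope call, under the same assumptions as the previous example (hypothetical raster_points table, named-argument syntax, assumed raster cursor name); it keeps only the steeper bins:
SELECT x, y, z, slope, aspect
FROM TABLE(
  tf_geo_rasterize_slope(
    raster => CURSOR(SELECT lon, lat, elevation FROM raster_points),
    agg_type => 'MAX',
    bin_dim_meters => 30.0,
    geographic_coords => TRUE,
    neighborhood_fill_radius => 1,
    fill_only_nulls => FALSE,
    compute_slope_in_degrees => TRUE
  )
)
WHERE slope > 45.0;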
Aggregates point data into x/y bins of a given size in meters to form a dense spatial grid, computing the specified aggregate (using agg_type) across all points in each bin as the output value for the bin. A Gaussian average is then taken over the neighboring bins, with the number of bins specified by neighborhood_fill_radius, optionally only filling in null-valued bins if fill_only_nulls is set to true.
The graph shortest path is then computed between an origin point on the grid specified by origin_x and origin_y and a destination point on the grid specified by destination_x and destination_y, where the shortest path is weighted by the nth exponent of the computed slope between a bin and its neighbors, with the nth exponent specified by slope_weighted_exponent. A maximum allowed traversable slope can be specified by slope_pct_max, such that no traversal is considered or allowed between bins with absolute computed slopes greater than the percentage specified by slope_pct_max.
Input Arguments
x
Input x-coordinate column or expression of the data to be rasterized.
Column <FLOAT | DOUBLE>
y
Input y-coordinate column or expression of the data to be rasterized.
Column <FLOAT | DOUBLE> (must be the same type as x
)
z
Input z-coordinate column or expression of the data to be rasterized.
Column <FLOAT | DOUBLE>
agg_type
The aggregate to be performed to compute the output z-column. Should be one of 'AVG', 'COUNT', 'SUM', 'MIN', or 'MAX'.
TEXT ENCODING NONE
bin_dim
The width and height of each x/y bin. If geographic_coords is true, the input x/y units will be translated to meters according to a local coordinate transform appropriate for the x/y bounds of the data.
DOUBLE
geographic_coords
If true, specifies that the input x/y coordinates are in lon/lat degrees. The function will then compute a mapping of degrees to meters based on the center coordinate between x_min/x_max and y_min/y_max.
BOOLEAN
neighborhood_bin_radius
The radius in bins to compute the gaussian blur/filter over, such that each output bin will be the average value of all bins within neighborhood_fill_radius
bins.
BIGINT
fill_only_nulls
Specifies that the gaussian blur should only be used to provide output values for null output bins (i.e. bins that contained no data points or had only data points with null Z-values).
BOOLEAN
origin_x
The x-coordinate for the starting point for the graph traversal, in input (not bin) units.
DOUBLE
origin_y
The y-coordinate for the starting point for the graph traversal, in input (not bin) units.
DOUBLE
destination_x
The x-coordinate for the destination point for the graph traversal, in input (not bin) units.
DOUBLE
destination_y
The y-coordinate for the destination point for the graph traversal, in input (not bin) units.
DOUBLE
slope_weighted_exponent
The slope weight between neighboring raster cells will be weighted by the slope_weighted_exponent
power. A value of 1 signifies that the raw slopes between neighboring cells should be used; increasing this value above 1 more heavily penalizes paths that traverse steep slopes.
DOUBLE
slope_pct_max
The max absolute value of slopes (measured in percentages) between neighboring raster cells that will be considered for traversal. A neighboring graph cell with an absolute slope greater than this amount is not considered in the shortest slope-weighted path graph traversal.
DOUBLE
Output Columns
Computes the Mandelbrot set over the complex domain [x_min, x_max), [y_min, y_max), discretizing the xy-space into an output of dimensions x_pixels x y_pixels. The output for each cell is the number of iterations needed to escape to infinity, up to and including the specified max_iterations.
x_pixels
32-bit integer
y_pixels
32-bit integer
x_min
DOUBLE
x_max
DOUBLE
y_min
DOUBLE
y_max
DOUBLE
max_iterations
32-bit integer
Example
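A minimal sketch of a call that computes the classic view of the set on a 1024 x 1024 grid; the named-argument syntax and the exact function-variant name used are assumptions:
SELECT * FROM TABLE(
  tf_mandelbrot(
    x_pixels => 1024,
    y_pixels => 1024,
    x_min => -2.5,
    x_max => 1.0,
    y_min => -1.0,
    y_max => 1.0,
    max_iterations => 256
  )
);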
x_pixels
32-bit integer
y_pixels
32-bit integer
x_min
DOUBLE
x_max
DOUBLE
y_min
DOUBLE
y_max
DOUBLE
max_iterations
32-bit integer
x_pixels
32-bit integer
y_pixels
32-bit integer
x_min
DOUBLE
x_max
DOUBLE
y_min
DOUBLE
y_max
DOUBLE
max_iterations
32-bit integer
x_pixels
32-bit integer
y_pixels
32-bit integer
x_min
DOUBLE
x_max
DOUBLE
y_min
DOUBLE
y_max
DOUBLE
max_iterations
32-bit integer
Given a distance-weighted directed graph, specified as a query CURSOR input containing the starting and ending node for each edge and a distance, and a specified origin node, tf_graph_shortest_paths_distances computes the shortest distance-weighted path distance between the origin_node and every other node in the graph. It returns a row for each node in the graph, with output columns consisting of the input origin_node, the given destination_node, the distance for the shortest path between the two nodes, and the number of edges or graph "hops" between the two nodes. If origin_node does not exist in the node1 column of the edge_list CURSOR, an error is returned.
Input Arguments
node1
Origin node column in directed edge list CURSOR
Column<INT | BIGINT | TEXT ENCODED DICT>
node2
Destination node column in directed edge list CURSOR
Column<INT | BIGINT | TEXT ENCODED DICT> (must be the same type as node1
)
distance
Distance between origin and destination node in directed edge list CURSOR
Column<INT | BIGINT | FLOAT | DOUBLE>
origin_node
The origin node to start graph traversal from. If the value is not present in edge_list.node1, an empty result set is returned.
BIGINT | TEXT ENCODED DICT
Output Columns
origin_node
Starting node in graph traversal. Always equal to input origin_node
.
Column <INT | BIGINT | TEXT ENCODED DICT> (same type as the node1
and node2
input columns)
destination_node
Final node in graph traversal. Will be equal to one of values of node2
input column.
Column <INT | BIGINT | TEXT ENCODED DICT> (same type as the node1
and node2
input columns)
distance
Cumulative distance between origin and destination node for shortest path graph traversal.
Column<INT | BIGINT | FLOAT | DOUBLE> (same type as the distance
input column)
num_edges_traversed
Number of edges (or "hops") traversed in the graph to arrive at destination_node
from origin_node
for the shortest path graph traversal between these two nodes.
Column <INT>
Example A
Example B
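A minimal sketch, assuming a hypothetical flight_edges table whose columns map to node1, node2, and distance, and using the named-argument syntax:
SELECT origin_node, destination_node, distance, num_edges_traversed
FROM TABLE(
  tf_graph_shortest_paths_distances(
    edge_list => CURSOR(SELECT origin_airport, dest_airport, distance_miles
                        FROM flight_edges),
    origin_node => 'DEN'
  )
)
ORDER BY distance
LIMIT 10;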
Loads one or more las or laz point cloud/LiDAR files from a local file or directory source, optionally transforming the output SRID to out_srs (if not specified, output points are automatically transformed to EPSG:4326 lon/lat pairs).
If use_cache is set to true, an internal point cloud-specific cache is used to hold the results per input file; if the same source is queried again, the cache significantly speeds up the query, allowing for interactive querying of a point cloud source. If the results of tf_load_point_cloud will only be consumed once (for example, as part of a CREATE TABLE statement), it is highly recommended that use_cache is set to false or left unspecified (it defaults to false) to avoid the performance and memory overhead incurred by use of the cache.
The bounds of the data retrieved can optionally be specified with the x_min, x_max, y_min, y_max arguments. These arguments can be useful when you want to retrieve a small geographic area from a large point-cloud file set, because files containing data outside the specified bounding box are quickly skipped by tf_load_point_cloud, requiring only a quick read of the spatial metadata for each file.
Input Arguments
path
The path of the file or directory containing the las/laz file or files. Can contain globs. Path must be in allowed-import-paths
.
TEXT ENCODING NONE
out_srs
(optional)
EPSG code of the output SRID. If not specified, output points are automatically converted to lon/lat (EPSG 4326).
TEXT ENCODING NONE
use_cache
(optional)
If true, use internal point cloud cache. Useful for inline querying of the output of tf_load_point_cloud
. Should turn off for one-shot queries or when creating a table from the output, as adding data to the cache incurs performance and memory usage overhead. If not specified, is defaulted to false
/off.
BOOLEAN
x_min
(optional)
Min x-coordinate value (in degrees) for the output data.
DOUBLE
x_max
(optional)
Max x-coordinate value (in degrees) for the output data.
DOUBLE
y_min
(optional)
Min y-coordinate value (in degrees) for the output data.
DOUBLE
y_max
(optional)
Max y-coordinate value (in degrees) for the output data.
DOUBLE
Output Columns
x
Point x-coordinate
Column<DOUBLE>
y
Point y-coordinate
Column<DOUBLE>
z
Point z-coordinate
Column<DOUBLE>
intensity
Point intensity
Column<INT>
return_num
The ordered number of the return for a given LiDAR pulse. The first returns (lowest return numbers) are generally associated with the highest-elevation points for a LiDAR pulse, i.e. the forest canopy will generally have a lower return_num
than the ground beneath it.
Column<TINYINT>
num_returns
The total number of returns for a LiDAR pulse. Multiple returns occur when there are multiple objects between the LiDAR source and the lowest ground or water elevation for a location.
Column<TINYINT>
scan_direction_flag
Column<TINYINT>
edge_of_flight_line_flag
Column<TINYINT>
classification
Column<SMALLINT>
scan_angle_rank
Column<TINYINT>
Example A
Example B
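A minimal sketch that loads a hypothetical directory of laz tiles into a table in a single pass; use_cache is left off because the result is consumed only once:
CREATE TABLE lidar_points AS
SELECT x, y, z, intensity, classification
FROM TABLE(
  tf_load_point_cloud(
    path => '/data/lidar/tiles/*.laz',
    use_cache => FALSE
  )
);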
Getting Started with HEAVY.AI on Microsoft Azure
Follow these instructions to get started with HEAVY.AI on Microsoft Azure.
You must have a Microsoft Azure account. If you do not have an account, go to the Microsoft Azure home page to sign up for one.
To launch HEAVY.AI on Microsoft Azure, you configure a GPU-enabled instance.
1) Log in to your Microsoft Azure portal.
2) On the left side menu, create a Resource group, or use one that your organization has created.
3) On the left side menu, click Virtual machines, and then click Add.
4) Create your virtual machine:
On the Basics tab:
In Project Details, specify the Resource group.
Specify the Instance Details:
Virtual machine name
Region
Image (Ubuntu 16.04 or higher, or CentOS/RHEL 7.0 or higher)
Size. Click Change size and use the Family filter to filter on GPU, based on your use case and requirements. Not all GPU VM variants are available in all regions.
For Username, add any user name other than admin.
In Inbound Port Rules, click Allow selected ports and select one or more of the following:
HTTP (80)
HTTPS (443)
SSH (22)
On the Disks tab, select Premium or Standard SSD, depending on your needs.
For the rest of the tabs and sections, use the default values.
5) Click Review + create. Azure reviews your entries, creates the required services, deploys them, and starts the VM.
6) Once the VM is running, select the VM you just created and click the Networking tab.
7) Click the Add inbound button and configure security rules to allow any source, any destination, and destination port 6273 so you can access Heavy Immerse from a browser on that port. Consider renaming the rule to 6273-Immerse or something similar so that the default name makes sense.
8) Click Add and verify that your new rule appears.
Azure-specific configuration is complete. Now, follow standard HEAVY.AI installation instructions for your Linux distribution and installation method.
Window functions allow you to work with a subset of rows related to the currently selected row. For a given dimension, you can find the most associated dimension by some other measure (for example, number of records or sum of revenue).
Window functions must always contain an OVER clause. The OVER clause splits up the rows of the query for processing by the window function.
The PARTITION BY list divides the rows into groups that share the same values of the PARTITION BY expression(s). For each row, the window function is computed using all rows in the same partition as the current row.
Rows that have the same value in the ORDER BY clause are considered peers. The ranking functions give the same answer for any two peer rows.
HeavyDB supports the aggregate functions AVG
, MIN
, MAX
, SUM
, and COUNT
in window functions.
Updates on window functions are supported, assuming the target table is single-fragment. Updates on multi-fragment target tables are not currently supported.
This query shows the top airline carrier for each state, based on the number of departures.
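A sketch of that query, assuming a hypothetical flights table with origin_state and carrier_name columns; the top carrier is selected by ranking carriers within each state with ROW_NUMBER:
SELECT origin_state, carrier_name, departures
FROM (
  SELECT origin_state,
         carrier_name,
         COUNT(*) AS departures,
         ROW_NUMBER() OVER (PARTITION BY origin_state
                            ORDER BY COUNT(*) DESC) AS carrier_rank
  FROM flights
  GROUP BY origin_state, carrier_name
) t
WHERE carrier_rank = 1;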
A window function can include a frame clause that specifies a set of neighboring rows of the current row belonging to the same partition. This allows us to compute a window aggregate function over the window frame, instead of computing it against the entire partition. Note that a window frame for the current row is computed based on either 1) the number of rows before or after the current row (called rows mode) or 2) the specified ordering column value in the frame clause (called range mode).
For example:
From the starting row of the partition to the current row: Using the sum
aggregate function, you can compute the running sum of the partition.
You can construct a frame based on the position of the rows (called rows mode): for example, a frame that spans the 3 rows before and the 2 rows after the current row (see the sketch following this list):
You can compute the aggregate function of the frame having up to six rows (including the current row).
You can organize a frame based on the value of the ordering column (called range mode): Assuming C as the current ordering column value, we can compute aggregate value of the window frame which contains rows having ordering column values between (C - 3) and (C + 2).
Window functions that ignore the frame are evaluated on the entire partition.
Note that we can define the window frame clause using rows mode with an ordering column.
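For example, a sketch of a rows-mode frame covering the 3 rows before and 2 rows after the current row; the transactions table and its columns are hypothetical:
SELECT account_id, ts, amount,
       SUM(amount) OVER (PARTITION BY account_id
                         ORDER BY ts
                         ROWS BETWEEN 3 PRECEDING AND 2 FOLLOWING) AS rolling_sum
FROM transactions;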
You can use the following aggregate functions with the window frame clause.
<frame_mode>
| <frame_bound>
<frame_mode>
can be one of the following:
rows
range
1 | 2 | 3 | 4 | 5.5 | 7.5 | 8 | 9 | 10 → value of each tuple's ORDER BY expression.
When the current row has a value 5.5:
ROWS BETWEEN 3 PRECEDING AND 3 FOLLOWING: 3 rows before and 3 rows after → { 2, 3, 4, 5.5, 7.5, 8, 9 }
RANGE BETWEEN 3 PRECEDING AND 3 FOLLOWING: 5.5 - 3 <= x <= 5.5 + 3 → { 3, 4, 5.5, 7.5, 8 }
<frame_bound>:
frame_start or
frame_between: between frame_start and frame_end
frame_start and frame_end can be one of the following:
UNBOUNDED PRECEDING: The start row of the partition that the current row belongs to.
UNBOUNDED FOLLOWING: The end row of the partition that the current row belongs to.
CURRENT ROW
For rows mode: the current row.
For range mode: the peers of the current row. A peer is a row having the same value as the ordering column expression of the current row. Note that all null values are peers of each other.
expr PRECEDING
For rows mode: expr row before the current row.
For range mode: rows with the current row’s ordering expression value minus expr.
For DATE, TIME, and TIMESTAMP: Use the INTERVAL keyword with a specific time unit, depending on a data type:
TIMESTAMP type: NANOSECOND, MICROSECOND, MILLISECOND, SECOND, MINUTE, HOUR, DAY, MONTH, and YEAR
TIME type: SECOND, MINUTE, and HOUR
DATE type: DAY, MONTH, and YEAR
For example:
RANGE BETWEEN INTERVAL 1 DAY PRECEDING and INTERVAL 3 DAY FOLLOWING
Currently, only literal expressions as expr such as 1 PRECEDING and 100 PRECEDING are supported.
expr FOLLOWING
For rows mode: expr row after the current row.
For range mode: rows with the current row’s ordering expression value plus expr.
For DATE, TIME, and TIMESTAMP: Use the INTERVAL keyword with a specific time unit, depending on a data type:
TIMESTAMP type: NANOSECOND, MICROSECOND, MILLISECOND, SECOND, MINUTE, HOUR, DAY, MONTH, and YEAR
TIME type: SECOND, MINUTE, and HOUR
DATE type: DAY, MONTH, and YEAR
For example:
RANGE BETWEEN INTERVAL 1 DAY PRECEDING and INTERVAL 3 DAY FOLLOWING
Currently, only literal expressions as expr, such as 1 FOLLOWING and 100 FOLLOWING, are supported.
UNBOUNDED PRECEDING and UNBOUNDED FOLLOWING have the same meaning in both rows and range mode.
When the query has no window frame bound, the window aggregate function is computed differently depending on the existence of the ORDER BY clause:
Has ORDER BY clause: The window function is computed with the default frame bound, which is RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW.
No ORDER BY clause: The window function is computed over the entire partition.
You can refer to the same window clause in multiple window aggregate functions by defining it with a unique name in the query definition.
For example, you can define the named window clauses W1 and W2 as follows:
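A sketch of such a query, assuming a hypothetical table t with columns x, c, and o; w1 is defined without a frame clause and w2 with one, matching the description that follows:
SELECT x,
       SUM(x) OVER w1 AS partition_sum,
       AVG(x) OVER w2 AS frame_avg
FROM t
WINDOW w1 AS (PARTITION BY c ORDER BY o),
       w2 AS (PARTITION BY c ORDER BY o ROWS BETWEEN 2 PRECEDING AND 2 FOLLOWING);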
Named window function clause w1
refers to a window function clause without a window frame clause, and w2
refers to a named window frame clause.
To use window framing, you may need an ORDER BY clause in the window definition. Depending on the framing mode used, the constraint varies:
Rows mode: No ordering column is required, and multiple ordering columns can be included.
Range mode: Exactly one ordering column is required (multi-column ordering is not supported).
Currently, all window functions, including aggregation over a window frame, are computed in CPU mode.
For window frame bound expressions, only non-negative integer literals are supported.
GROUPING mode and EXCLUDING are not currently supported.
Process a raster input to derive contour lines or regions and output as LINESTRING or POLYGON for rendering or further processing. Each has two variants:
One that re-rasterizes the input points ()
One that accepts raw raster points directly ()
Use the rasterizing variants if the raster table rows are not already sorted in row-major order (for example, if they represent an arbitrary 2D point cloud), or if filtering or binning is required to reduce the input data to a manageable count (to speed up the contour processing) or to smooth the input data before contour processing. If the input rows do not already form a rectilinear region, the output region will be their 2D bounding box. Many of the parameters of the rasterizing variant are directly equivalent to those of ; see that function for details.
The direct variants require that the input rows represent a rectilinear region of pixels in nonsparse row-major order. The dimensions must also be provided, and (raster_width * raster_height) must match the input row count. The contour processing is then performed directly on the raster values with no preprocessing.
The line variants generate LINESTRING geometries that represent the contour lines of the raster space at the given interval, with the optional given offset. For example, a raster space representing a height field with a range of 0.0 to 1000.0 will likely result in 10 or 11 lines, each with a corresponding contour_values value (0.0, 100.0, 200.0, and so on). If contour_offset is set to 50.0, the lines are generated at 50.0, 150.0, 250.0, and so on. The lines can be open or closed and can form rings or terminate at the edges of the raster space.
The polygon variants generate POLYGON geometries that represent regions between contour lines (for example, from 0.0 to 100.0, and from 100.0 to 200.0). If the raster space has multiple regions with that value range, then a POLYGON row is output for each of those regions. The corresponding contour_values value for each is the lower bound of the range for that region.
Heavy Immerse is a browser-based data visualization client that runs on top of the GPU-powered HeavyDB. It provides instantaneous representations of your data, from basic charts to rich and complex visualizations.
Immerse is installed with HEAVY.AI Enterprise Edition.
To create dashboards and data visualizations, click DASHBOARDS. You can search for dashboards, and list them by most recent or alphabetically.
Click DATA to import and manipulate data.
Click SQL EDITOR to perform Data Definition and Data Manipulation tasks on the command line.
When you navigate between the three utilities, you can:
Hold the command (ctrl) key as you click a link to open the utility in a new tab/window in the background.
Hold shift+command (ctrl) as you click a link to open the utility in a new tab/window in the foreground.
Hold no keys as you click a link to replace the contents of the current window.
HELP CENTER provides access to Immerse version information, tutorials, demos, and documentation. It also includes a link for sending email to HEAVY.AI.
Clicking the user icon at the far right opens a drop-down box where you can select a different database, change your UI theme, or log out of Immerse:
The Control Panel gives super users visibility into roles and users of the current database, as well as feature flags, system table dashboards, and log files for the current HeavyDB instance.
To open the Control Panel, click the Account icon and then click Control Panel.
The Control Panel is considered beta functionality. Currently, you cannot add, delete, or edit roles or users in the Control Panel. Feature flags cannot be modified through the Control Panel.
Only super users have access to the Control Panel.
To see which feature flags are currently set in Immerse, click Feature Flags under Customization.
Currently, feature flags can only be viewed in Immerse; they cannot be set or removed.
Links to the following System Table dashboards are available on the Control Panel:
Links to the following log files are available on the Control Panel:
Connect to Heavy Immerse by pointing a web browser to port 6273 on your HeavyDB server. When you launch Immerse, the landing page shows a list of saved dashboards. The number to the right of the Search box shows the number of dashboards displayed (left side of the slash) and the number of total dashboards (right side of the slash). Because no filters are applied here, these numbers are the same (5446).
You can:
Search for dashboards by name, source, or owner by entering a string in the search box.
Sort dashboards by name, source, modified date, owner, and whether the dashboard is shared. By default, dashboards are sorted by last modified. Click a column heading to sort by the information defined by that column, and click again to toggle the sorting order.
Filter dashboards by name, source, last modified date, owner, and shared status.
Create and save dashboard views that show a particular set of dashboards based on filter criteria you set.
Delete, download (export), and duplicate individual dashboards.
Perform bulk actions on selected dashboards.
You can use filters to define the dashboards that are displayed, and then save the view defined by any filters you apply. To create a filter, click the plus icon (+) to the left of the Search box.
Use one or more filters to define the view. For example, the following view shows dashboards with source flights and owner mapd:
You can then sort that filtered view the same way you would an unfiltered view.
To save the view, in the View name box, click the pencil icon and enter the name of the view, and then click the Save icon. Here, the view will be saved as flights - mapd.
Click the down arrow in the View name box to see the list of available filters. You can also toggle the selected filter on and off.
You can change the filters in a specified view and update it. For example, here the flights - mapd view has been updated to include only dashboards modified in 2019, so the Save icon is highlighted; click it to update the view definition.
Or, you can click the down arrow and then click + Add filter view to create a new filter view based on the updated filters. You can also duplicate or remove filter views.
A filter view is available only to the Immerse user who creates the view. If you log out of Immerse and then restart, you start with the view that you were using when you logged out.
You can select dashboards and perform the following actions. Select dashboards by clicking the box to the left of the dashboard name.
Export - Export (download) all selected dashboards as individual .json files.
Share/Unshare - Share or unshare selected dashboards with specific users or roles. If a user has been assigned the restricted_sharing role, sharing dashboards is unavailable.
Delete - Delete selected dashboards.
Getting Started with HEAVY.AI on Google Cloud Platform
Follow these instructions to get started with HEAVY.AI on Google Cloud Platform (GCP).
You must have a Google Cloud Platform account. If you do not have an account, follow to sign up for one.
To launch HEAVY.AI on Google Cloud Platform, you select and configure an instance.
On the solution Launcher Page, click Launch on Compute Engine to begin configuring your deployment.
Before deploying a solution with a GPU machine type, avoid potential deployment failure by checking your quotas to make sure that you have not exceeded your limit.
To launch HEAVY.AI on Google Cloud Platform, you select and configure a GPU-enabled instance.
Search for HEAVY.AI on the Google Cloud Marketplace, and select a solution. HEAVY.AI has four instance types:
.
.
.
.
On the solution Launcher Page, click Launch to begin configuring your deployment.
On the new deployment page, configure the following:
Deployment name
Zone
Machine type - Click Customize and configure Cores and Memory, and select Extend memory if necessary.
GPU type. (Not applicable for CPU configurations.)
Number of GPUs - (Not applicable for CPU configurations.) Select the number of GPUs; subject to quota and GPU type by region. For more information about GPU-equipped instances and associated resources, see .
Boot disk type
Boot disk size in GB
Networking - Set the Network, Subnetwork, and External IP.
Firewall - Select the required ports to allow TCP-based connectivity to HEAVY.AI. Click More to set IP ranges for port traffic and IP forwarding.
Accept the GCP Marketplace Terms of Service and click Deploy.
In the Deployment Manager, click the instance that you deployed.
Launch the Heavy Immerse client:
Record the Admin password (Temporary).
Click the Site address link to go to the Heavy Immerse login page. Enter the password you recorded, and click Connect.
Copy your license key from the registration email message. If you have not received your license key, contact your Sales Representative or register for your 30-day trial .
Connect to Immerse using a web browser connected to your host machine on port 6273. For example, http://heavyai.mycompany.com:6273
.
When prompted, paste your license key in the text box and click Apply.
Click Connect to start using HEAVY.AI.
On successful login, you see a list of sample dashboards loaded into your instance.
When installing a distributed cluster, you must run initdb --skip-geo
to avoid the automatic creation of the sample geospatial data table. Otherwise, metadata across the cluster falls out of synchronization and can put the server in an unusable state.
HEAVY.AI supports distributed configuration, which allows single queries to span more than one physical host when the scale of the data is too large to fit on a single machine.
In addition to increased capacity, distributed configuration has other advantages:
Writes to the database can be distributed across the nodes, thereby speeding up import.
Reads from disk are accelerated.
Additional GPUs in a distributed cluster can significantly increase read performance in many usage scenarios. Performance scales linearly, or near linearly, with the number of GPUs, for simple queries requiring little communication between servers.
Multiple GPUs across the cluster query data on their local hosts. This allows processing of larger datasets, distributed across multiple servers.
A HEAVY.AI distributed database consists of three components:
An aggregator, which is a specialized HeavyDB instance for managing the cluster
One or more leaf nodes, each being a complete HeavyDB instance for storing and querying data
A String Dictionary Server, which is a centralized repository for all dictionary-encoded items
Conceptually, a HEAVY.AI distributed database is horizontally sharded across n leaf nodes. Each leaf node holds one nth of the total dataset. Sharding currently is round-robin only. Queries and responses are orchestrated by a HEAVY.AI Aggregator server.
Clients interact with the aggregator. The aggregator orchestrates execution of a query across the appropriate leaf nodes. The aggregator composes the steps of the query execution plan to send to each leaf node, and manages their results. The full query execution might require multiple iterations between the aggregator and leaf nodes before returning a result to the client.
A core feature of the HeavyDB is back-end, GPU-based rendering for data-rich charts such as point maps. When running as a distributed cluster, the backend rendering is distributed across all leaf nodes, and the aggregator composes the final image.
The String Dictionary Server manages and allocates IDs for dictionary-encoded fields, ensuring that these IDs are consistent across the entire cluster.
The server creates a new ID for each new encoded value. For queries returning results from encoded fields, the IDs are automatically converted to the original values by the aggregator. Leaf nodes use the string dictionary for processing joins on encoded columns.
For moderately sized configurations, the String Dictionary Server can share a host with a leaf node. For larger clusters, this service can be configured to run on a small, separate CPU-only server.
A table is split by default to 1/nth of the complete dataset. When you create a table used to provide dimension information, you can improve performance by replicating its contents onto every leaf node using the partitions property. For example:
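A sketch of a small dimension table created with the replicated partitions property; the column definitions are illustrative:
CREATE TABLE airports (
  airport_code TEXT ENCODING DICT(32),
  airport_name TEXT ENCODING DICT(32),
  state TEXT ENCODING DICT(32))
WITH (PARTITIONS = 'REPLICATED');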
This reduces the distribution overhead during query execution in cases where sharding is not possible or appropriate. This is most useful for relatively small, heavily used dimension tables.
You can load data to a HEAVY.AI distributed cluster using a COPY FROM statement to load data to the aggregator, exactly as with HEAVY.AI single-node processing. The aggregator distributes data evenly across the leaf nodes.
Records transferred between systems in a HEAVY.AI cluster are compressed to improve performance. HEAVY.AI uses the LZ4_HC compressor by default. It is the fastest compressor, but has the lowest compression rate of the available algorithms. The time required to compress each buffer is directly proportional to the final compressed size of the data. A better compression rate will likely require more time to process.
You can specify another compressor on server startup using the runtime flag compressor
. Compressor choices include:
blosclz
lz4
lz4hc
snappy
zlib
zstd
For more information on the compressors used with HEAVY.AI, see also:
HEAVY.AI does not compress the payload until it reaches a certain size. The default size limit is 512MB. You can change the size using the runtime flag compression-limit-bytes.
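For example, a sketch of the corresponding heavy.conf entries; the values are placeholders, and the flags can equally be passed on the server command line:
compressor = "zstd"
compression-limit-bytes = 268435456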
This example uses four GPU-based machines, each with a combination of one or more CPUs and GPUs.
Install HEAVY.AI server on each node. For larger deployments, you can have the install on a shared drive.
Set up the configuration file for the entire cluster. This file is the same for all nodes.
In the cluster.conf
file, the location of each leaf node is identified as well as the location of the String Dictionary server.
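A minimal cluster.conf sketch for the four-node layout in this example; hostnames, ports, and the string-server placement are placeholders:
[
  {"host": "node1", "port": 16274, "role": "dbleaf"},
  {"host": "node2", "port": 16274, "role": "dbleaf"},
  {"host": "node3", "port": 16274, "role": "dbleaf"},
  {"host": "node4", "port": 16274, "role": "dbleaf"},
  {"host": "node2", "port": 10301, "role": "string"}
]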
Here, dbleaf is a leaf node, and string is the String Dictionary Server. The port each node is listening on is also identified. These ports must match the ports configured on the individual server.
Each leaf node requires a heavy.conf
configuration file.
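A minimal heavy.conf sketch for a leaf node; the port, storage path, and cluster.conf location are placeholders:
port = 16274
data = "/var/lib/heavyai/storage"
string-servers = "/etc/heavyai/cluster.conf"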
The parameter string-servers
identifies the file containing the cluster configuration, to tell the leaf node where the String Dictionary Server is.
The aggregator node requires a slightly different heavy.conf
. The file is named heavy-agg.conf
in this example.
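A minimal heavy-agg.conf sketch for the aggregator; the port, storage path, and cluster.conf location are placeholders:
port = 6274
data = "/var/lib/heavyai/storage_agg"
cluster = "/etc/heavyai/cluster.conf"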
The parameter cluster tells the HeavyDB instance that it is an aggregator node, and where to find the rest of its cluster.
If your aggregator node is sharing a machine with a leaf node, there might be a conflict on the calcite-port
. Consider changing the port number of the aggregator node to another that is not in use.
Contact HEAVY.AI support for assistance with HEAVY.AI Distributed Cluster implementation.
If there is a potential for duplicate entries and you want to avoid loading duplicate rows, see on the Troubleshooting page.
You can use Heavy Immerse to import geospatial data into HeavyDB.
Supported formats include:
Keyhole Markup Language (.kml
)
GeoJSON (.geojson
)
Shapefiles (.shp
)
FlatGeobuf (.fgb
)
Shapefiles include four mandatory files: .shp
, .shx
, .dbf
, and .prj
. If you do not import the .prj
file, the coordinate system will be incorrect and you cannot render the shapes on a map.
To import geospatial definition data:
Open Heavy Immerse.
Click Data Manager.
Click Import Data.
Choose whether to import from a local file or an Amazon S3 instance. For details on importing from Amazon S3, see .
Click the large +
icon to select files for upload, or drag and drop the files to the Data Importer screen.
When importing shapefiles, upload all required file types at the same time. If you upload them separately, Heavy Immerse issues an error message.
Wait for the uploads to complete (indicated by green checkmarks on the file icons), then click Preview.
On the Data Preview screen:
Edit the column headers (if needed).
Enter a name for the table in the field at the bottom of the screen.
If you are loading the data files into a distributed system, verify under Import Settings that the Replicate Table checkbox is selected.
Click Import Data.
On the Successfully Imported Table screen, verify the rows and columns that compose your data table.
When representing longitude and latitude in HEAVY.AI geospatial primitives, the first coordinate is assumed to be longitude by default.
You can use heavysql
to define tables with columns that store WKT geospatial objects.
You can use heavysql
to insert data as WKT string values.
You can insert data from CSV/TSV files containing WKT strings. HEAVY.AI supports Latin-1 ASCII format and UTF-8. If you want to load data with another encoding (for example, UTF-16), convert the data to UTF-8 before loading it to HEAVY.AI.
You can use your own custom delimiter in your data files.
You can import CSV and TSV files for tables that store longitude and latitude as either:
Separate consecutive scalar columns
A POINT field.
If the data is stored as a POINT, you can use spatial functions like ST_Distance
and ST_Contains
. When location data are stored as a POINT column, they are displayed as such when querying the table:
If two geometries are used in one operation (for example, in ST_Distance
), the SRID values need to match.
If you are using heavysql, create the table in HEAVY.AI with the POINT field defined as below:
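For example, a sketch of such a table; the surrounding attribute columns are illustrative:
CREATE TABLE destinations (
  name TEXT ENCODING DICT(32),
  pt GEOMETRY(POINT, 4326),
  visits INTEGER);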
Then, import the file using COPY FROM
in heavysql. By default, the two columns are consumed as longitude x and then latitude y. If the order of the coordinates in the CSV file is reversed, load the data using the WITH option lonlat='false':
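For example, using the hypothetical destinations table above; the file paths are placeholders:
COPY destinations FROM '/data/destinations_lonlat.csv';
COPY destinations FROM '/data/destinations_latlon.csv' WITH (lonlat = 'false');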
Columns can exist on either side of the point field; the lon/lat pair in the source file does not have to be at the beginning or end of the target table.
If the imported coordinates are not 4326---for example, 2263---you can transform them to 4326 on the fly:
In Immerse, you define the table when loading the data instead of predefining it before import. Immerse supports appending data to a table by loading one or more files.
Longitude and latitude can be imported as separate columns.
You can create geo tables by importing specific geo file formats. HEAVY.AI supports the following types:
ESRI shapefile (.shp
and associated files)
GeoJSON (.geojson
or .json
)
KML (.kml
or .kmz
)
ESRI file geodatabase (.gdb
)
You import geo files using the COPY FROM command with the geo option:
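For example, a sketch that imports a hypothetical shapefile:
COPY us_states FROM '/data/us_states.shp' WITH (geo = 'true');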
The geo file import process automatically creates the table by detecting the column names and types explicitly described in the geo file header. It then creates a single geo column (always called heavyai_geo) that is of one of the supported types (POINT, MULTIPOINT, LINESTRING, MULTILINESTRING, POLYGON, or MULTIPOLYGON).
In Release 6.2 and higher, polygon render metadata assignment is disabled by default. This data is no longer required by the new polygon rendering algorithm introduced in Release 6.0. The new default results in significantly faster import for polygon table imports, particularly high-cardinality tables.
If you need to revert to the legacy polygon rendering algorithm, polygons from tables imported in Release 6.2 may not render correctly. Those tables must be re-imported after setting the server configuration flag enable-assign-render-groups
to true
.
The legacy polygon rendering algorithm and polygon render metadata server config will be removed completely in an upcoming release.
Due to the prevalence of mixed POLYGON/MULTIPOLYGON geo files (and CSVs), if HEAVY.AI detects a POLYGON type geo file, HEAVY.AI creates a MULTIPOLYGON column and imports the data as single polygons.
If the table does not already exist, it is created automatically.
If the table already exists, and the data in the geo file has exactly the same column structure, the new file is appended to the existing table. This enables import of large geo data sets split across multiple files. The new file is rejected if it does not have the same column structure.
By default, geo data is stored as GEOMETRY.
You can also create tables with coordinates in SRID 3857 or SRID 900913 (Google Web Mercator). Importing data from shapefiles using SRID 3857 or 900913 is supported; importing data from delimited files into tables with these SRIDs is not supported at this time. To explicitly store in other formats, use the following WITH options in addition to geo='true':
Compression used:
COMPRESSED(32) - 50% compression (default)
None - No compression
Spatial reference identifier (SRID) type:
4326 - EPSG:4326 (default)
900913 - Google Web Mercator
3857 - EPSG:3857
For example, the following explicitly sets the default values for encoding and SRID:
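A sketch of such a command; the geo_coords_encoding and geo_coords_srid option names are assumptions, and the path is a placeholder:
COPY geo_table FROM '/data/geo_data.shp'
WITH (geo = 'true',
      geo_coords_encoding = 'COMPRESSED(32)',
      geo_coords_srid = 4326);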
Note that rendering of geo MULTIPOINT is not yet supported.
An ESRI file geodatabase (.gdb
) provides a method of storing GIS information in one large file that can have one or more "layers", with each layer containing disparate but related data. The data in each layer can be of different types. Importing a .gdb
file results in the creation of one table for each layer in the file. You import an ESRI file geodatabase the same way that you import other geo file formats, using the COPY FROM command with the geo option:
The layers in the file are scanned and defined by name and contents. Contents are classified as EMPTY, GEO, NON_GEO, or UNSUPPORTED_GEO:
EMPTY layers are skipped because they contain no useful data.
GEO layers contain one or more geo columns of a supported type (POINT, MULTIPOINT, LINESTRING, MULTILINESTRING, POLYGON, MULTIPOLYGON) and one or more regular columns, and can be imported to a single table in the same way as the other geo file formats.
NON_GEO layers contain no geo columns and one or more regular columns, and can be imported to a regular table. Although the data comes from a geo file, data in this layer does not result in a geo table.
UNSUPPORTED_GEO layers contain geo columns of a type not currently supported (for example, GEOMETRYCOLLECTION). These layers are skipped because they cannot be imported completely.
A single COPY FROM
command can result in multiple tables, one for each layer in the file. The table names are automatically generated by appending the layer name to the provided table name.
For example, consider the geodatabase file mydata.gdb, which contains two importable layers with names A and B. Running COPY FROM creates two tables, mydata_A and mydata_B, with the data from layers A and B, respectively. The layer names are appended to the provided table name. If the geodatabase file only contains one layer, the layer name is not appended.
You can load one specific layer from the geodatabase file by using the geo_layer_name
option:
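A sketch of such a command, continuing the mydata.gdb example above; the file path is a placeholder:
COPY mydata FROM '/data/mydata.gdb' WITH (geo = 'true', geo_layer_name = 'A');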
This loads only layer A, if it is importable. The resulting table is called mydata
, and the layer name is not appended. Use this import method if you want to set a different name for each table. If the layer name from the geodatabase file would result in an illegal table name when appended, the name is sanitized by removing any illegal characters.
You can import geo files directly from archive files (for example, .zip .tar .tgz .tar.gz) without unpacking the archive. You can directly import individual geo files compressed with Zip or GZip (GeoJSON and KML only). The server opens the archive header and loads the first candidate file it finds (.shp .geojson .json .kml), along with any associated files (in the case of an ESRI Shapefile, the associated files must be siblings of the first).
You can import geo files or archives directly from an Amazon S3 bucket.
You can provide Amazon S3 credentials, if required, by setting variables in the environment of the heavysql
process…
You can also provide your credentials explicitly in the COPY FROM command.
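A sketch of an explicit-credentials import; the s3_region, s3_access_key, and s3_secret_key option names are assumptions, and the bucket and path are placeholders:
COPY rivers FROM 's3://my-geo-bucket/rivers.geojson'
WITH (geo = 'true',
      s3_region = 'us-west-1',
      s3_access_key = '<access key>',
      s3_secret_key = '<secret key>');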
You can import geo files or archives directly from an HTTP/HTTPS website.
You can extend a column type specification to include spatial reference (SRID) and compression mode information.
Geospatial objects declared with SRID 4326 are compressed 50% by default with ENCODING COMPRESSED(32)
. In the following definition of table geo2, the columns poly2 and mpoly2 are compressed.
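A sketch of that definition; the id column and the explicitly uncompressed poly_uncompressed column are illustrative additions:
CREATE TABLE geo2 (
  id INTEGER,
  poly2 GEOMETRY(POLYGON, 4326),
  mpoly2 GEOMETRY(MULTIPOLYGON, 4326),
  poly_uncompressed GEOMETRY(POLYGON, 4326) ENCODING NONE);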
COMPRESSED(32)
compression maps lon/lat degree ranges to 32-bit integers, providing a smaller memory footprint and faster query execution. The effect on precision is small, approximately 4 inches at the equator.
You can disable compression by explicitly choosing ENCODING NONE
.
Boolean constant denoting whether weighting should be used in the cosine similarity score computation.
Boolean constant denoting whether weighting should be used in the cosine similarity score computation.
From the : "The scan direction flag denotes the direction at which the scanner mirror was traveling at the time of the output pulse. A bit value of 1 is a positive scan direction, and a bit value of 0 is a negative scan direction."
From the : "The edge of flight line data bit has a value of 1 only when the point is at the end of a scan. It is the last point on a given scan line before it changes direction."
From the : "The classification field is a number to signify a given classification during filter processing. The ASPRS standard has a public list of classifications which shall be used when mixing vendor specific user software."
From the : "The angle at which the laser point was output from the laser system, including the roll of the aircraft... The scan angle is an angle based on 0 degrees being NADIR, and –90 degrees to the left side of the aircraft in the direction of flight."
Super users can restrict users who have the immerse_trial_mode role from downloading (exporting) dashboards by enabling Trial Mode. To enable trial mode, set the to TRUE.
You can import spatial representations in WKT (Well-Known Text) format. WKT is a text markup language for representing vector geometry objects on a map, spatial reference systems of spatial objects, and transformations between spatial reference systems.
HEAVY.AI accepts data with any SRID, or with no SRID. HEAVY.AI supports SRID 4326 (WGS 84), and allows projections from SRID 4326 to SRID 900913 (Google Web Mercator). Geometries declared with SRID 4326 are compressed by default, and can be rendered and used to calculate geodesic distance. Geometries declared with any other SRID, or no SRID, are treated as planar geometries; the SRIDs are ignored.
An ESRI file geodatabase can have multiple layers, and importing it results in the creation of one table for each layer in the file. This behavior differs from that of importing shapefiles, GeoJSON, or KML files, which results in a single table. See for more information.
Rendering of geo LINESTRING, MULTILINESTRING, POLYGON, and MULTIPOLYGON is possible only with data stored in the default lon/lat WGS84 (SRID 4326) format, although the type and encoding are flexible. Unless compression is explicitly disabled (NONE), all SRID 4326 geometries are compressed. For more information, see.
Function
Description
BACKWARD_FILL(value)
Replace the null value by using the nearest non-null value of the value column, using backward search.
For example, for column x, with the current row r at index K having a NULL value, and assuming column x has N rows (where K < N): BACKWARD_FILL(x) searches for the non-NULL value by searching rows with the index starting from K+1 to N. The NULL value is replaced with the first non-NULL value found.
At least one ordering column must be defined in the window clause.
NULLS FIRST ordering of the input value is added automatically for any user-defined ordering of the input value. For example:
BACKWARD_FILL(x) OVER (PARTITION BY c ORDER BY x) - No ordering is added; ordering already exists on x.
BACKWARD_FILL(x) OVER (PARTITION BY c ORDER BY o) - Ordering is added internally for a consistent query result.
COUNT_IF(condition_expr)
Aggregate function that can be used as a window function for both a nonframed window partition and a window frame. Returns the number of rows satisfying the given condition_expr, which must evaluate to a Boolean value (TRUE/FALSE) like x IS NULL or x > 1.
CUME_DIST()
Cumulative distribution value of the current row: (number of rows preceding or peers of the current row)/(total rows). Window framing is ignored.
DENSE_RANK()
Rank of the current row without gaps. This function counts peer groups. Window framing is ignored.
FIRST_VALUE(value)
Returns the value from the first row of the window frame (the rows from the start of the partition to the last peer of the current row).
FORWARD_FILL(value)
Replace the null value by using the nearest non-null value of the value column, using forward search.
For example, for column x, with the current row r at index K having a NULL value, and assuming column x has N rows (where K < N): FORWARD_FILL(x) searches for the non-NULL value by searching rows with the index starting from K-1 to 1. The NULL value is replaced with the first non-NULL value found.
At least one ordering column must be defined in the window clause.
NULLS FIRST ordering of the input value is added automatically for any user-defined ordering of the input value. For example:
FORWARD_FILL(x) OVER (PARTITION BY c ORDER BY x) - No ordering is added; ordering already exists on x.
FORWARD_FILL(x) OVER (PARTITION BY c ORDER BY o) - Ordering is added internally for a consistent query result.
LAG(value, offset)
Returns the value at the row that is offset rows before the current row within the partition. LAG_IN_FRAME
is the window-frame-aware version.
LAST_VALUE(value)
Returns the value from the last row of the window frame.
LEAD(value, offset)
Returns the value at the row that is offset rows after the current row within the partition. LEAD_IN_FRAME
is the window-frame-aware version.
NTH_VALUE(expr,N)
Returns a value of expr
at row N
of the window partition.
NTILE(num_buckets)
Subdivide the partition into buckets. If the total number of rows is divisible by num_buckets, each bucket has an equal number of rows. If the total is not divisible by num_buckets, the function returns groups of two sizes with a difference of 1. Window framing is ignored.
PERCENT_RANK()
Relative rank of the current row: (rank-1)/(total rows-1). Window framing is ignored.
RANK()
Rank of the current row with gaps. Equal to the row_number
of its first peer.
ROW_NUMBER()
Number of the current row within the partition, counting from 1. Window framing is ignored.
SUM_IF(condition_expr)
Aggregate function that can be used as a window function for both a nonframed window partition and a window frame. Returns the sum of all expression values satisfying the given condition_expr
. Applies to numeric data types.
Frame aggregation
MIN(val)
, MAX(val)
, COUNT(val)
, AVG(val)
, SUM(val)
Frame navigation
LEAD_IN_FRAME(value, offset)
LAG_IN_FRAME(value, offset)
FIRST_VALUE_IN_FRAME
LAST_VALUE_IN_FRAME
NTH_VALUE_IN_FRAME
These are window-frame-aware versions of the LEAD, LAG, FIRST_VALUE, LAST_VALUE, and NTH_VALUE functions.
Function
Arguments and Return
convert_meters_to_merc_pixel_width(meters, lon, lat, min_lon, max_lon, img_width, min_width)
Converts a distance in meters in a longitudinal direction from a latitude/longitude coordinate to a pixel size using mercator projection:
meters
: Distance in meters in a longitudinal direction to convert to pixel units.
lon
: Longitude coordinate of the center point to size from.
lat
: Latitude coordinate of the center point to size from.
min_lon
: Minimum longitude coordinate of the mercator-projected view.
max_lon
: Maximum longitude coordinate of the mercator-projected view.
img_width
: The width in pixels of the view.
min_width
: Clamps the returned pixel size to be at least this width.
Returns: Floating-point value in pixel units. Can be used for the width of a symbol or a point in Vega.
convert_meters_to_merc_pixel_height(meters, lon, lat, min_lat, max_lat, img_height, min_height)
Converts a distance in meters in a latitudinal direction from a latitude/longitude coordinate to a pixel size, using mercator projection:
meters
: Distance in meters in a longitudinal direction to convert to pixel units.
lon
: Longitude coordinate of the center point to size from.
lat
: Latitude coordinate of the center point to size from.
min_lat
: Minimum latitude coordinate of the mercator-projected view.
max_lat
: Maximum latitude coordinate of the mercator-projected view.
img_height
: The height in pixels of the view.
min_height
: Clamps the returned pixel size to be at least this height.
Returns: Floating-point value in pixel units. Can be used for the height of a symbol or a point in Vega.
convert_meters_to_pixel_width(meters, pt, min_lon, max_lon, img_width, min_width)
Converts a distance in meters in a longitudinal direction from a latitude/longitude POINT to a pixel size. Supports only mercator-projected points.
meters
: Distance in meters in a latitudinal direction to convert to pixel units.
pt
: The center POINT to size from. The point must be defined in the EPSG:4326 spatial reference system.
min_lon
: Minimum longitude coordinate of the mercator-projected view.
max_lon
: Maximum longitude coordinate of the mercator-projected view.
img_width
: The width in pixels of the view.
min_width
: Clamps the returned pixel size to be at least this width.
Returns: Floating-point value in pixel units. Can be used for the width of a symbol or a point in Vega.
convert_meters_to_pixel_height(meters, pt, min_lat, max_lat, img_height, min_height)
Converts a distance in meters in a latitudinal direction from an EPSG:4326 POINT to a pixel size. Currently only supports mercator-projected points:
meters
: Distance in meters in a longitudinal direction to convert to pixel units.
pt
: The center POINT to size from. The point must be defined in the EPSG:4326 spatial reference system.
min_lat
: Minimum latitude coordinate of the mercator-projected view.
max_lat
: Maximum latitude coordinate of the mercator-projected view.
img_height
: The height in pixels of the view.
min_height
: Clamps the returned pixel size to be at least this height.
Returns: Floating-point value in pixel units. Can be used for the height of a symbol or a point in Vega.
is_point_in_merc_view(lon, lat, min_lon, max_lon, min_lat, max_lat)
Returns true if a latitude/longitude coordinate is within a mercator-projected view defined by min_lon/max_lon, min_lat/max_lat.
lon
: Longitude coordinate of the point.
lat
: Latitude coordinate of the point.
min_lon
: Minimum longitude coordinate of the mercator-projected view.
max_lon
: Maximum longitude coordinate of the mercator-projected view.
min_lat
: Minimum latitude coordinate of the mercator-projected view.
max_lat
: Maximum latitude coordinate of the mercator-projected view.
Returns: True if the point is within the view defined by the min_lon/max_lon, min_lat/max_lat; otherwise, false.
is_point_size_in_merc_view(lon, lat, meters, min_lon, max_lon, min_lat, max_lat)
Returns true if a latitude/longitude coordinate, offset by a distance in meters, is within a mercator-projected view defined by min_lon/max_lon, min_lat/max_lat.
lon
: Longitude coordinate of the point.
lat
: Latitude coordinate of the point.
meters
: Distance in meters to offset the point by, in any direction.
min_lon
: Minimum longitude coordinate of the mercator-projected view.
max_lon
: Maximum longitude coordinate of the mercator-projected view.
min_lat
: Minimum latitude coordinate of the mercator-projected view.
max_lat
: Maximum latitude coordinate of the mercator-projected view.
Returns: True if the point is within the view defined by the min_lon/max_lon, min_lat/max_lat; otherwise, false.
is_point_in_view(pt, min_lon, max_lon, min_lat, max_lat)
Returns true if a latitude/longitude POINT defined in EPSG:4326 is within a mercator-projected view defined by min_lon/max_lon, min_lat/max_lat.
pt
: The POINT to check. Must be defined in EPSG:4326 spatial reference system.
min_lon
: Minimum longitude coordinate of the mercator-projected view.
max_lon
: Maximum longitude coordinate of the mercator-projected view.
min_lat
: Minimum latitude coordinate of the mercator-projected view.
max_lat
: Maximum latitude coordinate of the mercator-projected view.
Returns: True if the point is within the view defined by min_lon
/max_lon
, min_lat
/max_lat
; otherwise, false.
is_point_size_in_view(pt, meters, min_lon, max_lon, min_lat, max_lat)
Returns true if a latitude/longitude POINT defined in EPSG:4326, offset by a distance in meters, is within a mercator-projected view defined by min_lon/max_lon, min_lat/max_lat.
pt: The POINT to check. Must be defined in the EPSG:4326 spatial reference system.
meters: Distance in meters to offset the point by, in any direction.
min_lon: Minimum longitude coordinate of the mercator-projected view.
max_lon: Maximum longitude coordinate of the mercator-projected view.
min_lat: Minimum latitude coordinate of the mercator-projected view.
max_lat: Maximum latitude coordinate of the mercator-projected view.
Returns: True if the point, offset by the distance in meters, is within the view defined by min_lon/max_lon, min_lat/max_lat; otherwise, false.
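The POINT-based variants behave like the lon/lat versions. A minimal sketch, again with a hypothetical table (buildings), POINT column (location), and view bounds:

-- Hypothetical table and bounds; location is an EPSG:4326 POINT column.
SELECT location
FROM buildings
WHERE is_point_size_in_view(location, 250.0, -124.0, -66.0, 25.0, 49.0);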
lon: Longitude value of raster point (degrees, SRID 4326). Column<FLOAT | DOUBLE>
lat: Latitude value of raster point (degrees, SRID 4326). Column<FLOAT | DOUBLE> (must be the same type as lon)
value: Raster band value from which to derive contours. Column<FLOAT | DOUBLE>
agg_type: See tf_geo_rasterize.
bin_dim_meters: See tf_geo_rasterize.
neighborhood_fill_radius: See tf_geo_rasterize.
fill_only_nulls: See tf_geo_rasterize.
fill_agg_type: See tf_geo_rasterize.
flip_latitude: Optionally flip resulting geometries in latitude (default FALSE). This parameter may be removed in future releases. BOOLEAN
contour_interval: Desired contour interval. The function will generate a line at each interval, or a polygon region that covers that interval. FLOAT/DOUBLE (must be the same type as value)
contour_offset: Optional offset for resulting intervals. FLOAT/DOUBLE (must be the same type as value)
raster_width: Pixel width (stride) of the raster data. INTEGER
raster_height: Pixel height of the raster data. INTEGER
contour_[lines|polygons]: Output geometries. Column<LINESTRING | POLYGON>
contour_values: Raster values associated with each contour geometry. Column<FLOAT | DOUBLE> (same type as value)
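As a sketch only: the parameters above correspond to the raster contour table functions (for example, a contour-lines variant such as tf_raster_contour_lines), which are invoked through TABLE() with a CURSOR over the lon, lat, and value columns. The table name, column names, and argument values below are hypothetical, the raster_width/raster_height parameters belong to a separate variant for pre-rasterized input and are not used here, and the exact signature should be confirmed in the table function reference before use.

-- Hypothetical raster table (terrain); arguments follow the parameter order listed above.
SELECT contour_lines, contour_values
FROM TABLE(
  tf_raster_contour_lines(
    CURSOR(SELECT lon, lat, elevation FROM terrain),
    'AVG',   -- agg_type
    90.0,    -- bin_dim_meters
    2,       -- neighborhood_fill_radius
    TRUE,    -- fill_only_nulls
    'AVG',   -- fill_agg_type
    FALSE,   -- flip_latitude
    100.0,   -- contour_interval
    0.0      -- contour_offset
  )
);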
Hostname   IP           Role(s)
Node1      10.10.10.1   Leaf, Aggregator
Node2      10.10.10.2   Leaf, String Dictionary Server
Node3      10.10.10.3   Leaf
Node4      10.10.10.4   Leaf
You can construct a Heavy Immerse dashboard by following the steps below. Once you save the dashboard, you can share it with other HEAVY.AI users.
Connect to Immerse by pointing a web browser to port 6273 on your HeavyDB server. When you launch Immerse, the landing page lists your saved dashboards. Click New Dashboard in the upper right to configure a custom dashboard.
To add a chart, click Add Chart, choose a chart type, set dimensions and measures, and then click Apply. For more information on creating charts, see Heavy Immerse Chart Types.
To create a chart:
Click Add Chart.
Choose a Data Source. For example, UFO_Sightings.
Choose a chart type. For example, Bar.
Set the Dimension. For example, country.
Set the Measure. For example, COUNT # Records.
Click Apply.
To remove a chart:
Hover the mouse over the chart.
In the upper-right corner of the chart, click the More Options icon, and then click Remove Chart.
To title and save a dashboard:
Click the title area.
Type a title.
Click Save.
Dashboard tabs enable you to add multiple pages to a dashboard. Using tabs can reduce the number of charts on a dashboard page and make it easier to find the chart you want.
By default, dashboard tabs are disabled. To enable tabs, in your server.json file, set "ui/dashboard-tabs" to "true".
Dashboard tabs are located at the bottom left of the dashboard. The dashboard shown below has three tabs: Config UI (selected tab), Locked axis on scatter, and New Combo improvements:
Click a tab to open it, or use the right arrow icon to move to the next tab. Hovering on a tab reveals the three-dot menu, which you can use to duplicate, rename, or delete a tab.
Using a tabbed dashboard affects some dashboard actions you take. Refresh and Add Chart affect only the tab that you are currently viewing. Share, Duplicate, and Save affect all tabs on the dashboard.
To delete a dashboard:
Click Dashboards.
Mouse over the dashboard you want to delete.
Click X at the end of the dashboard row.
Shows generated Intermediate Representation (IR) code, identifying whether it is executed on GPU or CPU. This is primarily used internally by HEAVY.AI to monitor behavior.
For example, when you use the EXPLAIN command on a basic statement, the utility returns 90 lines of IR code that is not meant to be human readable. However, at the top of the listing, a heading indicates whether it is IR for the CPU or IR for the GPU, which can be useful to know in some situations.
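A minimal sketch of the invocation (the table and filter are hypothetical; the IR output itself is omitted because it is not meant to be read directly):

-- Prefix any SELECT with EXPLAIN to see the generated IR and whether it targets CPU or GPU.
EXPLAIN SELECT COUNT(*) FROM flights WHERE dest_state = 'CA';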
Returns a relational algebra tree describing the high-level plan to execute the statement.
The table below lists the relational algebra classes used to describe the execution plan for a SQL statement.
Method: Description
LogicalAggregate: Operator that eliminates duplicates and computes totals.
LogicalCalc: Expression that computes project expressions and also filters.
LogicalChi: Operator that converts a stream to a relation.
LogicalCorrelate: Operator that performs nested-loop joins.
LogicalDelta: Operator that converts a relation to a stream.
LogicalExchange: Expression that imposes a particular distribution on its input without otherwise changing its content.
LogicalFilter: Expression that iterates over its input and returns elements for which a condition evaluates to true.
LogicalIntersect: Expression that returns the intersection of the rows of its inputs.
LogicalJoin: Expression that combines two relational expressions according to some condition.
LogicalMatch: Expression that represents a MATCH_RECOGNIZE node.
LogicalMinus: Expression that returns the rows of its first input minus any matching rows from its other inputs. Corresponds to the SQL EXCEPT operator.
LogicalProject: Expression that computes a set of ‘select expressions’ from its input relational expression.
LogicalSort: Expression that imposes a particular sort order on its input without otherwise changing its content.
LogicalTableFunctionScan: Expression that calls a table-valued function.
LogicalTableModify: Expression that modifies a table. Similar to TableScan, but represents a request to modify a table instead of read from it.
LogicalTableScan: Reads all the rows from a RelOptTable.
LogicalUnion: Expression that returns the union of the rows of its inputs, optionally eliminating duplicates.
LogicalValues: Expression for which the value is a sequence of zero or more literal row values.
LogicalWindow: Expression representing a set of window aggregates.
For example, a SELECT statement is described as a table scan and projection. If you add a sort order, the table projection is folded under a LogicalSort procedure.
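For instance, a sketch with a hypothetical table and columns (the exact plan text returned by EXPLAIN CALCITE may differ):

-- A simple projection: the plan shows a LogicalProject over a LogicalTableScan.
EXPLAIN CALCITE SELECT name, population FROM cities;

-- Adding ORDER BY folds the projection under a LogicalSort node.
EXPLAIN CALCITE SELECT name, population FROM cities ORDER BY population DESC;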
When the SQL statement is simple, the EXPLAIN CALCITE version is actually less “human readable.” EXPLAIN CALCITE is more useful when you work with more complex SQL statements, like the one that follows. This query performs a scan on the BOOK table before scanning the BOOK_ORDER table.
Revising the original SQL command results in a more natural selection order and a more performant query.
Changes the values of the specified columns based on the assign argument (identifier=expression) in all rows that satisfy the condition in the WHERE clause.
Currently, HEAVY.AI does not support updating a geo column type (POINT, MULTIPOINT, LINESTRING, MULTILINESTRING, POLYGON, or MULTIPOLYGON) in a table.
You can update a table via subquery, which allows you to update based on calculations performed on another table.
Examples
In Release 6.4 and higher, you can run UPDATE queries across tables in different databases on the same HEAVY.AI cluster without having to first connect to those databases.
To execute queries against another database, you must have ACCESS privilege on that database, as well as UPDATE privilege.
Update a row in a table in the my_other_db database:
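A minimal sketch; the table (customers) and its columns are hypothetical, and the table is qualified with the database name as described above:

-- Qualify the table with the database name to update it without switching connections.
UPDATE my_other_db.customers
SET status = 'inactive'
WHERE customer_id = 42;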
Generate a series of integer values.
Input Arguments
series_start: Starting integer value, inclusive. BIGINT
series_end: Ending integer value, inclusive. BIGINT
series_step (optional, defaults to 1): Increment to increase or decrease each subsequent value in the series. BIGINT
Output Columns
generate_series: The integer series specified by the input arguments. Column<BIGINT>
Example
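A minimal sketch (the values are arbitrary):

-- Returns 1, 3, 5, 7, 9 as a single-column result named generate_series.
SELECT * FROM TABLE(generate_series(1, 10, 2));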
Generate a series of timestamp values from start_timestamp to end_timestamp.
Input Arguments
series_start: Starting timestamp value, inclusive. TIMESTAMP(9) (timestamp literals with other precisions are automatically cast to TIMESTAMP(9))
series_end: Ending timestamp value, inclusive. TIMESTAMP(9) (timestamp literals with other precisions are automatically cast to TIMESTAMP(9))
series_step: Time/date interval specifying the step between each element in the returned series. INTERVAL
Output Columns
generate_series: The timestamp series specified by the input arguments. Column<TIMESTAMP(9)>
Example
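A minimal sketch (the values are arbitrary; the literals are automatically cast to TIMESTAMP(9)):

-- Returns one row per hour from midnight through noon on 2021-01-01, inclusive.
SELECT * FROM TABLE(
  generate_series(
    TIMESTAMP '2021-01-01 00:00:00',
    TIMESTAMP '2021-01-01 12:00:00',
    INTERVAL '1' HOUR
  )
);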
The Admin Portal is a collection of dashboards available in the included information_schema database in Heavy Immerse. The dashboards display point-in-time information about HEAVY.AI platform resources and users of the system.
Access to system dashboards is controlled using Immerse privileges; only users with Admin privileges or users/roles with access to the information_schema database can access the system dashboards.
The information_schema database, Admin Portal dashboards, and system tables are installed when you install or upgrade to HEAVY.AI 6.0. For more detailed information on the tables available in the Admin Portal, see System Tables.
With the Admin Portal, you can see:
Database monitoring and database and web server logs.
Real-time data reporting for the system.
Point-in-time resource metrics and user engagement dashboards.
When you log in to the information_schema database, you see the Request Logs and Monitoring, System Resources, and User Roles and Permissions dashboards.
By default, the Request Logs and Monitoring dashboard does not appear in the Admin Portal. To turn on the dashboard, set the enable-logs-system-tables parameter to TRUE in heavy.conf and restart the database.
The Request Logs and Monitoring dashboard includes the following charts on three tabs:
Number of Requests
Number of Fatals and Errors
Number of Unique Users
Avg Request Time (ms)
Max Request Time (ms)
Number of Requests per Dashboard
Number of Requests per API
Number of Requests per User
Database Server Logs - Sortable by log timestamp, severity level, message, file location, process ID, query ID, thread ID, and node.
Database Queries - Sortable by log timestamp, query string, execution time, and total time.
Web Server Logs - Sortable by log timestamp, severity, and message.
Web Server Access Logs - Sortable by log timestamp, endpoint, HTTP status, HTTP method, IP address, and response size.
The System Resources dashboard includes the following charts on three tabs:
Databases - Names of all available databases
# of Tables - Total number of tables
# of Dashboards - Total number of dashboards
# of Tables Per Database
# of Dashboards Per Database
Tables - Sortable name, column count, and owner information for all tables.
Dashboards - Sortable name, last update time, and owner information for all dashboards.
CPU Memory Utilization - Free, used, and unallocated
GPU Memory Utilization - Free, used, and unallocated
Tables with Highest CPU Memory Utilization
Tables with Highest GPU Memory Utilization
Columns with Highest CPU Memory Utilization
Columns with Highest GPU Memory Utilization
Tables with Highest Storage Utilization
Total Used Storage
The User Roles and Permissions dashboard includes the following charts:
# of Users - Total number of users on the system
# of Roles - Total number of roles on the system
# of Table Owners - Total number of table owners on the system
# of Dashboard Owners - Total number of dashboard owners on the system
Users - Sortable list of users on the system
User-Role Assignments - Mapping of role names to user names, sortable by role or user
Roles - Sortable list of roles on the system
Databases - Sortable list of databases on the system
User Permissions - Mapping of user or role name, permission type, and database, sortable by any column.
Following is a list of HEAVY.AI keywords.