1 of 100

v7.2.4

Welcome to HEAVY.AI Documentation

Use of HEAVY.AI is subject to the terms of the HEAVY.AI End User License Agreement (EULA).

What Will I Learn?

For Analysts

Learn how to use Immerse to gain new insights to your data with fast, responsive graphics and SQL queries.

For Administrators

Learn how to Install and configure your HEAVY.AI instance, then load data for analysis.

For Developers and Data Scientists

Learn how to extend HEAVY.AI with an integrated data science foundation and custom charts and interfaces. Contribute to the HEAVY.AI Core Open Source project.

Release Highlights

For more complete release information, see the Release Notes.

Release 7.0

Overview

We are also pleased to announce the general availability of our new backend Executor Resource Manager with CPU / GPU parallelism and query policy controls such as executor type, memory and time limits. We can also now support CPU queries larger than available CPU memory.

This release also features the debut of a user interface for joins in Immerse (beta), supporting inner and left joins which are named and persisted in dashboards. This provides analytic and visualization access to joined columns, complementing the prior table linking function supporting cross-filtering.

Powerful machine learning (beta) and statistical methods (beta) are now available in the database, supporting high performance predictive analytics workflows. For example you can now perform clustering or run linear regression or random forest models on large datasets with interactive inferencing.

Immerse also gains a large set of dashboard refinements, including an optional ‘minimalist’ style with hidden chart titles, and an optional new text chart with full HTML and font controls.

There are several major external dependency updates in this release. With Ubuntu 18 reaching its end of life we now require Ubuntu 20.04. For similar reasons, we now support NVIDIA CUDA version 11.8, which deprecates support for Kepler GPUs. Last but not least, we are formally retiring polygon ‘render groups’ within the database, a change which is not backwards compatible. So full database backups are required as part of this upgrade.

Heavy Immerse

New Features and Improvements

BETA: Joins in Immerse
BETA: Enhanced text chart. The flag `ui/enable_new_text_chart` adds a “text2” chart type, with additional features:
- font family (e.g. arial)
- font sizes, line height
- colors populated from dashboard palette
- html table
- undo/redo
- separator line with styles
- full html support
Added a new “minimal” style mode in which chart titles are hidden by default but appear on rollover. Controlled by feature flag `ui/minimize_chart_size` which defaults to off
Within map chart editor geo layers are now renamable
Role-based access to control panel UX previously requiring admin access.

HeavyML (BETA)

7.0 marks the beta release of HeavyML, a new set of capabilities to execute accelerated machine learning workflows directly from SQL.

General Capabilities and Methods

Named model creation is supported via a new CREATE MODEL statement (see the release notes and documentation for more details)
Row-wise inference (GPU-accelerated for GPU queries) can be performed via a new ML_PREDICT row-wise operator. This can be used as an Immerse custom measure and persisted into dashboards, allowing end-users to consume models without needing to know how to create or administer them.
An EVALUATE model function is provided to test models against metrics (such as r2).
Table functions are provided to access linear regression coefficients for linear regression models and variable importance scores for random forest models.
A new “SHOW MODELS” SQL command allows end users to determine which models are available.
More-detailed model metadata can be accessed by admins with SHOW MODEL DETAILS and in a new ml_models system table in the information_schema database.

Regression Algorithms

Four regression algorithms are supported initially: linear regression, random forest regression, decision trees, and Gradient Boosted Trees (GBT).
Both categorical text and continuous numeric regressors/predictors are supported. Categorical inputs are automatically one-hot-encoded.
Support for continuous variable prediction is initially supported, categorical classification is planned for a later release.

Clustering Algorithms

Two clustering algorithms are supported in this initial release: KMeans and DBScan.
Clustering algorithms can be called via associated table functions (more detail can be found in the relevant documentation), and currently support continuous numeric inputs only.

Performance and Administration

A new Executor Resource Manager (ERM) framework is provided
The ERM allows for CPU queries to run fully in parallel, and one or more CPU queries to run in parallel while a GPU query is executing (parallel GPU query kernel execution is not supported yet).
It also allows execution of CPU queries where the input datasets do not fit into the CPU buffer pool by executing on a fragment-by-fragment basis, paging from storage.
The Executor Resource Manager takes into account the resources needed for each query to schedule them in the most efficient manner.
It is defaulted on, however it can be turned off using the following flag: --enable-executor-resource-mgr=0, which will lead query kernel execution to follow the same serial, pre-7.0 path.

HeavyRF

New Features and Improvements

A new “cell editor” is provided. This supports multi-band antennas mounted within various sites within a cell. Various antenna attributes such as horizontal and vertical falloff can be easily applied based on an extensible library of antenna types.

Vegetation and building envelope attenuation can now be directly or indirectly specified. For example, typical values can be provided as scalar constants, or clutter object-specific attributes can be derived from normal SQL cursor queries. Vegetation attenuation can be tied to measurements of canopy moisture content from remote sensing based on seasonal statistics, or for individual dates to match drive test data. Building attenuation can be driven by various known or inferred characteristics, such as from parcels databases.

The right-hand information panel has been extended to better support targeting of large numbers of buildings. This can be done directly by searching and filtering on building attributes in the HeavyRF application, such as building type or size. But it can also be combined with analyses in Immerse extending to multiple arbitrary tags. For example, a set of locations with high customer value and high potential for churn can be identified in Immerse and tagged with attributes searchable in HeavyRF.

Last but not least, the HeavyRF platform will soon be available on NVIDIA’s LaunchPad. This facilitates initial evaluation of the software by making it immediately available together with appropriate supporting GPU hardware.

Release 6.4

HEAVY.AI continues to refine and extend the data connectors ecosystem. This release features general availability of data connectors for PostGreSQL, beta Immerse connectors for Snowflake and Redshift, and SQL support for Google BigQuery and Hive (beta). These managed data connections let you use HEAVY.AI as an acceleration platform, wherever your source data lives. Scheduling and automated caching ensure that from an end-user perspective, fast analytics are always running on the latest available data.

Immerse features four new chart types: Contour, Cross-section, Wind barb and Skew-t. While these are especially useful for atmospheric and geotechnical data visualization, Contour and Cross-section also have more general application.

Major improvements for time series analysis have been added. This includes time series comparison via window functions, and a large number of SQL window function additions and performance enhancements.

This release also includes two major architectural improvements:

The ability to perform cross-database queries in SQL, increasing flexibility across the board.
Render queries no longer block other GPU queries. In many use cases, renders can be significantly slower than other common queries. This should result in significant performance gains, particularly in map-heavy dashboards.

Release 6.2

Heavy Immerse

Chart animation through cross filter replay, allowing controlled playback of time-based data such as weather maps or GPS tracks.
You can now directly export your charts and dashboards as image files.
New control panel enables administrators to view the configuration of the system and easily access logs and system tables.
HeavyConnect now provides graphical Heavy Immerse support for Redshift, Snowflake, and PostGIS connections.
For CPU-only systems, mapping capabilities are improved with the introduction of multilayer CPU-rendered geo.

General Analytics

Numerous improvements to core SQL and geoSQL capabilities.
Support for string to numeric, timestamp, date, and time types with the new TRY_CAST operator.
Explicit and implicit cast support for numeric, timestamp, date, and time types.
Advanced string functions facilitate extraction of data from JSON and externally encoded string formats.
Improvements to COUNT DISTINCT reduces memory requirements considerably in cases with very large cardinalities or highly skewed data distributions.
Added MULTIPOINT and MULTILINESTRING geo types.
Convex and concave hull operators, allowing generation of polygons from points and multipoints. For example, you could generate polygons from clusters of GPS points.
Syntax and performance optimizations across all geometry types, table orderings, and commonly nested functions.
Significant functionality extension of window functions; define windows directly in temporal terms, which is particularly important in time series with missing observations. Window frame support allows improved control at the edges of windows.

Advanced Analytics

Two new functions now support direct loading of LiDAR data: tf_point_cloud_metadata quickly searches tile metadata and helps you find data to import, and tf_load_point_cloud does the actual import importing.
Network graph analytics functions have been added. These can work on networks alone, including non-geographic networks, or can find the least-cost path along a geographic network.
New spatial aggregation and smoothing functions. Aggregations work particularly well with LiDAR data--for example to pass through only the highest point within an area to create building or canopy height maps. Smoothing helps with noisy datasets and can reveal larger-scale patterns while minimizing visual distractions.

Release 6.1

Release 6.1.0 features more granular administrative monitoring dashboards based on logs. These have been accessible in an open format on the server side, and now they are available in Immerse, by specific dashboards, users, or queries. Intermediate and advanced SQL support continues to mature, with INSERT, window functions, and UNION ALL.

This release contains a number of user interface polish items requested by customers. Cartography now supports polygons with colorful borders and transparent fills. Table presentation has been enhanced in various ways, from alignment to zebra striping. And dashboard saving reminders have been scaled back, based on customer feedback.

The extension framework now features an enhanced “custom source” dialog, as well as new SQL commands to see installed extensions and their parameters. We introduce three new extensions. The first, tf_compute_dwell_times, reduces GPS event stream data volumes considerably while keeping relevant information. The others compute feature similarity scores and are very general.

This release also includes initial public betas of our PostgreSQL Immerse connector, and SQL support for COPY FROM ODBC database connections, making it easier to connect to your enterprise data.

Release 6.0

This release features large advances in data access, including intelligent linking to enterprise data (HeavyConnect) and support for raster geodata. SQL support includes high-performance string functions, as well as enhancements to window functions and table unions. Performance improvements are noticeable across the product, including fundamental advances in rendering, query compilation, and data transport. Our system administration tools have been expanded with a new Admin Portal, as well as additional system tables supporting detailed diagnostics. Major strides in extensibility include new charting options and a new extensions framework (beta).

Name Changes

Rebranded platform from OmniSci to HEAVY.AI, with OmniSciDB now HeavyDB, OmniSci Render now HeavyRender, and OmniSci Immerse now Heavy Immerse.

HeavyConnect and Data Import

HeavyConnect allows the HEAVY.AI platform to work seamlessly as an accelerator for data in other data lakes and data warehouses. For Release 6.0, CSV and Parquet files on local file systems and in S3 buckets can be linked or imported. Other SQL databases are also supported via ODBC (beta).
HeavyConnect enables users to specify a data refresh schedule, which ensures access to up-to-date data.
Heavy Immerse now supports import of dozens of raster data formats, including geoTIFF, geoJPEG , and PNG. HeavySQL now supports most any vector GIS file format.
Support is included for multidimensional arrays common in the sciences, including Grib2, NetCDF, and hd5.
Immerse now supports linking or import of files on the server filesystem (local or mounted). This help prevent slow data transfers when client bandwidth is limited.
File globbing and filtering allow import of thousands of files at once.

Other Immerse Enhancements

New Gauge chart for easy visualization of key metrics relative to target thresholds.
New landing page and Help Center.
Enhanced mapping workflows with automated column picking.

SQL Enhancements

Support for a wide range of performant string operations using a new string dictionary translation framework, as well as the ability to on-the-fly dictionary encode none-encoded strings with a new ENCODE_TEXT operator.
Support for UNION ALL is now enabled by default, with significant performance improvements from the previous release (where it was beta flagged).
Significant functionality and performance improvements for window functions, including the ability to support expressions in PARTITION and ORDER clauses.

Performance

Parallel compilation of queries and a new multi-executor shared code cache provide up to 20% throughput/concurrency gains for interactive usage scenarios.
10X+ performance improvements in many cases for initial join queries via optimized Join Hash Table framework.
New result set recycler allows for expensive query sub-steps to be cached via the SQL hint /*+ keep_result */, which can significantly increase performance when a subquery is used across multiple queries.
Arrow execution endpoints now leverage the parallel execution framework, and Arrow performance has been significantly improved when high-cardinality dictionary-encoded text columns are returned
Introduces a novel polygon rendering algorithm that does not require pre-triangulated or pre-grouped polygons and can render dynamically generated geometry on the fly (via ST_Buffer). The new algorithm is comparable to its predecessor in terms of both performance and memory and enables optimizations and enhancements in future releases.
New binary transport protocol to Heavy Immerse that significantly increases performance and interactivity for large result sets

System Administration

A new Admin Portal provides information on system resources usage and users.
System table support under a new information_schema database, containing 10 new system tables providing system statistics and memory and storage utilization.

Extensibility

New system and user-defined UDF framework (beta), comprising both row (scalar) and table (UDTF) functions, including the ability to define fast UDFs via Numba Python using the RBC framework, which are then inlined into the HeavyDB compiled query code for performant CPU and GPU execution.
System-provided table functions include generate_series for easy numeric series generation, tf_geo_rasterize_slope for fast geospatial binning and slope/aspect computation over elevation data, and others, with more capabilities planned for future releases.
Leveraging the new table function framework, a new HeavyRF module (licensed separate) includes tf_rf_prop and tf_rf_prop_max_signal table functions for fast radio frequency signal propagation analysis and visualization.
New Iframe chart type in Heavy Immerse to allow easier addition of custom chart types. (BETA)

Release 5.10

Row-level security (RLS) can be used by an administrator to apply security filtering to queries run as a user or with a role.
Support for import from dozens of image and raster file types, such as jpeg, png, geotiff, and ESRI grid, including remote files.
Significantly more performant, parallelized window functions, executing up to 10X faster than in Release 5.9.
Automatic use of columnar output (instead of the default row-wise output) for large projections, reducing query times by 5-10X in some cases.
Support for full set of ST_TRANSFORM SRIDs supported by geos/proj4 library.
Support for numerous vector GIS files (100+ formats supported by current GDAL release).
Support for multidimensional array import from formats common in science and meteorology.
Improved Table chart export to access all data represented by a Table chart.
Introduced dashboard-level named custom SQL.

Release 5.9

Significant speedup for POINT and fixed-length array imports and CTAS/ITAS, generally 5-20X faster.
The PNG encoding step of a render request is no longer a blocking step, providing improvement to render concurrency.
Adds support to hide legacy chart types from add/edit chart menu in preparation for future deprecation (defaults to off).
BETA - Adds custom expressions to table columns, allowing for reusable custom dimensions and measures within a single dashboard (defaults to off).
BETA - Adds Crosslink feature with Crosslink Panel UI, allowing crossfilters to fire across different data sources within the same dashboard (defaults to off).
BETA - Adds Custom SQL Source support and Custom SQL Source Manager, allowing the creation of a data source as a SQL statement (defaults to off)

Release 5.8

Parallel execution framework is on by default. Running with multiple executors allows parts of query evaluation, such as code generation and intermediate reductions, to be executed concurrently. Currently available for single-node deployments.
Spatial joins between geospatial point types using the ST_Distance operator are accelerated using the overlaps hash join framework, with speedups up to 100x compared to Release 5.7.1.
Significant performance gains for many query patterns through optimization of query code generation, particularly benefitting CPU queries.
Window functions can now be executed without a partition clause being specified (to signify a partition encompassing all rows in the table).
Window functions can now execute over tables with multiple fragments and/or shards.
Native support for ST_Transform between all UTM Zones and EPSG:4326 (Lon/Lat) and EPSG:900913 (Web Mercator).
ST_Equals support for geospatial columns.
Support for the ANSI SQL WIDTH_BUCKET operator for easier and more performant numeric binning, now also used in Immerse for all numeric histogram visualizations
The Vulkan backend renderer is now enabled by default. The legacy OpenGL renderer is still available as a fallback if there are blocking issues with Vulkan. You can disable the Vulkan renderer using the renderer-use-vulkan-driver=false configuration flag.
- Vulkan provides improved performance, memory efficiency, and concurrency.
- You are likely to see some performance and memory footprint improvements with Vulkan in Release 5.8, most significantly in multi-GPU systems.
Support for file path regex filter and sort order when executing the COPY FROM command.
New ALTER SYSTEM CLEAR commands that enable clearing CPU or GPU memory from Immerse SQL Editor or any other SQL client.

Release 5.7

Extensive enhancements to Immerse support for parameters. Parameters can now be used in chart column selectors, chart filters, chart titles, global filters, and dashboard titles. Dashboards can have parameter widgets embedded on them, side-by-side with charts. Parameter values are visible in chart axes/labels, legends, and tooltips, and you can toggle parameter visibility.
In Immerse Pointmap charts, you can specify which color-by attribute always render on top, which is useful for highlight anomalies in data.
Significantly faster and more accurate "lasso" tool filters geospatial data on Immerse Pointmap charts, leveraging native geospatial intersection operations.
Immerse 3D Pointmap chart and HTML support in text charts are available as a beta feature.
Airplane symbol shape has been added as a built-in mark type for the Vega rendering API.
Vega symbol and multi-GPU polygon renders have been made significantly faster.
User-interrupt of query kernels is now on by default. Queries can be interrupted using Ctrl + C in omnisql, or by calling the interrupt API.
Parallel executors is in public beta (set with --num-executors flag).
Support for APPROX_QUANTILE aggregate.
Support for default column values when creating a table and across all append endpoints, including COPY TO, INSERT INTO TABLE SELECT, INSERT, and binary load APIs.
Faster and more robust ability to return result sets in Apache Arrow format when queried from a remote client (i.e. non-IPC).
More performant and robust high-cardinality group-by queries.
ODBC driver now supports Geospatial data types.

Release 5.6

Custom SQL dimensions, measures, and filters can now be parameterized in Immerse, enabling more flexible and powerful scenario analysis, projections, and comparison use cases.
New angle measure added to Pointmap and Scatter charts, allowing orientation data to be visualized with wedge and arrow icons.
Custom SQL modal with validation and column name display now enabled across all charts in Immerse.
Significantly faster point-in-polygon joins through a new range join hash framework.
Approximate Median function support.
INSERT and INSERT FROM SELECT now support specification of a subset of columns.
Automatic metadata updates and vacuuming for optimizing space usage.
Significantly improved OmniSciDB startup time, as well as a number of significant load and performance improvements.
Improvements to line and polygon stroke rendering and point/symbol rendering.

Release 5.5

Ability to set annotations on New Combo charts for different dimension/measure combinations.
New ‘Arrow-over-the-wire’ capability to deliver result sets in Apache Arrow format, with ~3x performance improvement over Thrift-based result set serialization.
Support for concurrent SELECT and UPDATE/DELETE queries for single-node installations
Initial OmniSci Render support for CPU-only query execution ("Query on CPU, render on GPU"), allowing for a wider set of deployment infrastructure choices.
Cap metadata stored on previous states of a table by using MAX_ROLLBACK_EPOCHS, improving performance for streaming and small batch load use cases and modulating table size on disk

Release 5.4

Added initial compilation support for NVIDIA Ampere GPUs.
Improved performance for UPDATE and DELETE queries.
Improved the performance of filtered group-by queries on large-cardinality string columns.
Added SQL function SAMPLE_RATIO, which takes a proportion between 0 and 1 as an input argument and filters rows to obtain a sampling of a dataset.
Added support for exporting geo data in GeoJSON format.
Dashboard filter functionality is expanded, and filters can be saved as views.
You can perform bulk actions on the dashboard list.
New UI Setting panel in Immerse for customizing charts.
Tabbed dashboards.
SQL Editor now handles Vega JSON requests.

Release 5.3

New Combo chart type in Immerse provides increased configurability and flexibility.
Immerse chart-specific filters and quick filters add increased flexibility and speed.
Updated Immerse Filter panel provides a Simple mode and Advanced mode for viewing and creating filters.
On multilayer charts, layer visibility can be set by zoom level.
Different map charts can be synced together for pan and zoom actions, regardless of data source.
Array support for the Array type over JDBC.
SELECT DISTINCT in UNION ALL is supported. (UNION ALL is prerelease and must be explicitly enabled.
Support for joins on DECIMAL types.
Performance improvements on CUDA GPUs, particularly Volta and Turing.

Release 5.2

NULL support for geospatial types, including in ALTER TABLE ADD COLUMN.
SQL SHOW commands: SHOW TABLES, SHOW DATABASES, SHOW CREATE TABLE, and SHOW USER SESSIONS.
Ability to perform updates and deletes on temporary tables.
Updates to JDBC driver, including escape syntax handling for the fn keyword and added support to get table metadata.
Notable performance improvements, particularly for join queries, projection queries with order by and/or limit, queries with scalar subqueries, and multicolumn group-by queries.
Query interrupt capability improved to allow canceling long-running queries, also supports JDBC now.
Completely overhauled SQL Editor, including query formatting, snippets, history and more.
Database switching from within Immerse, as well as dashboard URLs that contain the database name.
Over 50% reduction in load times for the dashboards list initial load and search.
Cohort builder now supports count (# records) in aggregate filter.
Improved error handling and more meaningful error messages.
Custom logos can now be configured separately for light and dark themes.
Logos can be configured to deep-link to a specific URL.

Release 5.1

Added support for UPDATE via JOIN with a subquery in the WHERE clause.
Initial support for TEMPORARY (that is, non-persistent) tables.
Improved performance for multi-column GROUP BY queries, as well as single column GROUP BY queries with high cardinality. Performance improvement varies depending on data volume and available hardware, but most use cases can expect a 1.5 to 2x performance increase over OmniSciDB 5.0.
Improved support for EXISTS and NOT EXISTS subqueries.
Added support for LINESTRING, POLYGON, and MULTIPOLYGON in user defined functions.
Immerse log-ins are fully sessionized and persist across page refreshes.
Pie chart now supports "All Others" and percentage labels.
Cohorts can now be built with aggregation-based filters.
New filter sets can be created through duplicating existing filter sets.
Dashboard URLs now link to individual filter sets.

Release 5.0

The new filter panel in Immerse enables the ability to toggle filters on and off, and introduces Filter Sets to provide quick access to different sets of filters in one dashboard.
Immerse now supports using global and cross-filters to interactively build cohorts of interest, and the ability to apply a cohort as a dashboard filter, either within the existing filter set or in a new filter set.
Data Catalog, located within Data Import, is a repository of datasets that users can use to enhance existing analyses.
- To see these new features in action, please watch this video from Converge 2019, where Rachel Wang demonstrates how you can use them.
Added support for binary dump and restore of database tables.
Added support for compile-time registered user-defined functions in C++, and experimental support for runtime user-defined SQL functions and table functions in Python via the Remote Backend Compiler.
Support for some forms of correlated subqueries.
Support for update via subquery, to allow for updating a table based on calculations performed on another table.
Multistep queries that generate large, intermediate result sets now execute up to 2.5x faster by leveraging new JIT code generator for reductions and optimized columnarization of intermediate query results.
Frontend-rendered choropleths now support the selection of base map layers.

This sitemap link is for the benefit of the search crawler.

Overview

HEAVY.AI is an analytics platform designed to handle very large datasets. It leverages the processing power of GPUs alongside traditional CPUs to achieve very high performance. HEAVY.AI combines an open-source SQL engine (HeavyDB), server-side rendering (HeavyRender), and web-based data visualization (Heavy Immerse) to provide a comprehensive platform for data analysis.

HeavyDB

The foundation of the platform is HeavyDB, an open-source, GPU-accelerated database. HeavyDB harnesses GPU processing power and returns SQL query results in milliseconds, even on tables with billions of rows. HeavyDB delivers high performance with rapid query compilation, query vectorization, and advanced memory management.

Native SQL

With native SQL support, HeavyDB returns query results hundreds of times faster than CPU-only analytical database platforms. Use your existing SQL knowledge to query data. You can use the standalone SQL engine with the command line, or the SQL editor that is part of the Heavy Immerse visual analytics interface. Your SQL query results can output to Heavy Immerse or to third-party software such as Birst, Power BI, Qlik, or Tableau.

Geospatial Data

HeavyDB can store and query data using native Open Geospatial Consortium (OGC) types, including POINT, LINESTRING, POLYGON, and MULTIPOLYGON. With geo type support, you can query geo data at scale using special geospatial functions. Using the power of GPU processing, you can quickly and interactively calculate distances between two points and intersections between objects.

Open Source

HeavyDB is open source and encourages contribution and innovation from a global community of users. It is available on Github under the Apache 2.0 license, along with components like a Python interface (heavyai) and JavaScript infrastructure (mapd-connector, mapd-charting), making HEAVY.AI the leader in open-source analytics.

HeavyRender

HeavyRender works on the server side, using GPU buffer caching, graphics APIs, and a Vega-based interface to generate custom pointmaps, heatmaps, choropleths, scatterplots, and other visualizations. HEAVY.AI enables data exploration by creating and sending lightweight PNG images to the web browser, avoiding high-volume data transfers. Fast SQL queries make metadata in the visualizations appear as if the data exists on the browser side.

Network bandwidth is a bottleneck for complex chart data, so HEAVY.AI uses in-situ rendering of on-GPU query results to accelerate visual rendering. This differentiates HEAVY.AI from systems that execute queries quickly but then transfer the results to the client for rendering, which slows performance.

Geospatial Analysis

Efficient geospatial analysis requires fast data-rendering of complex shapes on a map. HEAVY.AI can import and display millions of lines or polygons on a geo chart with minimal lag time. Server-side rendering technology prevents slowdowns associated with transferring data over the network to the client. You can select location shapes down to a local level, like census tracts or building footprints, and cross-filter interactively.

Visualize with Vega

Complex server-side visualizations are specified using an adaptation of the Vega Visualization Grammar. Heavy Immerse generates Vega rendering specifications behind the scenes; however, you can also generate custom visualizations using the same API. This customizable visualization system combines the agility of a lightweight frontend with the power of a GPU engine.

Heavy Immerse

Heavy Immerse is a web-based data visualization interface that uses HeavyDB and HeavyRender for visual interaction. Intuitive and easy to use, Heavy Immerse provides standard visualizations, such as line, bar, and pie charts, as well as complex data visualizations, such as geo point maps, geo heat maps, choropleths, and scatter plots. Heavy Immerse provides quick insights and makes them easy to recognize.

Dashboards

Use dashboards to create and organize your charts. Dashboards automatically cross-filter when interacting with data, and refresh with zero latency. You can create dashboards and interact with conventional charts and data tables, as well as scatterplots and geo charts created by HeavyRender. You can also create your own queries in the SQL editor.

Charts

Heavy Immerse lets you create a variety of different chart types. You can display pointmaps, heatmaps, and choropleths alongside non-geographic charts, graphs, and tables. When you zoom into any map, visualizations refresh immediately to show data filtered by that geographic context. Multiple sources of geographic data can be rendered as different layers on the same map, making it easy to find the spatial relationships between them.

Create geo charts with multiple layers of data to visualize the relationship between factors within a geographic area. Each layer represents a distinct metric overlaid on the same map. Those different metrics can come from the same or a different underlying dataset. You can manipulate the layers in various ways, including reorder, show or hide, adjust opacity, or add or remove legends.

Use Multiple Sources

Heavy Immerse can visually display dozens of datasets in the same dashboard, allowing you to find multi-factor relationships that you might not otherwise consider. Each chart (or groups of charts) in a dashboard can point to a different table, and filters are applied at the dataset level. Multisource dashboards make it easier to quickly compare across datasets, without merging the underlying tables.

Streaming Data

Heavy Immerse is ideal for high-velocity data that is constantly streaming; for example, sensor, clickstream, telematics, or network data. You can see the latest data to spot anomalies and trend variances rapidly. Immerse auto-refresh automatically updates dashboards at flexible intervals that you can tailor to your use case.

Ready to Get Started?

I want to...

See...

Install HEAVY.AI

Upgrade to the latest version

Configure HEAVY.AI

See some tutorials and demos to help get up and running

Learn more about charts in Heavy Immerse

Use HEAVY.AI in the cloud

See what APIs work with HEAVY.AI

Learn about features and resolved issues for each release

Know what issues and limitations to look out for

See answers to frequently asked questions

Installation and Configuration

System Requirements

Software Requirements

Operating Systems
- CentOS/RHEL 7.0 or later
- Ubuntu 20.04 or later

Ubuntu 22.04 is not currently supported.

Additional Components
- OpenJDK version 8 or higher
- EPEL
- wget or curl
- Kernel headers
- Kernel development packages
- log4j 2.15.0 or higher
NVIDIA hardware and software (for GPU installs only)
- Hardware: Ampere, Turing, Volta, or Pascal series GPU cards. HEAVY.AI recommends that each GPU card in a server or distributed environment be of the same series.
- Software:
  - NVIDIA CUDA drivers version 520 and Cuda 11.8 or higher. Run nvidia-smi to determine the currently running driver.
  - Up-to-date Vulkan drivers.
Supported web browsers (Enterprise Edition, Immerse). Latest stable release of:
- Chrome
- FireFox
- Safari version 15.x or higher

Some features in Heavy Immerse are not supported in the Internet Explorer browser due to performance issues in IE. HEAVY.AI recommends that you use a different browser to experience the latest Immerse features.

Installation

You can download HEAVY.AI for your preferred platform from .

The CPU (no GPUs) install does not support backend rendering. For example, Pointmap and Scatterplot charts are not available. The GPU install supports all chart types.

The Open Source options do not require a license, and do not include Heavy Immerse.

Free Version

HEAVY.AI Free is a full-featured version of the HEAVY.AI platform available at no cost for non-hosted commercial use.

To get started with HEAVY.AI Free:

Go to the Get Started with HEAVY.AI, and in the HEAVY.AI Free section, click Get Free License.
On the Get HEAVY.AI Free page, enter your email address and click I Agree.
Open the HEAVY.AI Free Edition Activation Link email that you receive from HEAVY.AI, and click Click Here to view and download the free edition license. You will need this license to run HEAVY.AI after you install it. A copy of the license is also sent to your email.
In the What's Next section, click See Install Options to select the best version of HEAVY.AI for your hardware and software configuration. Follow the instructions for the download or cloud version you choose.
Install HEAVY.AI, using the instructions for your platform.
Verify that OmniSci is working correctly by following the instructions in the Checkpoint section at the end of the installation instructions. For example, the Checkpoint instructions for the CentOS CPU with Tarball installation is here.

Add Users

You can create additional HEAVY.AI users to collaborate with.

Connect to Immerse using a web browser connected to your host machine on port 6273. For example, http://heavyai.mycompany.com:6273.
Open the SQL Editor.
Use the CREATE USER command to create a new user. For information on syntax and options, see CREATE USER.

Installing on CentOS

In this section, you will find recipes to install HEAVY.AI platform and NVIDIA drivers using package manager like yum or tarball.

Installing on Ubuntu

In this section, you will find recipes to install HEAVY.AI platform and NVIDIA drivers using package manager like apt or tarball.

Installing on Docker

Installing OmniSci on Docker

In this section you will find the recipes to install HEAVY.AI platform using Dcoker.

Getting Started on AWS

Getting Started with AWS AMI

You can use the HEAVY.AI AWS AMI (Amazon Web Services Amazon Machine Image) to try HeavyDB and Heavy Immerse in the cloud. Perform visual analytics with the included New York Taxi database, or import and explore your own data.

Many options are available when deploying an AWS AMI. These instructions skip to the specific tasks you must perform to deploy a sample environment.

Prerequisite

You need a security key pair when you launch your HEAVY.AI instance. If you do not have one, create one before you continue.

Go to the EC2 Dashboard.
Select Key Pairs under Network & Security.
Click Create Key Pair.
Enter a name for your key pair. For example, MyKey.
Click Create. The key pair PEM file downloads to your local machine. For example, you would find MyKey.pem in your Downloads directory.

Launching Your Instance

Go to the AWS Marketplace page for HEAVY.AI and select the version you want to use. You can get overview information about the product, see pricing, and get usage and support information.
Click Continue to Subscribe to subscribe.
Read the Terms and Conditions, and then click Continue to Configuration.
Select the Fulfillment Option, Software Version, and Region.
Click Continue to Launch.
On the Launch this software page, select Launch through EC2, and then click Launch.
From the Choose and Instance Type page, select an available EC2 instance type, and click Review and Launch.
Review the instance launch details, and click Launch.
Select a key pair, or click Create a key pair to create a new key pair and download it, and then click Launch Instances.
On the Launch Status page, click the instance name to see it on your EC2 Dashboard Instances page.

Using HEAVY.AI Immerse on Your AWS Instance

To connect to Heavy Immerse, you need your Public IP address and Instance ID for the instance you created. You can find these values on the Description tab for your instance.

To connect to Heavy Immerse:

Point your Internet browser to the public IP address for your instance, on port 6273. For example, for public IP 54.83.211.182, you would use the URL https://54.83.211.182:6273.
If you receive an error message stating that the connection is not private, follow the prompts onscreen to click through to the unsecured website. To secure your site, see Tips for Securing Your EC2 Instance.
1. Enter the USERNAME (admin), PASSWORD ( {Instance ID} ), and DATABASE (heavyai). If you are using the BYOL version, enter you license key in the key field and click Apply.
Click Connect.
On the Dashboards page, click NYC Taxi Rides. Explore and filter the chart information on the NYC Taxis Dashboard.

For more information on Heavy Immerse features, see Introduction to Heavy Immerse.

Importing Your Own Data

Working with your own familiar dataset makes it easier to see the advantages of HEAVY.AI processing speed and data visualization.

To import your own data to Heavy Immerse:

Export your data from your current datastore as a comma-separated value (CSV) or tab-separated value (TSV) file. HEAVY.AI supports Latin-1 ASCII format and UTF-8. If you want to load data with another encoding (for example, UTF-16), convert the data to UTF-8 before loading it to HEAVY.AI.
Point your Internet browser to the public IP address for your instance, on port 6273. For example, for public IP 54.83.211.182, you would use the URL https://54.83.211.182:6273.
Enter the USERNAME (admin) and PASSWORD ( {instance ID} ). If you are using the BYOL version, enter you license key in the key field and click Apply.
Click Connect.
Click Data Manager, and then click Import Data.
Drag your data file onto the table importer page, or use the directory selector.
Click Import Files.
Verify the column names and datatypes. Edit them if needed.
Enter a Name for your table.
Click Save Table.
Click Connect to Table.
On the New Dashboard page, click Add Chart.
Choose a chart type.
Add dimensions and measures as required.
Click Apply.
Enter a Name for your dashboard.
Click Save.

For more information, see Loading Data.

Accessing Your HEAVY.AI Instance Using SSH

Follow these instructions to connect to your instance using SSH from MacOS or Linux. For information on connecting from Windows, see Connecting to Your Linux Instance from Windows Using PuTTY.

Open a terminal window.
Locate your private key file (for example, MyKey.pem). The wizard automatically detects the key you used to launch the instance.
Your key must not be publicly viewable for SSH to work. Use this command to change permissions, if needed:
Connect to your instance using its Public DNS. The default user name is centos or ubuntu, depending on the version you are using. For example:
```
ssh -i MyKey.pem [email protected]
```
Use the following command to run the heavysql SQL command-line utility on HeavyDB. The default user is admin and the default password is { Instance ID }:
```
$HEAVYAI_PATH/bin/heavysql
```
For more information, see heavysql.

Getting Started on GCP

Getting Started with HEAVY.AI on Google Cloud Platform

Follow these instructions to get started with HEAVY.AI on Google Cloud Platform (GCP).

Prerequisites

You must have a Google Cloud Platform account. If you do not have an account, follow these instructions to sign up for one.

To launch HEAVY.AI on Google Cloud Platform, you select and configure an instance.

Launching Your HEAVY.AI Instance

On the solution Launcher Page, click Launch on Compute Engine to begin configuring your deployment.

Before deploying a solution with a GPU machine type, avoid potential deployment failure by checking your available quota for a project to make sure that you have not exceeded your limit.

To launch HEAVY.AI on Google Cloud Platform, you select and configure a GPU-enabled instance.

Search for HEAVY.AI on the heavyai-launcher-public project on Google Cloud Platform, and select a solution. HEAVY.AI has four instance types:
On the solution Launcher Page, click Launch to begin configuring your deployment.
On the new deployment page, configure the following:
- Deployment name
- Zone
- Machine type - Click Customize and configure Cores and Memory, and select Extend memory if necessary.
- GPU type. (Not applicable for CPU configurations.)
- Number of GPUs - (Not applicable for CPU configurations.) Select the number of GPUs; subject to quota and GPU type by region. For more information about GPU-equipped instances and associated resources, see GPU Models for Compute Engine.
- Boot disk type
- Boot disk size in GB
- Networking - Set the Network, Subnetwork, and External IP.
- Firewall - Select the required ports to allow TCP-based connectivity to HEAVY.AI. Click More to set IP ranges for port traffic and IP forwarding.
Accept the GCP Marketplace Terms of Service and click Deploy.
In the Deployment Manager, click the instance that you deployed.
Launch the Heavy Immerse client:
- Record the Admin password (Temporary).
- Click the Site address link to go to the Heavy Immerse login page. Enter the password you recorded, and click Connect.
- Copy your license key from the registration email message. If you have not received your license key, contact your Sales Representative or register for your 30-day trial here.
- Connect to Immerse using a web browser connected to your host machine on port 6273. For example, http://heavyai.mycompany.com:6273.
- When prompted, paste your license key in the text box and click Apply.
- Click Connect to start using HEAVY.AI.
On successful login, you see a list of sample dashboards loaded into your instance.

Getting Started on Azure

Getting Started with HEAVY.AI on Microsoft Azure

Follow these instructions to get started with HEAVY.AI on Microsoft Azure.

Prerequisites

You must have a Microsoft Azure account. If you do not have an account, go to to sign up for one.

Configure Your HEAVY.AI Instance

To launch HEAVY.AI on Microsoft Azure, you configure a GPU-enabled instance.

1) Log in to you Microsoft Azure portal.

2) On the left side menu, create a Resource group, or use one that your organization has created.

3) On the left side menu, click Virtual machines, and then click Add.

4) Create your virtual machine:

On the Basics tab:
- In Project Details, specify the Resource group.
- Specify the Instance Details:
  - Virtual machine name
  - Region
  - Image (Ubuntu 16.04 or higher, or CentOS/RHEL 7.0 or higher)
  - Size. Click Change size and use the Family filter to filter on GPU, based on your use case and requirements. Not all GPU VM variants are available in all regions.
- For Username, add any user name other than admin.
- In Inbound Port Rules, click Allow selected ports and select one or more of the following:
  - HTTP (80)
  - HTTPS (443)
  - SSH (22)
On the Disks tab, select Premium or Standard SSD, depending on your needs.
For the rest of the tabs and sections, use the default values.

5) Click Review + create. Azure reviews your entries, creates the required services, deploys them, and starts the VM.

6) Once the VM is running, select the VM you just created and click the Networking tab.

7) Click the Add inbound button and configure security rules to allow any source, any destination, and destination port 6273 so you can access Heavy Immerse from a browser on that port. Consider renaming the rule to 6273-Immerse or something similar so that the default name makes sense.

8) Click Add and verify that your new rule appears.

Azure-specific configuration is complete. Now, follow standard for your Linux distribution and installation method.

Getting Started on Kubernetes (BETA)

Using HEAVY.AI's Helm Chart on Kubernetes

This documentation outlines how to use HEAVY.AI’s Helm Chart within a Kubernetes environment. It assumes the user is a network administrator within your organization and is an experienced Kubernetes administrator. This is not a beginner guide and does not instruct on Kubernetes installation or administration. It is quite possible you will require additional manifest files for your environment.

Overview

The HEAVY.AI Helm Chart is a template of how to configure deployment of the HEAVY.AI platform. The following files need to be updated/created to reflect the customer's deployment environment.

values.yml
<customer_created>-pv.yml
<customer_created>-pvc.yml

Once the files are updated/created, follow the installation instructions below to install the Helm Chart into your Kubernetes environment.

Where to get the Helm Chart?

The Helm Chart is located in the HEAVY.AI github repository. It can be found here: https://releases.heavy.ai/ee/helm/heavyai-1.0.0.tgz

What’s included?

     Helm-workspace
          ↳heavyai
               ↳Chart.yml
               ↳values.yml
	       ↳templates
	            ↳README.pdf
                    ↳deployment.yml
          ↳misc
               ↳example-heavyai-pv.yml
               ↳example-heavyai-pvc.yml

File Name

Description

Chart.yml

HEAVY.AI Helm Chart. Contains version and contact information.

values.yml

Copy this file and edit values specific to your HEAVY.AI deployment. This is where to note the PVC name. This file is annotated to identify typical customizations and is pre-populated with default values.

README.pdf

These instructions.

deployment.yml

HEAVY.AI platform deployment template. DO NOT EDIT

example-heavyai-pv.yml

Example PV file.

example-heavyai-pv.yml

Example PVC file.

How to install?

Before installing, create a PV/PVC that the deployment will use. Save these files in the regular PVC/PV location used in the customer’s environment. Reference the README.pdf file found in the Helm Chart under templates and the example PV/PVC manifests in the misc folder in the helm chart. The PVC name is then provided to the helm install command.
In your current directory; copy the values.yml file from the HEAVY.AI Helm Chart and customize for your needs.
Run the helm install command with the desired deployment name and Helm Chart.
1. When using a values.yml file:
  $ helm install heavyai --values values.yml heavyaihelmchart-1.0.0.tgz
2. When not using a values.yml file:
  If you only need to change a value or two from the default values.yml file you can use --set instead of a custom values.yml file.
  For example:
  $ helm install heavyai --set pvcName=MyPVCName heavyaihelmchart-1.0.0.tgz

How to uninstall?

To uninstall the helm installed HEAVY.AI instance:

$ helm uninstall heavyai

The PVC and PV space defined for the HEAVY.AI instance is not removed. The retained space must be manually deleted.

Example: values.yml

# Default values for heavyai.
# This is a YAML-formatted file.
# Declare variables to be passed into your templates.
#
# Version of heavyai to install in the format 'v7.0.0' or 'latest' for the latest version released.
version: v7.0.0
# Persistant volume claim name to use with heavyai.
pvcName: heavyai-pvc
# Namespace to install heavyai in.
nameSpace: heavyai
# Number or GPU's to assign to heavyai or 0 to run the CPU version of heavyai.
gpuNumber: 1
# NodeName to install heavyai on, if you wish to let Kubernetes schedule a host, leave it blank.
nodeName: heavyai-node
# Immerse port redirect of 6273.
hostPortImmerse: 9273
# TCP port redirect of 6274.
hostPortTCP: 9274
# HTTP port redirect of 6278.
hostPortHTTP: 9278

Example: example-heavyai-pvc.yml

apiVersion: v1
kind: PersistentVolumeClaim
metadata:
 name: heavyai-pvc
 namespace: heavyai
spec:
 volumeMode: Filesystem
 accessModes:
   - ReadWriteOnce
 resources:
   requests:
     storage: 100Gi
 storageClassName: heavyai

Example: example-heavyai-pv.yml

apiVersion: v1
kind: PersistentVolume
metadata:
 name: heavyai-pv
spec:
 capacity:
   storage: 100Gi
 volumeMode: Filesystem
 accessModes:
   - ReadWriteOnce
 persistentVolumeReclaimPolicy: Retain
 storageClassName: heavyai
 mountOptions:
   - hard
   - nfsvers=4.1
 nfs:
   path: {your nfs path goes here }
   server: { your nfs server name goes here }

Upgrading

In this section, you will find recipes to upgrade from the OmniSci to the HEAVY.AI platform and upgrade between versions of the HEAVY.AI platform.

Supported Upgrade Path

The following table shows the steps needed to move from one version to a later one.

Initial Version

Final Version

Upgrade Steps

Versions 5.x and 6.0.0 are no longer supported; use these only as needed to facilitate an upgrade to a supported version.

Example: if you are running an OmniSci version older than 5.5, you must first upgrade to 5.5, then upgrade to 6.0 and after that upgrade to 7.0. If you are running 6.0 - 6.4, you can upgrade directly to 7.0 in a single step.

Upgrading HEAVY.AI

This section is giving a recipe to upgrade between fully compatible products version.

As with any software upgrade, it is important that you back up your data before upgrading. Each release introduces efficiencies that are not necessarily compatible with earlier releases of the platform. HeavyAI is never expected to be backward compatible.

Back up the contents of your $HEAVYAI_STORAGE directory.

Upgrading from Omnisci

If you need to upgrade from Omnisci to HEAVY.AI 6.0 or later, please refer to the specific recipe.

Direct upgrades from Omnisci to HEAVY.AI version later than 6.0 aren't allowed nor supported.

Upgrading Using Docker

To upgrade HEAVY.AI in place in Docker

In a terminal window, get the Docker container ID.

sudo docker container ps --format "{{.Id}} {{.Image}}" \
-f status=running | grep omnisci\/

You should see output similar to the following. The first entry is the container ID. In this example, it is 9e01e520c30c:

9e01e520c30c omnisci/omnisci-ee-gpu

Stop the HEAVY.AI Docker container. For example:

docker container stop 9e01e520c30c

Optionally, remove the HEAVY.AI Docker container. This removes unused Docker containers on your system and saves disk space.

docker container rm 9e01e520c30c

Backup the Omnisci data directory (typically /var/lib/omnisci)

tar zcvf /backup_dir/omnisci_storage_backup.tar.gz /var/lib/omnisci

Download the latest version of the HEAVY.AI Docker image according to the Edition and device you are actually coming from Select the tab depending on the Edition (Enterprise, Free, or Open Source) and execution Device (GPU or CPU) you are upgrading.

sudo docker run -d --gpus=all \
  -v /var/lib/heavyai:/var/lib/heavyai \
  -p 6273-6278:6273-6278 \
  heavyai/heavyai-ee-cuda:latest

sudo docker run -d -v \
/var/lib/heavyai:/var/lib/heavyai \
-p 6273-6278:6273-6278 \
heavyai/heavyai-ee-cpu:latest

sudo docker run -d --gpu=all \
  -v /var/lib/heavyai:/var/lib/heavyai \
  -p 6273-6278:6273-6278 \
  heavyai/core-os-cuda:latest

sudo docker run -d -v \
/var/lib/heavyai:/var/lib/heavyai \
-p 6273-6278:6273-6278 \
heavyai/core-os-cpu:latest

If you don't want to upgrade to the latest version but want to upgrade to a specific version, change thelatesttag with the version needed.

If the version needed is the 6.0 use v6.0.0 as the version tag in the image name

heavyai/heavyai-ee-cuda:v6.0.0

Check that the docker is up and running a docker ps commnd:

sudo docker container ps --format "{{.Image}} {{.Status}}" \
-f status=running | grep heavyai\/

You should see an output similar to the following.

heavyai/heavyai-ee-cuda Up 48 seconds ago

This runs both the HEAVY.AI database and Immerse in the same container.

You can optionally add --rm to the Docker run command so that the container is removed when it is stopped.

See also the note regarding the CUDA JIT Cache in Optimizing Performance.

Upgrading HEAVY.AI Using Package Managers and Tarball

To upgrade an existing system installed with package managers or tarball. The commands upgrade HEAVY.AI in place without disturbing your configuration or stored data

Stop the HEAVY.AI services.

sudo systemctl stop heavydb heavy_web_server

Back up your $HEAVYAI_STORAGE directory (the default location is /var/lib/heavyai).

Run the appropriate set of commands depending on the method used to install the previous version of the software.

sudo yum update heavyai.x86_64

sudo apt update
sudo apt upgrade heavyai

Make a backup of your actual installation

sudo mv /opt/heavyai /opt/heavyai_backup

Download and Install the latest version following the install documentation for your Operative System CentOS/RHEL and Ubuntu

When the upgrade is complete, start the HEAVY.AI services.

sudo systemctl start heavydb heavy_web_server

CUDA Compatibility Drivers

This procedure is considered experimental.

In some situations, you might not be able to upgrade NVIDIA CUDA drivers on a regular basis. To work around this issue, NVIDIA provides compatibility drivers that allow users to use newer features without requiring a full upgrade. For information about compatibility drivers, see https://docs.nvidia.com/deploy/cuda-compatibility/index.html.

Installing the Drivers

Use the following commands to install the CUDA 11 compatibility drivers on Ubuntu:

wget 
https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2004/x86_64/cuda-ubuntu2004.pin

mv cuda-ubuntu2004.pin /etc/apt/preferences.d/cuda-repository-pin-600 

apt-key adv --fetch-keys 
https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2004/x86_64/7fa2af80.pub

add-apt-repository "deb 
https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2004/x86_64/ /" 

apt update 

nvidia-smi 

apt install cuda-compat-11-0 

nvidia-smi 

export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/usr/local/cuda-11.0/compat/ 

nvidia-smi

After the last nvidia-smi, ensure that CUDA shows the correct version.

The driver version will still show as the old version.

Updating systemd Files

After installing the drivers, update the systemd files in /lib/systemd/system/heavydb.service.

In the service section, add or update the environment property

Environment=LD_LIBRARY_PATH=/usr/local/cuda-11.0/compat:$LD_LIBRARY_PATHbash

The file should look like that

[Unit] 
Description=HEAVY.AI database server 
After=network.target remote-fs.target

[Service] 
Environment=LD_LIBRARY_PATH=/usr/local/cuda-11.0/compat:$LD_LIBRARY_PATH
User=heavyai 
Group=heavyai 
WorkingDirectory=/opt/heavyai
ExecStart=/opt/heavyai/bin/heavydb --config /var/lib/heavyai/heavy.conf 
KillMode=control-group 
SuccessExitStatus=143 
LimitNOFILE=65536 
Restart=always

[Install] 
WantedBy=multi-user.target

Then force the reload of the systemd configuration

// Some code

Uninstalling

This is a recipe to permanently remove HEAVY.AI Software, services, and data from your system.

Uninstalling HEAVY.AI from Docker

To uninstall HEAVY.AI in Docker, stop and delete the current Docker container.

In a terminal window, get the Docker container ID:

sudo docker container ps --format "{{.Id}} {{.Image}}" \
-f status=running | grep heavyai\/

You should see an output similar to the following. The first entry is the container ID. In this example, it is 9e01e520c30c:

9e01e520c30c omnisci/omnisci-ee-gpu

To see all containers, both running and stopped, use the following command:

sudo docker container ps -a

Stop the HEAVY.AI Docker container. For example:

sudo docker container stop 9e01e520c30c

Remove the HEAVY.AI Docker container to save disk space. For example:

sudo docker container rm 9e01e520c30c

Uninstalling HEAVY.AI on Redhat and Ubuntu

To uninstall an existing system installed with Yum, Apt, or Tarball connect using the user that runs the platform, typically heavyai.

Disable and stop all HEAVY.AI services.

sudo systemctl disable heavy_web_server --now
sudo systemctl disable heavydb --now

Remove the HEAVY.AI Installation files. (the $HEAVYAI_PATH defaults to /opt/heavyai)

sudo yum remove heavyai.x86_64

sudo apt remove heavyai

sudo rm -r $(readlink $HEAVYAI_PATH) $HEAVYAI_PATH

Delete the configuration files and the storage removing the $HEAVYAI_BASE directory. (defaults to /var/lib/heavyai)

sudo rm  -r $HEAVYAI_BASE

Remove permanently the configuration of the services.

sudo rm /lib/systemd/heavydb*.service
sudo rm /lib/systemd/heavy_web_server*.service
sudo systemctl daemon-reload
sudo systemctl reset-failed

Ports

HEAVY.AI uses the following ports.

Port

Service

Use

6273

heavy_web_server

Used to access Heavy Immerse.

6274

heavydb tcp

Used by connectors (heavyai, omnisql, odbc, and jdbc) to access the more efficient Thrift API.

6276

heavy_web_server

Used to access the HTTP/JSON thrift API.

6278

heavydb http

Used to directly access the HTTP/binary thrift API, without having to proxy through heavy_web_server. Recommended for debugging use only.

Services and Utilities

Using Services

HEAVY.AI features two system services: heavydb and heavy_web_server. You can start these services individually using systemd.

Starting and Stopping HeavyDB Using `systemd`

For permanent installations of HeavyDB, HEAVY.AI recommends that you use systemd to manage HeavyDB services. systemd automatically handles tasks such as log management, starting the services on restart, and restarting the services if there is a problem.

In addition, systemd manages the open-file limit in Linux. Some cloud providers and distributions set this limit too low, which can result in errors as your HEAVY.AI environment and usage grow. For more information about adjusting the limits on open files, see Why am I seeing the error "Too many open files...erno24" in the Troubleshooting and Monitoring Solutions section of our knowledgebase.

Initial Setup

You use the install_heavy_systemd.sh script to prepare systemd to run HEAVY.AI services. The script asks questions about your environment, then installs the systemd service files in the correct location. You must run the script as the root user so that the script can perform tasks such as creating directories and changing ownership.

cd $HEAVYAI_PATH/systemd
sudo ./install_heavy_systemd.sh

The install_heavy_systemd.sh script asks for the information described in the following table.

Variable

Use

Default

Notes

HEAVYAI_PATH

Path to HeavyDB installation directory

Current install directory

HEAVY.AI recommends heavyai as the install directory.

HEAVYAI_BASE

Path to the storage directory for HeavyDB data and configuration files

heavyai

Must be dedicated to HEAVY.AI. The installation script creates the directory $HEAVYAI_STORAGE/data, generates an appropriate configuration file, and saves the file as $HEAVYAI_STORAGE/heavy.conf.

HEAVYAI_USER

User HeavyDB is run as

Current user

User must exist before you run the script.

HEAVYAI_GROUP

Group HeavyDB is run as

Current user's primary group

Group must exist before you run the script.

Starting HeavyDB Using `systemd`

To manually start HeavyDB using systemd, run:

sudo systemctl start heavydb
sudo systemctl start heavy_web_server

Restarting HeavyDB Using `systemd`

You can use systemd to restart HeavyDB — for example, after making configuration changes:

sudo systemctl restart heavydb
sudo systemctl restart heavy_web_server

Stopping HeavyDB Using `systemd`

To manually stop HeavyDB using systemd, run:

sudo systemctl stop heavydb
sudo systemctl stop heavy_web_server

Enabling HeavyDB on Startup

To enable the HeavyDB services to start on restart, run:

sudo systemctl enable heavydb
sudo systemctl enable heavy_web_server

Using Configuration Parameters

You can customize the behavior of your HEAVY.AI servers by modifying your heavy.conf configuration file. See Configuration Parameters.

Using Utilities

HeavyDB includes the utilities initdb for database initialization and generate_cert for generating certificates and private keys for an HTTPS server.

initdb

Before using HeavyDB, initialize the data directory using initdb:

initdb [-f | --skip-geo] $HEAVYAI_BASE/storage

This creates three subdirectories:

catalogs: Stores HeavyDB catalogs
data: Stores HeavyDB data
log: Contains all HeavyDB log files.
disk_cache: Stores the data cached by HEAVY COnnect

The -f flag forces initdb to overwrite existing data and catalogs in the specified directory.

By default, initdb adds a sample table of geospatial data. Use the --skip-geo flag if you prefer not to load sample geospatial data.

generate_cert

generate_cert [{-ca} <bool>]
              [{-duration} <duration>]
              [{-ecdsa-curve} <string>]
              [{-host} <host1,host2>]
              [{-rsa-bits} <int>]
              [{-start-date} <string>]

This command generates certificates and private keys for an HTTPS server. The options are:

[{-ca} <bool>]: Whether this certificate should be its own Certificate Authority. The default is false.
[{-duration} <duration>]: Duration that the certificate is valid for. The default is 8760h0m0s.
[{-ecdsa-curve} <string>]: ECDSA curve to use to generate a key. Valid values are P224, P256, P384, P521.
[{-host} <string>]: Comma-separated hostnames and IPs to generate a certificate for.
[{-rsa-bits} <int>]: Size of RSA key to generate. Ignored if –ecdsa-curve is set. The default is 2048.
[{-start-date} <string>]: Start date formatted as Jan 1 15:04:05 2011

Executor Resource Manager

Overview

To enable concurrent execution of queries, we introduce the concept of an Executor Resource Manager (ERM). This keeps track of compute and memory resources to gate query execution and ensures that compute resources are not over-subscribed. As of version 7.0, ERM is enabled by default.

The ERM evaluates several kinds of resources required by a query. Currently this includes CPU cores, GPUs, buffer and result set memory. It will leverage all available resources unless policy limits have been established, such as for maximum memory use or query time. It determines both the ideal/maximum amount of resources desirable for optimal performance and the minimum required. For example, a CPU query scanning 8 fragments could run with up to 8 threads, but could execute with as little as a single CPU thread with correspondingly less memory if needed.

The ERM establishes a request queue. On every new request, as well as every time an existing request is completed, it checks available resources and picks the next resource request to grant. It currently always gives preference to earlier requests if resources permit launching them (first in, first out, or “FIFO”).

If the system-level multi-executor flag is enabled, the ERM will allow multiple queries to execute at once so long as resources are available. Currently, multiple execution is allowed for CPU queries (and multiple CPU queries and a single GPU query). This supports significant throughput gains by allowing inter-query-kernel concurrency, in addition to the major win of not having a long-running CPU query block the queue for other CPU queries or interactive GPU queries. The number of queries that can be run in parallel is limited by the number of executors

Use of CPU and GPU

By default, if HeavyDB is compiled to run on GPUs and if GPUs are available, query steps/kernels will execute on GPU UNLESS:

Some operations in the query step cannot run on GPU. Operators like MODE, APPROX_MEDIAN/PERCENTILE, and certain string functions are examples.
Update and delete queries currently run on CPU.
The query step requires more memory than available on GPU, but less than available on CPU.
A user explicitly requests their query run on CPU, either via setting a session flag or via a query hint.

At the instance level, this behavior can be configured with system flags on startup. For example a system with GPU can be configured to use only CPU using the cpu-only flag. Or the system use of CPU RAM can be controlled using cpu-buffer-mem-bytes. Execution can also be routed to different device types with query hints such as “SELECT /*+ cpu_mode */ …” These controls do not require the ERM but are platform-wide.

Example Use Cases

Example 1: (no tuning required)

In a scenario where the system hasn’t enough memory available for the cpu-cache or the cache itself is too fragmented to accommodate all the columns’ chunks into cpu-caches, the EMR instead of failing the query with an OOM error, will

run the query reading a single chunk at a time and moving data to GPU caches for a GPU execution.
in case that there isn’t enough GPU memory will run the query chunk by chunk in CPU mode. In this case the query will run slower, but this will free up the GPU executor for less memory demanding queries.

Example 2: (minimal tuning required)

You are deploying a new dashboard or chart which doesn’t require big data or high performance, and so you prefer to run it just on CPU. This way it doesn’t interfere with other performance-critical dashboards or charts.

Set the dashboard chart execution to CPU using query hints. Instead of referencing data directly, set a new “custom data source.” For example, if your data is in a table called ‘mydata’, In the custom source, after your SELECT keyword, add the CPU query hint: You can repeat this for a data source supporting any number of charts desired, including all charts.
Bump up the number of executors (default 4) to 6-8. With more executors free, the dashboard will perform better, without impacting the performance of the other dashboards.

Example 3: (some tuning required)

Improving performance of memory-intensive operations like high cardinality aggregates.

A user conducting exact “count distinct” operations on large datasets, with high cardinality that are likely to be run on CPU, on a server having many CPU cores might employ the following strategy:

Increase the number of executors (default 4) to 8-16. --num-executors=16
Limit CPU total memory use using --cpu-buffer-mem-bytes from default 80% to make some room for large result sets, that now are limited by the executor-cpu-result-mem-ratio.

If those query have sparse value or and high cardinality and are using a wide count distinct will be pushed to CPU execution. Change the executor-per-query-max-cpu-threads-ratio parameter to lower the number of cores that will run a single query; doing that the groupby buffers will be built in a faster way, lowering the memory footprint and speeding up the runtime of query.

Configuration Parameters

Overview

HEAVY.AI has minimal configuration requirements with a number of additional configuration options. This topic describes the required and optional configuration changes you can use in your HEAVY.AI instance.

In release 4.5.0 and higher, HEAVY.AI requires that all configuration flags used at startup match a flag on the HEAVY.AI server. If any flag is misspelled or invalid, the server does not start. This helps ensure that all settings are intentional and will not have an unexpected impact on performance or data integrity.

Storage Directory

Before starting the HEAVY.AI server, you must initialize the persistent storage directory. To do so, create an empty directory at the desired path, such as /var/lib/heavyai.

Create the environment variable $HEAVYAI_BASE.

2. Then, change the owner of the directory to the user that the server will run as ($HEAVYAI_USER):

where $HEAVYAI_USER is the system user account that the server runs as, such as heavyai, and $HEAVYAI_BASE is the path to the parent of the HEAVY.AI server storage directory.

3. Run $HEAVYAI_PATH/bin/initheavy with the storage directory path as the argument:

Configuring a Custom Heavy Immerse Subdirectory

Immerse serves the application from the root path (/) by default. To serve the application from a sub-path, you must modify the $HEAVYAI_PATH/frontend/app-config.js file to change the IMMERSE_PATH_PREFIX value. The Heavy Immerse path must start with a forward slash (/).

Configuration File

The configuration file stores runtime options for your HEAVY.AI servers. You can use the file to change the default behavior.

The heavy.conf file is stored in the $HEAVYAI_BASE directory. The configuration settings are picked up automatically by the sudo systemctl start heavydb and sudo systemctl start heavy_web_server commands.

Set the flags in the configuration file using the format <flag> = <value>. Strings must be enclosed in quotes.

The following is a sample configuration file. The entry for data path is a string and must be in quotes. The last entry in the first section, for null-div-by-zero, is the Boolean value true and does not require quotes.

To comment out a line in heavy.conf, prepend the line with the pound sign (#) character.

For encrypted backend connections, if you do not use a configuration file to start the database, Calcite expects passwords to be supplied through the command line, and calcite passwords will be visible in the processes table. If a configuration file is supplied, then passwords must be supplied in the file. If they are not, Calcite will fail.

Security

Implementing a Secure Binary Interface

Follow these instructions to start an HEAVY.AI server with an encrypted main port.

Required PKI Components

You need the following PKI (Public Key Infrastructure) components to implement a Secure Binary Interface.

A CRT (short for certificate) file containing the server's PKI certificate. This file must be shared with the clients that connect using encrypted communications. Ideally, this file is signed by a recognized certificate issuing agency.
A key file containing the server's private key. Keep this file secret and secure.
A Java TrustStore containing the server's PKI certificate. The password for the trust store is also required.

Although in this instance the trust store contains only information that can be shared, the Java TrustStore program requires it to be password protected.

A Java KeyStore and password.
In a distributed system, add the configuration parameters to the heavyai.conf file on the aggregator and all leaf nodes in your HeavyDB cluster.

Demonstration Script to Create "Mock/Test" PKI Components

You can use OpenSSL utilities to create the various PKI elements. The server certificate in this instance is self-signing, and should not be used in a production system.

Generate a new private key.
Use the private key to generate a certificate signing request.
Self sign the certificate signing request to create a public certificate.
Use the Java tools to create a key store from the public certificate.

To generate a keystore file from your server key:

Copy server.key to server.txt. Concatenate it with server.crt.
Use server.txt to create a PKCS12 file.
Use server.p12 to create a keystore.

Start the Server in Encrypted Mode with PKI Client Authentication

Start the server using the following options.

Example

Configuring heavyai.conf for Encrypted Connection

Alternatively, you can add the following configuration parameters to heavyai.conf to establish a Secure Binary Interface. The following configuration flags implement the same encryption shown in the runtime example above:

Passwords for the SSL truststore and keystore can be enclosed in single (') or double (") quotes.

Why Use Both server.crt and a Java TrustStore?

The server.crt file and the Java truststore contain the same public key information in different formats. Both are required by the server to establish both the secure client communication with the various interfaces and with its Calcite server. At startup, the Java truststore is passed to the Calcite server for authentication and to encrypt its traffic with the HEAVY.AI server.

Encrypted Credentials in Custom Applications

HEAVY.AI can accept a set of encrypted credentials for secure authentication of a custom application. This topic provides a method for providing an encryption key to generate encrypted credentials and configuration options for enabling decryption of those encrypted credentials.

Generating an Encryption Key

Generate a 128- or 256-bit encryption key and save it to a file. You can use https://acte.ltd/utils/randomkeygen to generate a suitable encryption key.

Configuring the Web Server

Set the file path of the encryption key file to the encryption-key-file-path web server parameter in heavyai.conf:

[web]
encryption-key-file-path = “path/to/file”

Alternatively, you can set the path using the --encryption-key-file-path=path/to/file command-line argument.

Generating Encrypted Credentials

Generate encrypted credentials for a custom application by running the following Go program, replacing the example key and credentials strings with an actual key and actual credentials. You can also run the program in a web browser at https://play.golang.org/p/nNBsZ8dhqr0.

package main

import (
    "crypto/aes"
    "crypto/cipher"
    "crypto/rand"
    
    "fmt"
    "io")
    
// 1. Replace example key with encryption string
var key = "v9y$B&E(H+MbQeThWmZq4t7w!z%C*F-J"

// 2. Replace strings "username", "password", "dbName"with credentials
var stringsToBeEncrypted = []string{
    "username",
    "password",
    "dbName",
}

// 3. Run program to see encrypted credentials in console
func main() {
    for i := range stringsToBeEncrypted {
        encrypted, err := EncryptString(stringsToBeEncrypted[i])
        if err != nil {
            panic(err)
        }
        fmt.Printf("%s => %s\n", stringsToBeEncrypted[i],encrypted)
    }
}

func EncryptString(str string) (encrypted string,err error) {
    keyBytes := []byte(key)
    
    block, err := aes.NewCipher(keyBytes)
    if err != nil {
        panic(err.Error())
    }
    aesGCM, err := cipher.NewGCM(block)
    if err != nil {
        panic(err.Error())
    }
    nonce := make([]byte, aesGCM.NonceSize())
    if _, err = io.ReadFull(rand.Reader, nonce); err!= nil {
        panic(err.Error())
    }
    strBytes := []byte(str)
    
    cipherBytes := aesGCM.Seal(nonce, nonce, strBytes,nil)
    
    return fmt.Sprintf("%x", cipherBytes), err
}

LDAP Integration

HEAVY.AI supports LDAP authentication using an IPA Server or Microsoft Active Directory.

You can configure HEAVY.AI Enterprise edition to map LDAP roles 1-to-1 to HEAVY.AI roles. When you enable this mapping, LDAP becomes the main authority controlling user roles in HEAVY.AI.

LDAP mapping is available only in HEAVY.AI Enterprise edition.

HEAVY.AI supports five configuration settings that allow you to integrate with your LDAP server.

Parameter

Description

Example

ldap-uri

LDAP server host or server URI.

ldap://myLdapServer.myCompany.com

ldap-dn

LDAP distinguished name (DN).

uid=$USERNAME,cn=users,cn=accounts, dc=myCompany,dc=com

ldap-role-query-url

Returns the role names a user belongs to in the LDAP.

ldap://myServer.myCompany.com/uid=$USERNAME, cn=users, cn=accounts,dc=myCompany,dc=com?memberOf

ldap-role-query-regex

Applies a regex filter to find matching roles from the roles in the LDAP server.

(MyCompany_.*?),

ldap-superuser-role

Identifies one of the filtered roles as a superuser role. If a user has this filtered ldap role, the user is marked as a superuser.

MyCompany_SuperUser

Obtaining Credential Information

To find the ldap-role-query-url and ldap-role-query-regex to use, query your user roles. For example, if there is a user named kiran on the IPA LDAP server ldap://myldapserver.mycompany.com, you could use the following curl command to get the role information:

$ curl --user "uid=kiran,cn=users,cn=accounts,dc=mycompany,dc=com" 
"ldap://myldapserver.mycompany.com/uid=kiran,cn=users,cn=accounts,dc=mycompany,dc=com?memberOf"

When successful, it returns information similar to the following:

DN: uid=kiran,cn=users,cn=accounts,dc=mycompany,dc=com
memberOf: cn=ipausers,cn=groups,cn=accounts,dc=mycompany,dc=com
memberOf: cn=MyCompany_SuperUser,cn=roles,cn=accounts,dc=mycompany,dc=com
memberOf: cn=test,cn=groups,cn=accounts,dc=mycompany,dc=com

ldap-dn matches the DN, which is uid=kiran,cn=users,cn=accounts,dc=mycompany,dc=com.
ldap-role-query-url includes the LDAP URI + the DN + the LDAP attribute that represents the role/group the member belongs to, such as memberOf.
ldap-role-query-regex is a regular expression that matches the role names. The matching role names are used to grant and revoke privileges in HEAVY.AI. For example, if we created some roles on an IPA LDAP server where the role names begin with MyCompany_ (for example, MyCompany_Engineering, MyCompany_Sales, MyCompany_SuperUser), the regular expression can filter the role names using MyCompany_.
ldap-superuser-role is the role/group name for HEAVY.AI users who are superusers once they log on to the HEAVY.AI database. In this example, the superuser role name is MyCompany_SuperUser.

Make sure that LDAP configuration appears before the [web] section of heavy.conf.

Double quotes are not required for LDAP properties in heavy.conf. For example, both of the following are valid:

ldap-uri = "ldap://myldapserver.mycompany.com" ldap-uri = ldap://myldapserver.mycompany.com

Setting Up LDAP with HEAVY.AI

To integrate LDAP with HEAVY.AI, you need the following:

A functional LDAP server, with all users/roles/groups created (ldap-uri, ldap-dn, ldap-role-query-url, ldap-role-query-regex, and ldap-superuser-role) to be used by HEAVY.AI. You can use the curl command to test and find the filters.
A functional HEAVY.AI server, version 4.1 or higher.

Once you have your server information, you can configure HEAVY.AI to use LDAP authentication.

Locate the heavy.conf file and edit it to include the LDAP parameter. For example:

ldap-uri = "ldap://myldapserver.mycompany.com"
ldap-dn = "uid=$USERNAME,cn=users,cn=accounts,dc=mycompany,dc=com"
ldap-role-query-url = "ldap://myldapserver.mycompany.com/uid=$USERNAME,cn=users,cn=accounts,dc=mycompany,dc=com?memberOf"
ldap-role-query-regex = "(MyCompany_.*?),"
ldap-superuser-role = "MyCompany_SuperUser"

Restart the HEAVY.AI server:

sudo systemctl restart heavyai_server
sudo systemctl restart heavyai_web_server

Log on to heavysql as MyCompany user, or any user who belongs to one of the roles/groups that match the filter.

When you use LDAP authentication, the default admin user and password HyperInteractive do not work unless you create the admin user with the same password on the LDAP server.

If your login fails, inspect $HEAVYAI_STORAGE/mapd_log/heavyai_server.INFO to check for any obvious errors about LDAP authentication.

Once you log in, you can create a new role name in heavysql, and then apply GRANT/REVOKE privileges to the role. Log in as another user with that role and confirm that GRANT/REVOKE works.

If you refresh the browser window, you are required to log in and reauthenticate.

Using LDAPS

To use LDAPS, HEAVY.AI must trust the LDAP server's SSL certificate. To achieve this, you must have the CA for the server's certificate, or the server certificate itself. Install the certificate as a trusted certificate.

IPA on CentOS

To use IPA as your LDAP server with HEAVY.AI running on CentOS 7:

Copy the IPA server CA certificate to your local machine.

scp root@myldapserver:/etc/ipa/ca.crt /etc/pki/ca-trust/source/anchors/ipa-ca.pem

Update the PKI certificates.
```
update-ca-trust
```
Edit /etc/openldap/ldap.conf to add the following line.
```
TLS_CACERT      /etc/pki/tls/certs/ca-bundle.crt
```

Locate the heavy.conf file and edit it to include the LDAP parameter. For example:

ldap-uri = "ldaps://myldapserver.mycompany.com"
ldap-dn = "uid=$USERNAME,cn=users,cn=accounts,dc=mycompany,dc=com"
ldap-role-query-url = "ldaps://myldapserver.mycompany.com/uid=$USERNAME,cn=users,cn=accounts,dc=mycompany,dc=com?memberOf"
ldap-role-query-regex = "(MyCompany_.*?),"
ldap-superuser-role = "MyCompany_SuperUser"

Restart the HEAVY.AI server:

sudo systemctl restart heavyaidb
sudo systemctl restart heavyai_web_server

IPA on Ubuntu

To use IPA as your LDAP server with HEAVY.AI running on Ubuntu:

Copy the IPA server CA certificate to your local machine.

mkdir /usr/local/share/ca-certificates/ipa
scp root@myldapserver:/etc/ipa/ca.crt /usr/local/share/ca-certificates/ipa/ipa-ca.pem

Rename ipa-ca.crm to ipa-ca.crt so that the certificates bundle update script can find it:

mv /usr/local/share/ca-certificates/ipa/ipa-ca.pem /usr/local/share/ca-certificates/ipa/ipa-ca.crt

Update the PKI certificates:
```
update-ca-certificates
```
Edit /etc/openldap/ldap.conf to add the following line:
```
TLS_CACERT      /etc/ssl/certs/ca-certificates.crt
```

Locate the heavy.conf file and edit it to include the LDAP parameter. For example:

ldap-uri = "ldaps://myldapserver.mycompany.com"
ldap-dn = "uid=$USERNAME,cn=users,cn=accounts,dc=mycompany,dc=com"
ldap-role-query-url = "ldaps://myldapserver.mycompany.com/uid=$USERNAME,cn=users,cn=accounts,dc=mycompany,dc=com?memberOf"
ldap-role-query-regex = "(MyCompany_.*?),"
ldap-superuser-role = "MyCompany_SuperUser"

Restart the HEAVY.AI server:

sudo systemctl restart heavydb
sudo systemctl restart heavyai_web_server

Active Directory

1. Locate the heavy.conf file and edit it to include the LDAP parameter.

Example 1:

ldap-uri = "ldap://myldapserver.mycompany.com"
ldap-dn = "cn=$USERNAME,cn=users,dc=qa-mycompany,dc=com"
ldap-role-query-url = "ldap:///myldapserver.mycompany.com/cn=$USERNAME,cn=users,dc=qa-mycompany,dc=com?memberOf"
ldap-role-query-regex = "(HEAVYAI_.*?),"
ldap-superuser-role = "HEAVYAI_SuperUser"

Example 2:

ldap-uri = "ldap://myldapserver.mycompany.com"
ldap-dn = "[email protected]"
ldap-role-query-url = "ldap:///myldapserver.mycompany.com/OU=MyCompany Users,dc=MyCompany,DC=com?memberOf?sub?(sAMAccountName=$USERNAME)"
ldap-role-query-regex = "(HEAVYAI_.*?),"
ldap-superuser-role = "HEAVYAI_SuperUser"

2. Restart the HEAVY.AI server:

sudo systemctl restart heavyai_server
sudo systemctl restart heavyai_web_server

Other LDAP user authentication attributes, such as userPrincipalName, are not currently supported.

Loading and Exporting Data

Supported Data Sources

Kafka

is a distributed streaming platform. It allows you to create publishers, which create data streams, and consumers, which subscribe to and ingest the data streams produced by publishers.

You can use HeavyDB C++ program to consume a topic created by running Kafka shell scripts from the command line. Follow the procedure below to use a Kafka producer to send data, and a Kafka consumer to store the data, in HeavyDB.

This example assumes you have already installed and configured Apache Kafka. See the .

Creating a Topic

Create a sample topic for your Kafka producer.

Run the kafka-topics.sh script with the following arguments:
Create a file named myfile that consists of comma-separated data. For example:
Use heavysql to create a table to store the stream.

Using the Producer

Load your file into the Kafka producer.

Create and start a producer using the following command.

Using the Consumer

Load the data to HeavyDB using the Kafka console consumer and the KafkaImporter program.

Pull the data from Kafka into the KafkaImporter program.
Verify that the data arrived using heavysql.

Command Line

Exporting Data

COPY TO

COPY ( <SELECT statement> ) TO '<file path>' [WITH (<property> = value, ...)];

<file path> must be a path on the server. This command exports the results of any SELECT statement to the file. There is a special mode when <file path> is empty. In that case, the server automatically generates a file in <HEAVY.AI Directory>/export that is the client session id with the suffix .txt.

Available properties in the optional WITH clause are described in the following table.

Parameter

Description

Default Value

array_null_handling

Define how to export with arrays that have null elements:

'abort' - Abort the export. Default.
'raw' - Export null elements as raw values.
'zero' - Export null elements as zero (or an empty string).
'nullfield' - Set the entire array column field to null for that row.

Applies only to GeoJSON and GeoJSONL files.

'abort'

delimiter

A single-character string for the delimiter between column values; most commonly:

, for CSV files
\t for tab-delimited files

Other delimiters include | ,~, ^, and;.

Applies to only CSV and tab-delimited files.

Note: HEAVY.AI does not use file extensions to determine the delimiter.

',' (CSV file)

escape

A single-character string for escaping quotes. Applies to only CSV and tab-delimited files.

' (quote)

file_compression

File compression; can be one of the following:

'none'
'gzip'
'zip'

For GeoJSON and GeoJSONL files, using GZip results in a compressed single file with a .gz extension. No other compression options are currently available.

'none'

file_type

Type of file to export; can be one of the following:

'csv' - Comma-separated values file.
'geojson' - FeatureCollection GeoJSON file.
'geojsonl' - Multiline GeoJSONL file.
'shapefile' - Geospatial shapefile.

For all file types except CSV, exactly one geo column (POINT, LINESTRING, POLYGON or MULTIPOLYGON) must be projected in the query. CSV exports can contain zero or any number of geo columns, exported as WKT strings.

Export of array columns to shapefiles is not supported.

'csv'

header

Either 'true' or 'false', indicating whether to output a header line for all the column names. Applies to only CSV and tab-delimited files.

'true'

layer_name

A layer name for the geo layer in the file. If unspecified, the stem of the given filename is used, without path or extension.

Applies to all file types except CSV.

Stem of the filename, if unspecified

line_delimiter

A single-character string for terminating each line. Applies to only CSV and tab-delimited files.

'\n'

nulls

A string pattern indicating that a field is NULL. Applies to only CSV and tab-delimited files.

An empty string, 'NA', or \N

quote

A single-character string for quoting a column value. Applies to only CSV and tab-delimited files.

" (double quote)

quoted

Either 'true' or 'false', indicating whether all the column values should be output in quotes. Applies to only CSV and tab-delimited files.

'true'

When using the COPY TO command, you might encounter the following error:

Query couldn’t keep the entire working set of columns in GPU Memory.

To avoid this error, use the heavysql command \cpu to put your HEAVY.AI server in CPU mode before using the COPY TO command. See Configuration.

Example

COPY (SELECT * FROM tweets) TO '/tmp/tweets.csv';
COPY (SELECT * tweets ORDER BY tweet_time LIMIT 10000) TO
  '/tmp/tweets.tsv' WITH (delimiter = '\t', quoted = 'true', header = 'false');

SQL

Data Definition (DDL)

Views

DDL - Views

A view is a virtual table based on the result set of a SQL statement. It derives its fields from a SELECT statement. You can do anything with a HEAVY.AI view query that you can do in a non-view HEAVY.AI query.

Nomenclature Constraints

View object names must use the NAME format, described in regex notation as:

[A-Za-z_][A-Za-z0-9\$_]*

CREATE VIEW

Creates a view based on a SQL statement.

Example

CREATE VIEW view_movies
AS SELECT movies.movieId, movies.title, movies.genres, avg(ratings.rating)
FROM ratings
JOIN movies on ratings.movieId=movies.movieId
GROUP BY movies.title, movies.movieId, movies.genres;

You can describe the view as you would a table.

\d view_movies
VIEW defined AS: SELECT  movies.movieId, movies.title, movies.genres,
avg(ratings.rating) FROM ratings JOIN movies ON ratings.movieId=movies.movieId
GROUP BY movies.title, movies.movieId, movies.genres
Column types:
    movieId INTEGER,
    title TEXT ENCODING DICT(32),
    genres TEXT ENCODING DICT(32),
    EXPR$3 DOUBLE

You can query the view as you would a table.

SELECT title, EXPR$3 from view_movies where movieId=260;
Star Wars: Episode IV - A New Hope (1977)|4.048937

DROP VIEW

Removes a view created by the CREATE VIEW statement. The view definition is removed from the database schema, but no actual data in the underlying base tables is modified.

Example

DROP VIEW IF EXISTS v_reviews;

Policies

You can use policies to provide row-level security (RLS) in HEAVY.AI.

CREATE POLICY

Create an RLS policy for a user or role (<name>); admin rights are required. All queries on the table for the user or role are automatically filtered to include only rows where the column contains any one of the values from the VALUES clause.

RLS filtering works similarly to a WHERE column = value clause, appended to every query or subquery on the table, would work. If policies on multiple columns in the same table are defined for a user or role, then a row is visible to that user or role if any one or more of the policies matches that row.

DROP POLICY

Drop an RLS policy for a user or role (<name>); admin rights are required. All values specified for the column by the policy are dropped. Effective values from another policy on an inherited role are not dropped.

SHOW POLICIES

Displays a list of all RLS policies that exist for a user or role. If EFFECTIVE is used, the list also include any policies that exist for all roles that apply to the requested user or role.

Data Manipulation (DML)

SQL Capabilities

ALTER SESSION SET

Change a parameter value for the current session.

Paremeter name

Values

Alter Session Examples

CURRENT_DATABASE

Switch to another database without need of re-login.

Your session will silently switch to the requested database.

The database exists, but the user does not have access to it:

The database does not exist:

EXECUTOR_DEVICE

Force the session to run the subsequent SQL commands in CPU mode:

Switch back the session to run in GPU mode

ALTER SYSTEM CLEAR

Clear CPU, GPU, or RENDER memory. Available to super users only.

ALTER SYSTEM CLEAR (CPU|GPU|RENDER) MEMORY

Examples

ALTER SYSTEM CLEAR CPU MEMORY

ALTER SYSTEM CLEAR GPU MEMORY

ALTER SYSTEM CLEAR RENDER MEMORY

Generally, the server handles memory management, and you do not need to use this command. If you are having unexpected memory issues, try clearing the memory to see if performance improves.

DELETE

Deletes rows that satisfy the WHERE clause from the specified table. If the WHERE clause is absent, all rows in the table are deleted, resulting in a valid but empty table.

Cross-Database Queries

In Release 6.4 and higher, you can run DELETE queries across tables in different databases on the same HEAVY.AI cluster without having to first connect to those databases.

To execute queries against another database, you must have ACCESS privilege on that database, as well as DELETE privilege.

Example

Delete rows from a table in the my_other_db database:

INSERT

Use INSERT for both single- and multi-row ad hoc inserts. (When inserting many rows, use the more efficient command.)

Examples

You can also insert into a table as SELECT, as shown in the following examples:

You can insert array literals into array columns. The inserts in the following example each have three array values, and demonstrate how you can:

Create a table with variable-length and fixed-length array columns.
Insert NULL arrays into these colums.
Specify and insert array literals using {...} or ARRAY[...] syntax.
Insert empty variable-length arrays using{} and ARRAY[] syntax.
Insert array values that contain NULL elements.

Default Values

If you with column that has a default value, or to add a column with a default value, using the INSERT command creates a record that includes the default value if it is omitted from the INSERT. For example, assume a table created as follows:

If you omit the name column from an INSERT or INSERT FROM SELECT statement, the missing value for column name is set to 'John Doe'.

INSERT INTO tbl (id, age) VALUES (1, 36); creates the record 1|'John Doe'|36 .

INSERT INTO tbl (id, age) SELECT id, age FROM old_tbl; also sets all the name values to John Doe .

KILL QUERY

Interrupt a queued query. Specify the query by using its session ID.

To see the queries in the queue, use the SHOW QUERIES command:

show queries;
query_session_id|current_status      |executor_id|submitted     |query_str       |login_name|client_address            |db_name|exec_device_type
713-t1ax        |PENDING_QUEUE       |0          |2021-08-03 ...|SELECT ...      |John      |http:::1                  |omnisci|GPU
491-xpfb        |PENDING_QUEUE       |0          |2021-08-03 ...|SELECT ...      |Patrick   |http:::1                  |omnisci|GPU
451-gp2c        |PENDING_QUEUE       |0          |2021-08-03 ...|SELECT ...      |John      |http:::1                  |omnisci|GPU
190-5pax        |PENDING_EXECUTOR    |1          |2021-08-03 ...|SELECT ...      |Cavin     |http:::1                  |omnisci|GPU
720-nQtV        |RUNNING_QUERY_KERNEL|2          |2021-08-03 ...|SELECT ...      |Cavin     |tcp:::ffff:127.0.0.1:50142|omnisci|GPU
947-ooNP        |RUNNING_IMPORTER    |0          |2021-08-03 ...|IMPORT_GEO_TABLE|Rio       |tcp:::ffff:127.0.0.1:47314|omnisci|CPU

To interrupt the last query in the list (ID 946-ooNP):

kill query '946-ooNP'

Showing the queries again indicates that 946-ooNP has been deleted:

show queries;
query_session_id|current_status      |executor_id|submitted     |query_str       |login_name|client_address            |db_name|exec_device_type
713-t1ax        |PENDING_QUEUE       |0          |2021-08-03 ...|SELECT ...      |John      |http:::1                  |omnisci|GPU
491-xpfb        |PENDING_QUEUE       |0          |2021-08-03 ...|SELECT ...      |Patrick   |http:::1                  |omnisci|GPU
451-gp2c        |PENDING_QUEUE       |0          |2021-08-03 ...|SELECT ...      |John      |http:::1                  |omnisci|GPU
190-5pax        |PENDING_EXECUTOR    |1          |2021-08-03 ...|SELECT ...      |Cavin     |http:::1                  |omnisci|GPU
720-nQtV        |RUNNING_QUERY_KERNEL|2          |2021-08-03 ...|SELECT ...      |Cavin     |tcp:::ffff:127.0.0.1:50142|omnisci|GPU

KILL QUERY is only available if the runtime query interrupt parameter (enable-runtime-query-interrupt) is set.
Interrupting a query in ‘PENDING_QUEUE’ status is supported in both distributed and single-server mode.
To enable query interrupt for tables imported from data files in local storage, set enable_non_kernel_time_query_interrupt to TRUE. (It is enabled by default.)

LIKELY/UNLIKELY

Expression

Description

LIKELY(X)

Provides a hint to the query planner that argument X is a Boolean value that is usually true. The planner can prioritize filters on the value X earlier in the execution cycle and return results more efficiently.

UNLIKELY(X)

Provides a hint to the query planner that argument X is a Boolean value that is usually not true. The planner can prioritize filters on the value X later in the execution cycle and return results more efficiently.

Usage Notes

SQL normally assumes that terms in the WHERE clause that cannot be used by indices are usually true. If this assumption is incorrect, it could lead to a suboptimal query plan. Use the LIKELY(X) and UNLIKELY(X) SQL functions to provide hints to the query planner about clause terms that are probably not true, which helps the query planner to select the best possible plan.

Use LIKELY/UNLIKELY to optimize evaluation of OR/AND logical expressions. LIKELY/UNLIKELY causes the left side of an expression to be evaluated first. This allows the right side of the query to be skipped when possible. For example, in the clause UNLIKELY(A) AND B, if A evaluates to FALSE, B does not need to be evaluated.

Consider the following:

SELECT COUNT(*) FROM test WHERE UNLIKELY(x IN (7, 8, 9, 10)) AND y > 42;

If x is one of the values 7, 8, 9, or 10, the filter y > 42 is applied. If x is not one of those values, the filter y > 42 is not applied.

UPDATE

Changes the values of the specified columns based on the assign argument (identifier=expression) in all rows that satisfy the condition in the WHERE clause.

UPDATE table_name SET assign [, assign ]* [ WHERE booleanExpression ]

Example

UPDATE UFOs SET shape='ovate' where shape='eggish';

Currently, HEAVY.AI does not support updating a geo column type (POINT, MULTIPOINT, LINESTRING, MULTILINESTRING, POLYGON, or MULTIPOLYGON) in a table.

Update Via Subquery

You can update a table via subquery, which allows you to update based on calculations performed on another table.

Examples

UPDATE test_facts SET lookup_id = (SELECT SAMPLE(test_lookup.id) 
FROM test_lookup WHERE test_lookup.val = test_facts.val);

UPDATE test_facts SET val = val+1, lookup_id = (SELECT SAMPLE(test_lookup.id)
FROM test_lookup WHERE test_lookup.val = test_facts.val);

UPDATE test_facts SET lookup_id = (SELECT SAMPLE(test_lookup.id) 
FROM test_lookup WHERE test_lookup.val = test_facts.val) WHERE id < 10;

Cross-Database Queries

In Release 6.4 and higher, you can run UPDATE queries across tables in different databases on the same HEAVY.AI cluster without having to first connect to those databases.

To execute queries against another database, you must have ACCESS privilege on that database, as well as UPDATE privilege.

Example

Update a row in a table in the my_other_db database:

UPDATE my_other_db.customers SET name = 'Joe' WHERE id = 10;

Arrays

HEAVY.AI supports arrays in dictionary-encoded text and number fields (TINYINT, SMALLINT, INTEGER, BIGINT, FLOAT, and DOUBLE). Data stored in arrays are not normalized. For example, {green,yellow} is not the same as {yellow,green}. As with many SQL-based services, OmniSci array indexes are 1-based.

HEAVY.AI supports NULL variable-length arrays for all integer and floating-point data types, including dictionary-encoded string arrays. For example, you can insert NULL into BIGINT[ ], DOUBLE[ ], or TEXT[ ] columns. HEAVY.AI supports NULL fixed-length arrays for all integer and floating-point data types, but not for dictionary-encoded string arrays. For example, you can insert NULL into BIGINT[2] DOUBLE[3], but not into TEXT[2] columns.

Expression

Description

ArrayCol[n] ...

Returns value(s) from specific location n in the array.

UNNEST(ArrayCol)

Extract the values in the array to a set of rows. Requires GROUP BY; projecting UNNEST is not currently supported.

test = ANY ArrayCol

ANY compares a scalar value with a single row or set of values in an array, returning results in which at least one item in the array matches. ANY must be preceded by a comparison operator.

test = ALL ArrayCol

ALL compares a scalar value with a single row or set of values in an array, returning results in which all records in the array field are compared to the scalar value. ALL must be preceded by a comparison operator.

CARDINALITY()

Returns the number of elements in an array. For example:

Examples

The following examples show query results based on the table test_array created with the following statement:

CREATE TABLE test_array (name TEXT ENCODING DICT(32),colors TEXT[] ENCODING DICT(32), qty INT[]);

omnisql> SELECT * FROM test_array;
name|colors|qty
Banana|{green, yellow}|{1, 2}
Cherry|{red, black}|{1, 1}
Olive|{green, black}|{1, 0}
Onion|{red, white}|{1, 1}
Pepper|{red, green, yellow}|{1, 2, 3}
Radish|{red, white}|{}
Rutabaga|NULL|{}
Zucchini|{green, yellow}|{NULL}

omnisql> SELECT UNNEST(colors) AS c FROM test_array;
Exception: UNNEST not supported in the projection list yet.

omnisql> SELECT UNNEST(colors) AS c, count(*) FROM test_array group by c;
c|EXPR$1
green|4
yellow|3
red|4
black|2
white|2

omnisql> SELECT name, colors [2] FROM test_array;
name|EXPR$1
Banana|yellow
Cherry|black
Olive|black
Onion|white
Pepper|green
Radish|white
Rutabaga|NULL
Zucchini|yellow

omnisql> SELECT name, colors FROM test_array WHERE colors[1]='green';
name|colors
Banana|{green, yellow}
Olive|{green, black}
Zucchini|{green, yellow}

omnisql> SELECT * FROM test_array WHERE colors IS NULL;
name|colors|qty
Rutabaga|NULL|{}

The following queries use arrays in an INTEGER field:

omnisql> SELECT name, qty FROM test_array WHERE qty[2] >1;
name|qty
Banana|{1, 2}
Pepper|{1, 2, 3}

omnisql> SELECT name, qty FROM test_array WHERE 15< ALL qty;
No rows returned.

omnisql> SELECT name, qty FROM test_array WHERE 2 = ANY qty;
name|qty
Banana|{1, 2}
Pepper|{1, 2, 3}

omnisql> SELECT COUNT(*) FROM test_array WHERE qty IS NOT NULL;
EXPR$0
8

omnisql> SELECT COUNT(*) FROM test_array WHERE CARDINALITY(qty)<0;
EXPR$0
6

Table Expression and Join Support

If a join column name or alias is not unique, it must be prefixed by its table name.

You can use BIGINT, INTEGER, SMALLINT, TINYINT, DATE, TIME, TIMESTAMP, or TEXT ENCODING DICT data types. TEXT ENCODING DICT is the most efficient because corresponding dictionary IDs are sequential and span a smaller range than, for example, the 65,535 values supported in a SMALLINT field. Depending on the number of values in your field, you can use TEXT ENCODING DICT(32) (up to approximately 2,150,000,000 distinct values), TEXT ENCODING DICT(16) (up to 64,000 distinct values), or TEXT ENCODING DICT(8) (up to 255 distinct values). For more information, see .

Geospatial Joins

When possible, joins involving a geospatial operator (such as ST_Contains) build a binned spatial hash table (overlaps hash join), falling back to a Cartesian loop join if a spatial hash join cannot be constructed.

The enable-overlaps-hashjoin flag controls whether the system attempts to use the overlaps spatial join strategy (true by default). If enable-overlaps-hashjoin is set to false, or if the system cannot build an overlaps hash join table for a geospatial join operator, the system attempts to fall back to a loop join. Loop joins can be performant in situations where one or both join tables have a small number of rows. When both tables grow large, loop join performance decreases.

Two flags control whether or not the system allows loop joins for a query (geospatial for not): allow-loop-joins and trivial-loop-join-threshold. By default, allow-loop-joins is set to false and trivial-loop-join-threshold to 1,000 (rows). If allow allow-loop-joins is set to true, the system allows any query with a loop join, regardless of table cardinalities (measured in number of rows). If left to the implicit default of false or set explicitly to false, the system allows loop join queries as long as the inner table (right-side table) has fewer rows than the threshold specified by trivial-loop-join-threshold.

For optimal performance, the system should utilize overlaps hash joins whenever possible. Use the following guidelines to maximize the use of the overlaps hash join framework and minimize fallback to loop joins when conducting geospatial joins:

The inner (right-side) table should always be the more complicated primitive. For example, for ST_Contains(polygon, point), the point table should be the outer (left) table and the polygon table should be the inner (right) table.
Currently, ST_CONTAINS and ST_INTERSECTS joins between point and polygons/multi-polygon tables, and ST_DISTANCE < {distance} between two point tables are supported for accelerated overlaps hash join queries.
For pointwise-distance joins, only the pattern WHERE ST_DISTANCE(table_a.point_col, table_b.point_col) < distance_in_degrees supports overlaps hash joins. Patterns like the following fall back to loop joins:
- WHERE ST_DWITHIN(table_a.point_col, table_b.point_col, distance_in_degrees)
- WHERE ST_DISTANCE(ST_TRANSFORM(table_a.point_col, 900913), ST_TRANSFORM(table_b.point_col, 900913)) < 100

Using Joins in a Distributed Environment

You can create joins in a distributed environment in two ways:

Replicate small dimension tables that are used in the join.
Create a shard key on the column used in the join (note that there is a limit of one shard key per table). If the column involved in the join is a TEXT ENCODED field, you must create a SHARED DICTIONARY that references the FACT table key you are using to make the join.

The join order for one small table and one large table matters. If you swap the sales and customer tables on the join, it throws an exception stating that table "sales" must be replicated.

Uber H3 Hexagonal Modeling

Uber H3 Functions

Overview

Uber H3 is an open-source geospatial system created by Uber Technologies . H3 provides a hierarchical grid system that divides the Earth's surface into hexagons of varying sizes, allowing for easy location-based indexing, search, and analysis.

Hexagons can be created at a single scale, for instance to fill an arbitrary polygon at one resolution (see below). They can also be used to generate a much-smaller number of hexagons at multiple scales. In general, operating on h3 hexagons is much faster than on raw arbitrary geometries, at a cost of some precision. Because each hexagon is exactly the same size, this is particularly advantageous for GPU-accelerated workflows.

Advantages

A principal advantage of the system is that for a given scale, hexagons are approximately-equal area. This stands in contrast to other subdivision schemes based on longitudes and latitudes or web Mercator map projections.

A second advantage is that with hexagons, neighbors in all directions are equidistant. This is not true for rectangular subdivisions like pixels, whose 8 neighbors are at different distances.

The exact amount of precision lost can be tightly bounded, with the smallest sized hexagons supported being about 1m2. That’s more accurate than most current available data sources, short of survey data.

Disadvantages

There are some disadvantages to be aware of. The first is that the world can not actually be divided up completely cleanly into hexagons. It turns out that a few pentagons are needed, and this introduces discontinuities. However the system has cleverly placed those pentagons far away from any land masses, so this is only practically a concern for specific maritime operations.

The second issue is that hexagons at adjacent scales do not nest exactly:

This doesn’t much affect practical operations at any single given scale. But if you look carefully at the California multiscale plot above you will discover tiny discontinuities in the form of gaps or overlaps. These don’t amount to a large percentage of the total area, but nonetheless mean this method is not appropriate when exact counts are required.

Supported Methods

Encodes columnar point geometry into a globally-unique h3 cell ID for the specified h3_scale. Scales run from 0 to 15 inclusive, where 0 represents the coarsest resolution and 15 the finest. For details on h3 scales, please see the base library documentation.

This can be applied to literal values:

Or to columnar geographic points:

Note that if you have geographic point data rather than columnar latitude and longitude, you can use the ST_X and ST_Y functions. Also, if you wish to encode the centroids of polygons, such as for building footprints, you can combine this with the ST_CENTROID function.

From Hex Codes to Geometry

To retrieve geometric coordinates from an H3 code, two functions are available.

h3ToLat and h3ToLon extract the latitude and longitude respectively, for example:

Following Parent Relationships

Given an H3 code, the function h3ToParent is available to find cells above that cell at any hierarchical level. This means that once codes are computed at high resolution, they can be compared to codes at other scales.

H3 Usage Notes

Uber's h3 python library provides a wider range of functions than those available above (although at significantly slower performance). The library defaults to generating h3 codes as hexadecimal strings, but can be configured to support BIGINT codes. Please see Uber's documentation for details.

H3 codes can be used in regular joins, including joins in Immerse. They can also be used as aggregators, such as in Immerse custom dimensions. For points which are exactly aligned, such as imports from raster data bands of the same source, aggregating on H3 codes is faster than the exact geographic overlaps function ST_EQUALS

generate_random_strings

Generates random string data.

SELECT * FROM TABLE(generate_random_strings(<num_strings>, <string_length>/)

Input Arguments

Parameter

Description

Data Type

<num_strings>

The number of strings to randomly generate.

BIGINT

<string_length>

Length of the generated strings.

BIGINT

Output Columns

Name

Description

Data Type

Integer id of output, starting at 0 and increasing monotonically

Column<BIGINT>

rand_str

Random String

Column<TEXT ENCODING DICT>

Example

heavysql> SELECT * FROM TABLE(generate_random_strings(10, 20);
id|rand_str
0 |He9UeknrGYIOxHzh5OZC
1 |Simnx7WQl1xRihLiH56u
2 |m5H1lBTOErpS8is00YJ
3 |eeDiNHfKzVQsSg0qHFS0
4 |JwOhUoQEI6Z0L78mj8jo
5 |kBTbSIMm25dvf64VMi
6 |W3lUUvC5ajm0W24JML
7 |XdtSQfdXQ85nvaIoyYUY
8 |iUTfGN5Jaj25LjGJhiRN
9 |72GUoTK2BzcBJVTgTGW

generate_series

generate_series (Integers)

Generate a series of integer values.

SELECT * FROM TABLE(
    generate_series(
        <series_start>,
        <series_end>
        [, <increment>]
    )

Input Arguments

Parameter

Description

Data Types

<series_start>

Starting integer value, inclusive.

BIGINT

<series_end>

Ending integer value, inclusive.

BIGINT

<series_step> (optional, defaults to 1)

Increment to increase or decrease and values that follow. Integer.

BIGINT

Output Columns

Name

Description

Data Types

generate_series

The integer series specified by the input arguments.

Column<BIGINT>

Example

heavysql> select * from table(generate_series(2, 10, 2)); 
series 
2 
4 
6 
8 
10 
5 rows returned.

heavysql> select * from table(generate_series(8, -4, -3)); 
series 
8 
5 
2 
-1 
-4
5 rows returned.

generate_series (Timestamps)

Generate a series of timestamp values from start_timestamp to end_timestamp .

SELECT * FROM TABLE(
    generate_series(
        <series_start>,
        <series_end>,
        <series_step>
    )
)

Input Arguments

Parameter

Description

Data Types

series_start

Starting timestamp value, inclusive.

TIMESTAMP(9) (Timestamp literals with other precisions will be auto-casted to TIMESTAMP(9) )

series_end

Ending timestamp value, inclusive.

TIMESTAMP(9) (Timestamp literals with other precisions will be auto-casted to TIMESTAMP(9) )

series_step

Time/Date interval signifying step between each element in the returned series.

INTERVAL

Output Columns

Name

Description

Output Types

generate_series

The timestamp series specified by the input arguments.

COLUMN<TIMESTAMP(9)>

Example

SELECT
  generate_series AS ts
FROM
  TABLE(
    generate_series(
      TIMESTAMP(0) '2021-01-01 00:00:00',
      TIMESTAMP(0) '2021-09-04 00:00:00',
      INTERVAL '1' MONTH
    )
  )
  ORDER BY ts;
  
ts
2021-01-01 00:00:00.000000000
2021-02-01 00:00:00.000000000
2021-03-01 00:00:00.000000000
2021-04-01 00:00:00.000000000
2021-05-01 00:00:00.000000000
2021-06-01 00:00:00.000000000
2021-07-01 00:00:00.000000000
2021-08-01 00:00:00.000000000
2021-09-01 00:00:00.000000000

tf_compute_dwell_times

Given a query input with entity keys (for example, user IP addresses) and timestamps (for example, page visit timestamps), and parameters specifying the minimum session time, the minimum number of session records, and the max inactive seconds, outputs all unique sessions found in the data with the duration of the session (dwell time).

Syntax

select * from table( 
  tf_compute_dwell_times( 
    data => CURSOR( 
      select 
        entity_id, 
        site_id, 
        ts, 
      from 
        <table> 
      where 
        ... 
        ), 
        min_dwell_seconds => <seconds>, 
        min_dwell_points => <points>, 
        max_inactive_seconds => <seconds> 
        ) 
      );

Input Arguments

Parameter

Description

Data Type

entity_id

Column containing keys/IDs used to identify the entities for which dwell/session times are to be computed. Examples include IP addresses of clients visiting a website, login IDs of database users, MMSIs of ships, and call signs of airplanes.

Column<TEXT ENCODING DICT | BIGINT>

site_id

Column containing keys/IDs of dwell “sites” or locations that entities visit. Examples include website pages, database session IDs, ports, airport names, or binned h3 hex IDs for geographic location.

Column<TEXT ENCODING DICT | BIGINT>

ts

Column denoting the time at which an event occurred.

Column<TIMESTAMP(0|3|6|0)>

min_dwell_seconds

Constant integer value specifying the minimum number of seconds required between the first and last timestamp-ordered record for an entity_id at a site_id to constitute a valid session and compute and return an entity’s dwell time at a site. For example, if this variable is set to 3600 (one hour), but only 1800 seconds elapses between an entity’s first and last ordered timestamp records at a site, these records are not considered a valid session and a dwell time for that session is not calculated.

BIGINT (other integer types are automatically casted to BIGINT)

min_dwell_points

A constant integer value specifying the minimum number of successive observations (in ts timestamp order) required to constitute a valid session and compute and return an entity’s dwell time at a site. For example, if this variable is set to 3, but only two consecutive records exist for a user at a site before they move to a new site, no dwell time is calculated for the user.

BIGINT (other integer types are automatically casted to BIGINT)

max_inactive_seconds

A constant integer value specifying the maximum time in seconds between two successive observations for an entity at a given site before the current session/dwell time is considered finished and a new session/dwell time is started. For example, if this variable is set to 86400 seconds (one day), and the time gap between two successive records for an entity id at a given site id is 86500 seconds, the session is considered ended at the first timestamp-ordered record, and a new session is started at the timestamp of the second record.

BIGINT (other integer types are automatically casted to BIGINT)

Output Columns

Name

Description

Data Type

entity_id

The ID of the entity for the output dwell time, identical to the corresponding entity_id column in the input.

Column<TEXT ENCODING DICT> | Column<BIGINT> (type is the same as the entity_id input column type)

site_id

The site ID for the output dwell time, identical to the corresponding site_id column in the input.

Column<TEXT ENCODING DICT> | Column<BIGINT> (type is the same as the site_id input column type)

prev_site_id

The site ID for the session preceding the current session, which might be a different site_id, the same site_id (if successive records for an entity at the same site were split into multiple sessions because the max_inactive_seconds threshold was exceeded), or null if the last site_id visited was null.

Column<TEXT ENCODING DICT> | Column<BIGINT> (type is the same as the site_id input column type)

next_site_id

The site id for the session after the current session, which might be a different site_id, the same site_id (if successive records for an entity at the same site were split into multiple sessions due to exceeding the max_inactive_seconds threshold, or null if the next site_id visited was null.

Column<TEXT ENCODING DICT> | Column<BIGINT> (type will be the same as the site_id input column type)

session_id

An auto-incrementing session ID specific/relative to the current entity_id, starting from 1 (first session) up to the total number of valid sessions for an entity_id, such that each valid session dwell time increments the session_id for an entity by 1.

Column<INT>

start_seq_id

The index of the nth timestamp (ts-ordered) record for a given entity denoting the start of the current output row's session.

Column<INT>

dwell_time_sec

The duration in seconds for the session.

Column<INT>

num_dwell_points

The number of records/observations constituting the current output row's session.

Column<INT>

Example

/* Data from https://www.kaggle.com/datasets/vodclickstream/netflix-audience-behaviour-uk-movies */

select
  *
from
  table(
    tf_compute_dwell_times(
      data => cursor(
        select
          user_id,
          movie_id,
          ts
        from
          netflix_audience_behavior
      ),
      min_dwell_points => 3,
      min_dwell_seconds => 600,
      max_inactive_seconds => 10800
    )
  )
order by
  num_dwell_points desc
limit
  10;

entity_id|site_id|prev_site_id|next_site_id|session_id|start_seq_id|ts|dwell_time_sec|num_dwell_points
59416738c3|cbdf9820bc|d058594d1c|863b39bbe8|2|19|2017-02-21 15:12:11.000000000|4391|54
16d994f6dd|1bae944666|4f1cf3c2dc|NULL|5|61|2017-11-11 20:27:02.000000000|9570|36
3675d9ba4a|948f2b5bf6|948f2b5bf6|69cb38018a|2|11|2018-11-26 18:42:52.000000000|3600|34
da01959c0b|fd711679f9|1f579d43c3|NULL|5|90|2019-03-21 05:37:22.000000000|7189|31
23c52f9b50|df00041e47|df00041e47|NULL|2|39|2019-01-21 15:53:33.000000000|1227|29
da01959c0b|8ab46a0cb1|f1fffa6ff4|1f579d43c3|3|29|2019-03-12 04:33:01.000000000|6026|29
23c52f9b50|df00041e47|NULL|df00041e47|1|10|2019-01-21 15:33:39.000000000|1194|28
da01959c0b|1f579d43c3|8ab46a0cb1|fd711679f9|4|63|2019-03-17 02:01:49.000000000|7240|27
3261cb81a5|1cb40406ae|NULL|NULL|1|2|2019-04-28 20:48:24.000000000|11240|27
dbed64ce9e|c5830185ca|NULL|NULL|1|3|2019-03-01 06:43:32.000000000|7261|25

tf_feature_similarity

Given a query input of entity keys, feature columns, and a metric column, and a second query input specifying a search vector of feature columns and metric, computes the similarity of each entity in the first input to the search vector based on their similarity. The score is computed as the cosine similarity of the feature column(s) for each entity with the feature column(s) for the search vector, which can optionally be TF/IDF weighted.

Input Arguments

Parameter

Description

Data Type

Output Columns

Name

Description

Data Types

Example

tf_raster_graph_shortest_slope_weighted_path

Aggregate point data into x/y bins of a given size in meters to form a dense spatial grid, computing the specified aggregate (using agg_type) across all points in each bin as the output value for the bin. A Gaussian average is then taken over the neighboring bins, with the number of bins specified by neighborhood_fill_radius, optionally only filling in null-valued bins if fill_only_nulls is set to true.

The graph shortest path is then computed between an origin point on the grid specified by origin_x and origin_y and a destination point on the grid specified by destination_x and destination_y, where the shortest path is weighted by the nth exponent of the computed slope between a bin and its neighbors, with the nth exponent being specified by slope_weighted_exponent. A max allowed traversable slope can be specified by slope_pct_max, such that no traversal is considered or allowed between bins with absolute computed slopes greater than the percentage specified by slope_pct_max.

Input Arguments

Parameter

Description

Data Types

Output Columns

Heavy Immerse

Introduction to Heavy Immerse

Heavy Immerse is a browser-based data visualization client that runs on top of the GPU-powered HeavyDB. It provides instantaneous representations of your data, from basic charts to rich and complex visualizations.

Immerse is installed with HEAVY.AI Enterprise Edition.

To create dashboards and data visualizations, click DASHBOARDS. You can search for dashboards, and list them by most recent or alphabetically.

Click DATA to import and manipulate data.

Click SQL EDITOR to perform Data Definition and Data Manipulation tasks on the command line.

When you navigate between the three utilities, you can:

Hold the command (ctrl) key as you click a link to open the utility in a new tab/window in the background.
Hold shift+command (ctrl) as you click a link to open the utility in a new tab/window in the foreground.
Hold no keys as you click a link to replace the contents of the current window.

HELP CENTER provides access to Immerse version information, tutorials, demos, and documentation. It also includes a link for sending email to HEAVY.AI.

Clicking the user icon at the far right opens a drop-down box where you can select a different database, change your UI theme, or log out of Immerse:

Admin Portal

The Admin Portal is a collection of dashboards available in the included information_schema database in Heavy Immerse. The dashboards display point-in-time information of the HEAVY.AI platform resources and users of the system.

Access to system dashboards is controlled using Immerse privileges; only users with Admin privileges or users/roles with access to the information_schema database can access the system dashboards.

The information_schema database and Admin Portal dashboards and system tables are installed when you install or upgrade to HEAVY.AI 6.0. For more detailed information on the tables available in the Admin Portal, see System Tables.

With the Admin Portal, you can see:

Database monitoring and database and web server logs.
Real-time data reporting for the system.
Point-in-time resource metrics and user engagement dashboards.

When you log in to the information_schema database, you see the Request Logs and Monitoring, System Resources, and User Roles and Permissions dashboards.

Request Logs and Monitoring

By default, the Request Logs and Monitoring dashboard does not appear in the Admin portal. To turn on the dashboard, set the enable-logs-system-tables parameter to TRUE in heavy.conf and restart the database.

The Request Logs and Monitoring dashboard includes the following charts on three tabs:

Number of Requests
Number of Fatals and Errors
Number of Unique Users
Avg Request Time (ms)
Max Request Time (ms)
Number of Requests per Dashboard
Number of Requests per API
Number of Requests per User

Database Server Logs - Sortable by log timestamp, severity level, message, file location, process ID, query ID, thread ID, and node.
Database Queries - Sortable by log timestamp, query string, execution time, and total time.

Web Server Logs - Sortable by log timestamp, severity, and message.
Web Server Access Logs - Sortable by log timestamp, endpoint, HTTP status, HTTP method, IP address, and response size.

System Resources Dashboard

The System Resources dashboard includes the following charts on three tabs:

Databases - Names of all available databases
# of Tables - Total number of tables
# of Dashboards - Total number of dashboards
# of Tables Per Database
# of Dashboards Per Database
Tables - Sortable name, column count, and owner information for all tables.
Dashboards - Sortable name, last update time, and owner information for all databases.

CPU Memory Utilization - Free, used, and unallocated
GPU Memory Utilization - Free, used, and unallocated
Tables with Highest CPU Memory Utilization
Tables with Highest GPU Memory Utilization
Columns with Highest CPU Memory Utilization
Columns with Highest GPU Memory Utilization

Tables with Highest Storage Utilization
Total Used Storage

User Roles and Permissions Dashboard

The User Roles and Permission Dashboard includes the following charts:

# of Users - Total number of users on the system
# of Roles - Total number of roles on the system
# of Table Owners - Total number of table owners on the system
# of Dashboard Owners - Total number of dashboard owners on the system
Users - Sortable list of users on the system
User-Role Assignments - Mapping of role names to user names, sortable by role or user
Roles - Sortable list of roles on the system
Databases - Sortable list of databases on the system
User Permissions - Mapping of user or role name, permission type, and database, sortable by any column.

Control Panel

The Control Panel gives super users visibility into roles and users of the current database, as well as feature flags, system table dashboards, and log files for the current HeavyDB instance.

To open the Control Panel, click the Account icon and then click Control Panel.

The Control Panel is considered beta functionality. Currently, you cannot add, delete, or edit roles or users in the Control Panel. Feature flags cannot be modified through the Control Panel.

To access the Control Panel, users need to have super user privileges, or the role “immerse_control_panel” assigned.

Feature Flags

To see which feature flags are currently set in Immerse, click Feature Flags under Customization.

Currently, feature flags can only be viewed in Immerse; they cannot be set or removed.

System Dashboard and Log Files

Links to the the following System Table dashboards are available on the Control Panel:

System Resources
Request Logs and Monitoring
User Roles and Permissions

Links to the following log files are are available on the Control Panel:

Access Logs (Web Server)
All Logs (Web Server)
Error Logs (HeavyDB)
Info Logs (HeavyDB)
Warning Logs (HeavyDB)

Geospatial Capabilities

HEAVY.AI supports a subset of object types and functions for storing and writing queries for geospatial definitions.

Geospatial Datatypes

Type

Size

Example

LINESTRING

Variable

A sequence of 2 or more points and the lines that connect them. For example: LINESTRING(0 0,1 1,1 2)

MULTIPOLYGON

Variable

A set of one or more polygons. For example:MULTIPOLYGON(((0 0,4 0,4 4,0 4,0 0),(1 1,2 1,2 2,1 2,1 1)), ((-1 -1,-1 -2,-2 -2,-2 -1,-1 -1)))

POINT

Variable

A point described by two coordinates. When the coordinates are longitude and latitude, HEAVY.AI stores longitude first, and then latitude. For example: POINT(0 0)

POLYGON

Variable

A set of one or more rings (closed line strings), with the first representing the shape (external ring) and the rest representing holes in that shape (internal rings). For example: POLYGON((0 0,4 0,4 4,0 4,0 0),(1 1, 2 1, 2 2, 1 2,1 1))

MULTIPOINT

Variable

A set of one or more points. For example: MULTIPOINT((0 0), (1 1), (2 2))

MULTILINESTRING

Variable

A set of one or more associated lines, each of two or more points. For example: MULTILINESTRING((0 0, 1 0, 2 0), (0 1, 1 1, 2 1))

For information about geospatial datatype sizes, see Storage and Compression in Datatypes.

For more information on WKT primitives, see Wikipedia: Well-known Text: Geometric objects.

HEAVY.AI supports SRID 4326 (WGS 84) and 900913 (Google Web Mercator), and 32601-32660,32701-32760 (Universal Transverse Mercator (UTM) Zones). When using geospatial fields, you set the SRID to determine which reference system to use. HEAVY.AI does not assign a default SRID.

CREATE TABLE simple_geo (
                          name TEXT ENCODING DICT(32), 
                          location GEOMETRY(POINT,4326)
                         );

If you do not set the SRID of the geo field in the table, you can set it in a SQL query using ST_SETSRID(column_name, SRID). For example, ST_SETSRID(a.pt,4326).

When representing longitude and latitude, the first coordinate is assumed to be longitude in HEAVY.AI geospatial primitives.

You create geospatial objects as geometries (planar spatial data types), which are supported by the planar geometry engine at run time. When you call ST_DISTANCE on two geometry objects, the engine returns the shortest straight-line planar distance, in degrees, between those points. For example, the following query returns the shortest distance between the point(s) in p1 and the polygon(s) in poly1:

SELECT ST_DISTANCE(p1, poly1) FROM geo1;

For information about importing data, see Importing Geospatial Data.

Geospatial Literals

Geospatial functions that expect geospatial object arguments accept geospatial columns, geospatial objects returned by other functions, or string literals containing WKT representations of geospatial objects. Supplying a WKT string is equivalent to calling a geometry constructor. For example, these two queries are identical:

SELECT COUNT(*) FROM geo1 WHERE ST_DISTANCE(p1, `POINT(1 2)`) < 1.0;
SELECT COUNT(*) FROM geo1 WHERE ST_DISTANCE(p1, ST_GeomFromText('POINT(1 2)')) < 1.0;

You can create geospatial literals with a specific SRID. For example:

SELECT ST_CONTAINS(
                     mpoly2, 
                     ST_GeomFromText('POINT(-71.064544 42.28787)', 4326)
                   )
                   FROM geo2;

Support for Geography

HEAVY.AI provides support for geography objects and geodesic distance calculations, with some limitations.

Exporting Coordinates from Immerse

HeavyDB supports import from any coordinate system supported by the Geospatial Data Abstraction Library (GDAL). On import, HeavyDB will convert to and store in WGS84 encoding, and rendering is accurate in Immerse.

However, no built-in way to reference the original coordinates currently exists in Immerse, and coordinates exported from Immerse will be WGS84 coordinates. You can work around this limitation by adding to the dataset a column or columns in non-geo format that could be included for display in Immerse (for example, in a popup) or on export.

Distance Calculation

Currently, HEAVY.AI supports spheroidal distance calculation between:

Two points using either SRID 4326 or 900913.
A point and a polygon/multipolygon using SRID 900913.

Using SRID 900913 results in variance compared to SRID 4326 as polygons approach the North and South Poles.

The following query returns the points and polygons within 1,000 meters of each other:

SELECT a.poly_name, b.pt_name FROM poly a, pt b 
WHERE ST_Distance(
   ST_Transform(b.heavyai_geo, 900913),
   ST_Transform(b.location, 900913))<1000;

See the tables in Geospatial Functions below for examples.

Geospatial Functions

HEAVY.AI supports the functions listed.

Geometry Constructors

Function

Description

ST_Centroid

Computes the geometric center of a geometry as a POINT.

ST_GeomFromText(WKT)

Return a specified geometry value from Well-known Text representation.

ST_GeomFromText(WKT, SRID)

Return a specified geometry value from Well-known Text representation and an SRID.

ST_GeogFromText(WKT)

Return a specified geography value from Well-known Text representation.

ST_GeogFromText(WKT, SRID)

Return a specified geography value from Well-known Text representation and an SRID.

ST_Point(double lon, double lat)

Return a point constructed on the fly from the provided coordinate values. Constant coordinates result in construction of a POINT literal.

Example: ST_Contains(poly4326, ST_SetSRID(ST_Point(lon, lat), 4326))

Geometry to String Conversion

Function

Description

ST_AsText(geom) | ST_AsWKT(geom)

Converts a geometry input to a Well-Known-Text (WKT) string

ST_AsBinary(geom) | ST_AsWKB(geom)

Converts a geometry input to a Well-Known-Binary (WKB) string

Geometry Processing

Function

Description

ST_Buffer

Returns a geometry covering all points within a specified distance from the input geometry. Performed by the GEOS module. The output is currently limited to the MULTIPOLYGON type.

Calculations are in the units of the input geometry’s SRID. Buffer distance is expressed in the same units. Example:

SELECT ST_Buffer('LINESTRING(0 0, 10 0, 10 10)', 1.0);

Special processing is automatically applied to WGS84 input geometries (SRID=4326) to limit buffer distortion:

Implementation first determines the best planar SRID to which to project the 4326 input geometry.
Preferred SRIDs are UTM and Lambert (LAEA) North/South zones, with Mercator used as a fallback.
Buffer distance is interpreted as distance in meters (units of all planar SRIDs being considered).
The input geometry is transformed to the best planar SRID and handed to GEOS, along with buffer distance.
The buffer geometry built by GEOS is then transformed back to SRID=4326 and returned.

Example: Build 10-meter buffer geometries (SRID=4326) with limited distortion:

SELECT ST_Buffer(poly4326, 10.0) FROM tbl;

ST_Centroid

Computes the geometric center of a geometry as a POINT.

FunctionDescription Special processing is automatically applied to WGS84 input geometries (SRID=4326) to limit buffer distortion:

Implementation first determines the best planar SRID to which to project the 4326 input geometry.
Preferred SRIDs are UTM and Lambert (LAEA) North/South zones, with Mercator used as a fallback.
Buffer distance is interpreted as distance in meters (units of all planar SRIDs being considered).
The input geometry is transformed to the best planar SRID and handed to GEOS, along with buffer distance.
The buffer geometry built by GEOS is then transformed back to SRID=4326 and returned.

Example: Build 10-meter buffer geometries (SRID=4326) with limited distortion:SELECT ST_Buffer(poly4326, 10.0) FROM tbl; .ST_CentroidComputes the geometric center of a geometry as a POINT.

Geometry Editors

Function

Description

ST_TRANSFORM

Returns a geometry with its coordinates transformed to a different spatial reference. Currently, WGS84 to Web Mercator transform is supported. For example:ST_DISTANCE( ST_TRANSFORM(ST_GeomFromText('POINT(-71.064544 42.28787)', 4326), 900913), ST_GeomFromText('POINT(-13189665.9329505 3960189.38265416)', 900913) )

ST_TRANSFORM is not currently supported in projections. It can be used only to transform geo inputs to other functions, such as ST_DISTANCE.

ST_SETSRID

Set the to a specific integer value. For example:

ST_TRANSFORM(

ST_SETSRID(ST_GeomFromText('POINT(-71.064544 42.28787)'), 4326), 900913 )

Geometry Accessors

Function

Description

ST_X

Returns the X value from a POINT column.

ST_Y

Returns the Y value from a POINT column.

ST_XMIN

Returns X minima of a geometry.

ST_XMAX

Returns X maxima of a geometry.

ST_YMIN

Returns Y minima of a geometry.

ST_YMAX

Returns Y maxima of a geometry.

ST_STARTPOINT

Returns the first point of a LINESTRING as a POINT.

ST_ENDPOINT

Returns the last point of a LINESTRING as a POINT.

ST_POINTN

Return the Nth point of a LINESTRING as a POINT.

ST_NPOINTS

Returns the number of points in a geometry.

ST_NRINGS

Returns the number of rings in a POLYGON or a MULTIPOLYGON.

ST_SRID

Returns the spatial reference identifier for the underlying object.

ST_NUMGEOMETRIES

Returns the MULTI count of MULTIPOINT, MULTILINESTRING or MULTIPOLYGON. Returns 1 for non-MULTI geometry.

Overlay Functions

Function

Description

ST_INTERSECTION

Returns a geometry representing an intersection of two geometries; that is, the section that is shared between the two input geometries. Performed by the GEOS module.

The output is currently limited to MULTIPOLYGON type, because HEAVY.AI does not support mixed geometry types within a geometry column, and ST_INTERSECTION can potentially return points, lines, and polygons from a single intersection operation. Lower-dimension intersecting features such as points and line strings are returned as very small buffers around those features. If needed, true points can be recovered by applying the ST_CENTROID method to point intersection results. In addition, ST_PERIMETER/2 of resulting line intersection polygons can be used to approximate line length. Empty/NULL geometry outputs are not currently supported.

Examples: SELECT ST_Intersection('POLYGON((0 0,3 0,3 3,0 3))', 'POLYGON((1 1,4 1,4 4,1 4))'); SELECT ST_Area(ST_Intersection(poly, 'POLYGON((1 1,3 1,3 3,1 3,1 1))')) FROM tbl;

ST_DIFFERENCE

Returns a geometry representing the portion of the first input geometry that does not intersect with the second input geometry. Performed by the GEOS module. Input order is important; the return geometry is always a section of the first input geometry.

The output is currently limited to MULTIPOLYGON type, for the same reasons described in ST_INTERSECTION. Similar post-processing methods can be applied if needed. Empty/NULL geometry outputs are not currently supported.

Examples: SELECT ST_Difference('POLYGON((0 0,3 0,3 3,0 3))', 'POLYGON((1 1,4 1,4 4,1 4))'); SELECT ST_Area(ST_Difference(poly, 'POLYGON((1 1,3 1,3 3,1 3,1 1))')) FROM tbl;

ST_UNION

Returns a geometry representing the union (or combination) of the two input geometries. Performed by the GEOS module.

The output is currently limited to MULTIPOLYGON type for the same reasons described in ST_INTERSECTION. Similar post-processing methods can be applied if needed. Empty/NULL geometry outputs are not currently supported.

Examples: SELECT ST_UNION('POLYGON((0 0,3 0,3 3,0 3))', 'POLYGON((1 1,4 1,4 4,1 4))'); SELECT ST_AREA(ST_UNION(poly, 'POLYGON((1 1,3 1,3 3,1 3,1 1))')) FROM tbl;

Spatial Relationships and Measurements

Function

Description

ST_DISTANCE

Returns shortest planar distance between geometries. For example: ST_DISTANCE(poly1, ST_GeomFromText('POINT(0 0)')) Returns shortest geodesic distance between two points, in meters, if given two point geographies. Point geographies can be specified through casts from point geometries or as literals. For example: ST_DISTANCE( CastToGeography(p2), ST_GeogFromText('POINT(2.5559 49.0083)', 4326) )

SELECT a.name, ST_DISTANCE( CAST(a.pt AS GEOGRAPHY), CAST(b.pt AS GEOGRAPHY) ) AS dist_meters FROM starting_point a, destination_points b;

You can also calculate the distance between a POLYGON and a POINT. If both fields use SRID 4326, then the calculated distance is in 4326 units (degrees). If both fields use SRID 4326, and both are transformed into 900913, then the results are in 900913 units (meters).

The following SQL code returns the names of polygons where the distance between the point and polygon is less than 1,000 meters.

SELECT a.poly_name FROM poly a, point b WHERE ST_DISTANCE( ST_TRANSFORM(b.location,900913), ST_TRANSFORM(a.heavyai_geo,900913) ) < 1000;

ST_EQUALS

Returns TRUE if the first input geometry and the second input geometry are spatially equal; that is, they occupy the same space. Different orderings of points can be accepted as equal if they represent the same geometry structure.

POINTs comparison is performed natively. All other geometry comparisons are performed by GEOS.

If input geometries are both uncompressed or compressed, all comparisons to identify equality are precise. For mixed combinations, the comparisons are performed with a compression-specific tolerance that allows recognition of equality despite subtle precision losses that the compression may introduce. Note: Geo columns and literals with SRID=4326 are compressed by default.

Examples: SELECT COUNT(*) FROM tbl WHERE ST_EQUALS('POINT(2 2)', pt); SELECT ST_EQUALS('POLYGON ((0 0,1 0,0 1))', 'POLYGON ((0 0,0 0.5,0 1,1 0,0 0))');

ST_MAXDISTANCE

Returns longest planar distance between geometries. In effect, this is the diameter of a circle that encloses both geometries.For example:

Currently supported variants:

ST_CONTAINS

Returns true if the first geometry object contains the second object. For example:

You can also use ST_CONTAINS to:

Return the count of polys that contain the point (here as WKT): SELECT count(*) FROM geo1 WHERE ST_CONTAINS(poly1, 'POINT(0 0)');
Return names from a polys table that contain points in a points table: SELECT a.name FROM polys a, points b WHERE ST_CONTAINS(a.heavyai_geo, b.location);
Return names from a polys table that contain points in a points table, using a single point in WKT instead of a field in another table: SELECT name FROM poly WHERE ST_CONTAINS( heavyai_geo, ST_GeomFromText('POINT(-98.4886935 29.4260508)', 4326) );

ST_INTERSECTS

Returns true if two geometries intersect spatially, false if they do not share space. For example:

SELECT ST_INTERSECTS( 'POLYGON((0 0, 2 0, 2 2, 0 2, 0 0))', 'POINT(1 1)' ) FROM tbl;

ST_AREA

Returns the area of planar areas covered by POLYGON and MULTIPOLYGON geometries. For example:

SELECT ST_AREA( 'POLYGON((1 0, 0 1, -1 0, 0 -1, 1 0),(0.1 0, 0 0.1, -0.1 0, 0 -0.1, 0.1 0))' ) FROM tbl;

ST_AREA does not support calculation of geographic areas, but rather uses planar coordinates. Geographies must first be projected in order to use ST_AREA. You can do this ahead of time before import or at runtime, ideally using an equal area projection (for example, a national equal-area Lambert projection). The area is calculated in the projection's units. For example, you might use Web Mercator runtime projection to get the area of a polygon in square meters:

ST_AREA( ST_TRANSFORM( ST_GeomFromText( 'POLYGON((-76.6168198439371 39.9703199555959, -80.5189990254673 40.6493554919257, -82.5189990254673 42.6493554919257, -76.6168198439371 39.9703199555959) )', 4326 ), 900913) )

Web Mercator is not an equal area projection, however. Unless compensated by a scaling factor, Web Mercator areas can vary considerably by latitude.

ST_PERIMETER

Returns the cartesian perimeter of POLYGON and MULTIPOLYGON geometries. For example: SELECT ST_PERIMETER('POLYGON( (1 0, 0 1, -1 0, 0 -1, 1 0), (0.1 0, 0 0.1, -0.1 0, 0 -0.1, 0.1 0) )' ) from tbl; It will also return the geodesic perimeter of POLYGON and MULTIPOLYGON geometries. For example:

SELECT ST_PERIMETER( ST_GeogFromText( 'POLYGON( (-76.6168198439371 39.9703199555959, -80.5189990254673 40.6493554919257, -82.5189990254673 42.6493554919257, -76.6168198439371 39.9703199555959) )', 4326) ) from tbl;

ST_LENGTH

Returns the cartesian length of LINESTRING geometries. For example: SELECT ST_LENGTH('LINESTRING(1 0, 0 1, -1 0, 0 -1, 1 0)') FROM tbl; It also returns the geodesic length of LINESTRING geographies. For example:

SELECT ST_LENGTH( ST_GeogFromText('LINESTRING( -76.6168198439371 39.9703199555959, -80.5189990254673 40.6493554919257, -82.5189990254673 42.6493554919257)', 4326) ) FROM tbl;

ST_WITHIN

Returns true if geometry A is completely within geometry B. For example the following SELECT statement returns true:

SELECT ST_WITHIN( 'POLYGON ((1 1, 1 2, 2 2, 2 1))', 'POLYGON ((0 0, 0 3, 3 3, 3 0))' ) FROM tbl;

ST_DWITHIN

Returns true if the geometries are within the specified distance of each one another. Distance is specified in units defined by the spatial reference system of the geometries. For example: SELECT ST_DWITHIN( 'POINT(1 1)', 'LINESTRING (1 2,10 10,3 3)', 2.0 ) FROM tbl; ST_DWITHIN supports geodesic distances between geographies, currently limited to geographic points. For example, you can check whether Los Angeles and Paris, specified as WGS84 geographic point literals, are within 10,000km of one another.

SELECT ST_DWITHIN(

ST_GeogFromText( 'POINT(-118.4079 33.9434)', 4326), ST_GeogFromText('POINT(2.5559 49.0083)', 4326 ), 10000000.0) FROM tbl;

ST_DFULLYWITHIN

Returns true if the geometries are fully within the specified distance of one another. Distance is specified in units defined by the spatial reference system of the geometries. For example: SELECT ST_DFULLYWITHIN( 'POINT(1 1)', 'LINESTRING (1 2,10 10,3 3)', 10.0) FROM tbl; This function supports:

ST_DFULLYWITHIN(POINT, LINESTRING, distance) ST_DFULLYWITHIN(LINESTRING, POINT, distance)

ST_DISJOINT

Returns true if the geometries are spatially disjoint (that is, the geometries do not overlap or touch. For example:

SELECT ST_DISJOINT( 'POINT(1 1)', 'LINESTRING (0 0,3 3)' ) FROM tbl;

Additional Geo Notes

You can use SQL code similar to the examples in this topic as global filters in Immerse.
CREATE TABLE AS SELECT is not currently supported for geo data types in distributed mode.
GROUP BY is not supported for geo types (POINT, MULTIPOINT, LINESTRING, MULTILINESTRING, POLYGON, or MULTIPOLYGON.

You can use \d table_name to determine if the SRID is set for the geo field:

heavysql> \d starting_point
CREATE TABLE starting_point (
                               name TEXT ENCODING DICT(32),
                               myPoint GEOMETRY(POINT, 4326) ENCODING COMPRESSED(32)
                             )

If no SRID is returned, you can set the SRID using ST_SETSRID(column_name, SRID). For example, ST_SETSRID(myPoint, 4326).

Tables

DDL - Tables

These functions are used to create and modify data tables in HEAVY.AI.

Nomenclature Constraints

Table names must use the NAME format, described in regex notation as:

[A-Za-z_][A-Za-z0-9\$_]*

Table and column names can include quotes, spaces, and the underscore character. Other special characters are permitted if the name of the table or column is enclosed in double quotes (" ").

Spaces and special characters other than underscore (_) cannot be used in Heavy Immerse.
Column and table names enclosed in double quotes cannot be used in Heavy Immerse

CREATE TABLE

CREATE [TEMPORARY] TABLE [IF NOT EXISTS] <table>
  (<column> <type> [NOT NULL] [DEFAULT <value>] [ENCODING <encodingSpec>],
  [SHARD KEY (<column>)],
  [SHARED DICTIONARY (<column>) REFERENCES <table>(<column>)], ...)
  [WITH (<property> = value, ...)];

Create a table named <table> specifying <columns> and table properties.

Supported Datatypes

Datatype

Size (bytes)

Notes

BIGINT

Minimum value: -9,223,372,036,854,775,807; maximum value: 9,223,372,036,854,775,807.

BOOLEAN

TRUE: 'true', '1', 't'. FALSE: 'false', '0', 'f'. Text values are not case-sensitive.

DATE*

Same as DATE ENCODING DAYS(32).

DATE ENCODING DAYS(32)

Range in years: +/-5,883,517 around epoch. Maximum date January 1, 5885487 (approximately). Minimum value: -2,147,483,648; maximum value: 2,147,483,647. Supported formats when using COPY FROM: mm/dd/yyyy, dd-mmm-yy, yyyy-mm-dd, dd/mmm/yyyy.

DATE ENCODING DAYS(16)

Range in days: -32,768 - 32,767 Range in years: +/-90 around epoch, April 14, 1880 - September 9, 2059. Minumum value: -2,831,155,200; maximum value: 2,831,068,800. Supported formats when using COPY FROM: mm/dd/yyyy, dd-mmm-yy, yyyy-mm-dd, dd/mmm/yyyy.

DATE ENCODING FIXED(32)

In DDL statements defaults to DATE ENCODING DAYS(16). Deprecated.

DATE ENCODING FIXED(16)

In DDL statements defaults to DATE ENCODING DAYS(16). Deprecated.

DECIMAL

2, 4, or 8

Takes precision and scale parameters: DECIMAL(precision,scale).

Size depends on precision:

Up to 4: 2 bytes
5 to 9: 4 bytes
10 to 18 (maximum): 8 bytes

Scale must be less than precision.

DOUBLE

Variable precision. Minimum value: -1.79 x e^308; maximum value: 1.79 x e^308.

FLOAT

Variable precision. Minimum value: -3.4 x e^38; maximum value: 3.4 x e^38.

INTEGER

Minimum value: -2,147,483,647; maximum value: 2,147,483,647.

SMALLINT

Minimum value: -32,767; maximum value: 32,767.

TEXT ENCODING DICT

Max cardinality 2 billion distinct string values

TEXT ENCODING NONE

Variable

Size of the string + 6 bytes

TIME

Minimum value: 00:00:00; maximum value: 23:59:59.

TIMESTAMP

Linux timestamp from -30610224000 (1/1/1000 00:00:00.000) through 29379542399 (12/31/2900 23:59:59.999).

Can also be inserted and stored in human-readable format:

YYYY-MM-DD HH:MM:SS
YYYY-MM-DDTHH:MM:SS (The T is dropped when the field is populated.)

TINYINT

Minimum value: -127; maximum value: 127.

* In HEAVY.AI release 4.4.0 and higher, you can use existing 8-byte DATE columns, but you can create only 4-byte DATE columns (default) and 2-byte DATE columns (see DATE ENCODING FIXED(16)).

For more information, see Datatypes and Fixed Encoding.

For geospatial datatypes, see Geospatial Primitives.

Examples

Create a table named tweets and specify the columns, including type, in the table.

CREATE TABLE IF NOT EXISTS tweets (
   tweet_id BIGINT NOT NULL,
   tweet_time TIMESTAMP NOT NULL ENCODING FIXED(32),
   lat FLOAT,
   lon FLOAT,
   sender_id BIGINT NOT NULL,
   sender_name TEXT NOT NULL ENCODING DICT,
   location TEXT ENCODING  DICT,
   source TEXT ENCODING DICT,
   reply_to_user_id BIGINT,
   reply_to_tweet_id BIGINT,
   lang TEXT ENCODING  DICT,
   followers INT,
   followees INT,
   tweet_count INT,
   join_time TIMESTAMP ENCODING  FIXED(32),
   tweet_text TEXT,
   state TEXT ENCODING  DICT,
   county TEXT ENCODING DICT,
   place_name TEXT,
   state_abbr TEXT ENCODING DICT,
   county_state TEXT ENCODING DICT,
   origin TEXT ENCODING DICT,
   phone_numbers bigint);

Create a table named delta and assign a default value San Francisco to column city.

CREATE TABLE delta (
   id INTEGER NOT NULL, 
   name TEXT NOT NULL, 
   city TEXT NOT NULL DEFAULT 'San Francisco' ENCODING DICT(16));

Default values currently have the following limitations:

Only literals can be used for column DEFAULT values; expressions are not supported.
You cannot define a DEFAULT value for a shard key. For example, the following does not parse: CREATE TABLE tbl (id INTEGER NOT NULL DEFAULT 0, name TEXT, shard key (id)) with (shard_count = 2);
For arrays, use the following syntax: ARRAY[A, B, C, …. N]
The syntax {A, B, C, ... N} is not supported.
Some literals, like NUMERIC and GEO types, are not checked at parse time. As a result, you can define and create a table with malformed literal as a default value, but when you try to insert a row with a default value, it will throw an error.

Supported Encoding

Encoding

Descriptions

DICT

Dictionary encoding on string columns (default for TEXT columns). Limit of 2 billion unique string values.

FIXED (bits)

Fixed length encoding of integer or timestamp columns. See .

NONE

No encoding. Valid only on TEXT columns. No Dictionary is created. Aggregate operations are not possible on this column type.

WITH Clause Properties

Property

Description

fragment_size

Number of rows per fragment that is a unit of the table for query processing. Default: 32 million rows, which is not expected to be changed.

max_rollback_epochs

Limit the number of epochs a table can be rolled back to. Limiting the number of epochs helps to limit the amount of on-disk data and prevent unmanaged data growth.

Limiting the number of rollback epochs also can increase system startup speed, especially for systems on which data is added in small batches or singleton inserts. Default: 3.

The following example creates the table test_table and sets the maximum epoch rollback number to 50:

CREATE TABLE test_table(a int) WITH (MAX_ROLLBACK_EPOCHS = 50);

max_rows

Used primarily for streaming datasets to limit the number of rows in a table, to avoid running out of memory or impeding performance. When the max_rows limit is reached, the oldest fragment is removed. When populating a table from a file, make sure that your row count is below the max_rows setting. If you attempt load more rows at one time than the max_rows setting defines, the records up to the max_rows limit are removed, leaving only the additional rows. Default: 2^62. In a distributed system, the maximum number of rows is calculated as max_rows * leaf_count. In a sharded distributed system, the maximum number of rows is calculated as max_rows * shard_count.

page_size

Number of I/O page bytes. Default: 1MB, which does not need to be changed.

partitions

Partition strategy option:

SHARDED: Partition table using sharding.
REPLICATED: Partition table using replication.

shard_count

Number of shards to create, typically equal to the number of GPUs across which the data table is distributed.

sort_column

Name of the column on which to sort during bulk import.

Sharding

Sharding partitions a database table across multiple servers so each server has a part of the table with the same columns but with different rows. Partitioning is based on a sharding key defined when you create the table.

Without sharding, the dimension tables involved in a join are replicated and sent to each GPU, which is not feasible for dimension tables with many rows. Specifying a shard key makes it possible for the query to execute efficiently on large dimension tables.

Currently, specifying a shard key is useful for joins, only:

If two tables specify a shard key with the same type and the same number of shards, a join on that key only sends a part of the dimension table column data to each GPU.
For multi-node installs, the dimension table does not need to be replicated and the join executes locally on each leaf.

Constraints

A shard key must specify a single column to shard on. There is no support for sharding by a combination of keys.
One shard key can be specified for a table.
Data are partitioned according to the shard key and the number of shards (shard_count).
A value in the column specified as a shard key is always sent to the same partition.
The number of shards should be equal to the number of GPUs in the cluster.
Sharding is allowed on the following column types:
- DATE
- INT
- TEXT ENCODING DICT
- TIME
- TIMESTAMP
Tables must share the dictionary for the column to be involved in sharded joins. If the dictionary is not specified as shared, the join does not take advantage of sharding. Dictionaries are reference-counted and only dropped when the last reference drops.

Recommendations

Set shard_count to the number of GPUs you eventually want to distribute the data table across.
Referenced tables must also be shard_count -aligned.
Sharding should be minimized because it can introduce load skew accross resources, compared to when sharding is not used.

Examples

Basic sharding:

CREATE TABLE  customers(
   accountId text,
   name text,
   SHARD KEY (accountId))
  WITH (shard_count = 4);

Sharding with shared dictionary:

CREATE TABLE transactions(
   accountId text,
   action text,
   SHARD KEY (accountId),
   SHARED DICTIONARY (accountId) REFERENCES customers(accountId))
  WITH (shard_count = 4);

Temporary Tables

Using the TEMPORARY argument creates a table that persists only while the server is live. They are useful for storing intermediate result sets that you access more than once.

Adding or dropping a column from a temporary table is not supported.

Example

CREATE TEMPORARY TABLE customers(
   accountId TEXT,
   name TEXT,
   timeCreated TIMESTAMP)

CREATE TABLE AS SELECT

CREATE TABLE [IF NOT EXISTS] <newTableName> AS (<SELECT statement>) [WITH (<property> = value, ...)];

Create a table with the specified columns, copying any data that meet SELECT statement criteria.

WITH Clause Properties

Property

Description

fragment_size

Number of rows per fragment that is a unit of the table for query processing. Default = 32 million rows, which is not expected to be changed.

max_chunk_size

Size of chunk that is a unit of the table for query processing. Default: 1073741824 bytes (1 GB), which is not expected to be changed.

max_rows

Used primarily for streaming datasets to limit the number of rows in a table. When the max_rows limit is reached, the oldest fragment is removed. When populating a table from a file, make sure that your row count is below the max_rows setting. If you attempt load more rows at one time than the max_rows setting defines, the records up to the max_rows limit are removed, leaving only the additional rows. Default = 2^62.

page_size

Number of I/O page bytes. Default = 1MB, which does not need to be changed.

partitions

Partition strategy option:

SHARDED: Partition table using sharding.
REPLICATED: Partition table using replication.

use_shared_dictionaries

Controls whether the created table creates its own dictionaries for text columns, or instead shares the dictionaries of its source table. Uses shared dictionaries by default (true), which increases the speed of table creation.

Setting to false shrinks the dictionaries if SELECT for the created table has a narrow filter; for example: CREATE TABLE new_table AS SELECT * FROM old_table WITH (USE_SHARED_DICTIONARIES='false');

vacuum

Formats the table to more efficiently handle DELETE requests. The only parameter available is delayed. Rather than immediately remove deleted rows, vacuum marks items to be deleted, and they are removed at an optimal time.

Examples

Create the table newTable. Populate the table with all information from the table oldTable, effectively creating a duplicate of the original table.

CREATE TABLE newTable AS (SELECT * FROM oldTable);

Create a table named trousers. Populate it with data from the columns name, waist, and inseam from the table wardrobe.

CREATE TABLE trousers AS (SELECT name, waist, inseam FROM wardrobe);

Create a table named cosmos. Populate it with data from the columns star and planet from the table universe where planet has the class M.

CREATE TABLE IF NOT EXISTS cosmos AS (SELECT star, planet FROM universe WHERE class='M');

ALTER TABLE

ALTER TABLE <table> RENAME TO <table>;
ALTER TABLE <table> RENAME COLUMN <column> TO <column>;
ALTER TABLE <table> ADD [COLUMN] <column> <type> [NOT NULL] [ENCODING <encodingSpec>];
ALTER TABLE <table> ADD (<column> <type> [NOT NULL] [ENCODING <encodingSpec>], ...);
ALTER TABLE <table> ADD (<column> <type> DEFAULT <value>);
ALTER TABLE <table> DROP COLUMN <column_1>[, <column_2>, ...];
ALTER TABLE <table> SET MAX_ROLLBACK_EPOCHS=<value>;
ALTER TABLE <table> ALTER COLUMN <column> TYPE <type>, ALTER COLUMN <column> TYPE <type>, ...;

Examples

Rename the table tweets to retweets.

ALTER TABLE tweets RENAME TO retweets;

Rename the column source to device in the table retweets.

ALTER TABLE retweets RENAME COLUMN source TO device;

Add the column pt_dropoff to table tweets with a default value point(0,0).

ALTER TABLE tweets ADD COLUMN pt_dropoff POINT DEFAULT 'point(0 0)';

Add multiple columns a, b, and c to table table_one with a default value of 15 for column b.

ALTER TABLE table_one ADD a INTEGER, b INTEGER NOT NULL DEFAULT 15, c TEXT;

Default values currently have the following limitations:

Only literals can be used for column DEFAULT values; expressions are not supported.
For arrays, use the following syntax: ARRAY[A, B, C, …. N]. The syntax {A, B, C, ... N} is not supported.
Some literals, like NUMERIC and GEO types, are not checked at parse time. As a result, you can define and create a table with a malformed literal as a default value, but when you try to insert a row with a default value, it throws an error.

Add the column lang to the table tweets using a TEXT ENCODING DICTIONARY.

ALTER TABLE tweets ADD COLUMN lang TEXT ENCODING DICT;

Add the columns lang and encode to the table tweets using a TEXT ENCODING DICTIONARY for each.

ALTER TABLE tweets ADD (lang TEXT ENCODING DICT, encode TEXT ENCODING DICT);

Drop the column pt_dropoff from table tweets.

ALTER TABLE tweets DROP COLUMN pt_dropoff;

Limit on-disk data growth by setting the number of allowed epoch rollbacks to 50:

ALTER TABLE test_table SET MAX_ROLLBACK_EPOCHS=50;

You cannot add a dictionary-encoded string column with a shared dictionary when using ALTER TABLE ADD COLUMN.
Currently, HEAVY.AI does not support adding a geo column type (POINT, LINESTRING, POLYGON, or MULTIPOLYGON) to a table.
HEAVY.AI supports ALTER TABLE RENAME TABLE and ALTER TABLE RENAME COLUMN for temporary tables. HEAVY.AI does not support ALTER TABLE ADD COLUMN to modify a temporary table.

Change a text column “id” to an integer column:

ALTER TABLE my_table ALTER COLUMN id TYPE INTEGER;

Change text columns “id” and “location” to big integer and point columns respectively:

ALTER TABLE my_table ALTER COLUMN id TYPE BIGINT, ALTER COLUMN location TYPE GEOMETRY(POINT, 4326);

Currently, only text column types (dictionary encoded and none encoded text columns) can be altered.

DROP TABLE

DROP TABLE [IF EXISTS] <table>;

Deletes the table structure, all data from the table, and any dictionary content unless it is a shared dictionary. (See the Note regarding disk space reclamation.)

Example

DROP TABLE IF EXISTS tweets;

DUMP TABLE

DUMP TABLE <table> TO '<filepath>' [WITH (COMPRESSION='<compression_program>')];

Archives data and dictionary files of the table <table> to file <filepath>.

Valid values for <compression_program> include:

gzip (default)
pigz
lz4
none

If you do not choose a compression option, the system uses gzip if it is available. If gzip is not installed, the file is not compressed.

The file path must be enclosed in single quotes.

Dumping a table locks writes to that table. Concurrent reads are supported, but you cannot import to a table that is being dumped.
The DUMP command is not supported on distributed configurations.
You must have a least GRANT CREATE ON DATABASE privilege level to use the DUMP command.

Example

DUMP TABLE tweets TO '/opt/archive/tweetsBackup.gz' WITH (COMPRESSION='gzip');

RENAME TABLE

RENAME TABLE <table> TO <table>[, <table> TO <table>, <table> TO <table>...];

Rename a table or multiple tables at once.

Examples

Rename a single table:

RENAME TABLE table_A TO table_B;

Swap table names:

RENAME TABLE table_A TO table_B, table_B TO table_A;

RENAME TABLE table_A TO table_B, table_B TO table_C, table_C TO table_A;

Swap table names multiple times:

RENAME TABLE table_A TO table_A_stale, table_B TO table_B_stale, table_A_new TO table_A, table_B_new TO table_B;

RESTORE TABLE

RESTORE TABLE <table> FROM '<filepath>' [WITH (COMPRESSION='<compression_program>')];

Restores data and dictionary files of table <table> from the file at <filepath>. If you specified a compression program when you used the DUMP TABLE command, you must specify the same compression method during RESTORE.

Restoring a table decompresses and then reimports the table. You must have enough disk space for both the new table and the archived table, as well as enough scratch space to decompress the archive and reimport it.

The file path must be enclosed in single quotes.

You can also restore a table from archives stored in S3-compatible endpoints:

RESTORE TABLE <table> FROM '<S3_file_URL>' 
  WITH (compression = '<compression_program>', 
        s3_region = '<region>', 
        s3_access_key = '<access_key>', 
        s3_secret_key = '<secret_key>', 
        s3_session_token = '<session_token>');

s3_region is required. All features discussed in the S3 import documentation, such as custom S3 endpoints and server privileges, are supported.

Restoring a table locks writes to that table. Concurrent reads are supported, but you cannot import to a table that is being restored.
The RESTORE command is not supported on distributed configurations.
You must have a least GRANT CREATE ON DATABASE privilege level to use the RESTORE command.

Do not attempt to use RESTORE TABLE with a table dump created using a release of HEAVY.AI that is higher than the release running on the server where you will restore the table.

Examples

Restore table tweets from /opt/archive/tweetsBackup.gz:

RESTORE TABLE tweets FROM '/opt/archive/tweetsBackup.gz' 
   WITH (COMPRESSION='gzip');

Restore table tweets from a public S3 file or using server privileges (with the allow-s3-server-privileges server flag enabled):

RESTORE TABLE tweets FROM 's3://my-s3-bucket/archive/tweetsBackup.gz'
   WITH (compression = 'gzip', 
      s3_region = 'us-east-1');

Restore table tweets from a private S3 file using AWS access keys:

RESTORE TABLE tweets FROM 's3://my-s3-bucket/archive/tweetsBackup.gz' 2 
   WITH (compression = 'gzip', 
      s3_region = 'us-east-1', 
      s3_access_key = 'xxxxxxxxxx', s3_secret_key = 'yyyyyyyyy');

Restore table tweets from a private S3 file using temporary AWS access keys/session token:

RESTORE TABLE tweets FROM 's3://my-s3-bucket/archive/tweetsBackup.gz' 
   WITH (compression = 'gzip', 
      s3_region = 'us-east-1', 
      s3_access_key = 'xxxxxxxxxx', s3_secret_key = 'yyyyyyyyy',
      s3_session_token = 'zzzzzzzz');

Restore table tweets from an S3-compatible endpoint:

RESTORE TABLE tweets FROM 's3://my-gcp-bucket/archive/tweetsBackup.gz' 2 
   WITH (compression = 'gzip', 
      s3_region = 'us-east-1', 
      s3_endpoint = 'storage.googleapis.com');

TRUNCATE TABLE

TRUNCATE TABLE <table>;

Use the TRUNCATE TABLE statement to remove all rows from a table without deleting the table structure.

This releases table on-disk and memory storage and removes dictionary content unless it is a shared dictionary. (See the note regarding disk space reclamation.)

Removing rows is more efficient than using DROP TABLE. Dropping followed by recreating the table invalidates dependent objects of the table requiring you to regrant object privileges. Truncating has none of these effects.

Example

TRUNCATE TABLE tweets;

When you DROP or TRUNCATE, the command returns almost immediately. The directories to be purged are marked with the suffix \_DELETE_ME_. The files are automatically removed asynchronously.

In practical terms, this means that you will not see a reduction in disk usage until the automatic task runs, which might not start for up to five minutes.

You might also see directory names appended with \_DELETE_ME_. You can ignore these, with the expectation that they will be deleted automatically over time.

OPTIMIZE TABLE

OPTIMIZE TABLE [<table>] [WITH (VACUUM='true')]

Use this statement to remove rows from storage that have been marked as deleted via DELETE statements.

When run without the vacuum option, the column-level metadata is recomputed for each column in the specified table. HeavyDB makes heavy use of metadata to optimize query plans, so optimizing table metadata can increase query performance after metadata widening operations such as updates or deletes. If the configuration parameter enable-auto-metadata-update is not set, HeavyDB does not narrow metadata during an update or delete — metadata is only widened to cover a new range.

When run with the vacuum option, it removes any rows marked "deleted" from the data stored on disk. Vacuum is a checkpointing operation, so new copies of any vacuum records are deleted. Using OPTIMIZE with the VACUUM option compacts pages and deletes unused data files that have not been repopulated.

Beginning with Release 5.6.0, OPTIMIZE should be used infrequently, because UPDATE, DELETE, and IMPORT queries manage space more effectively.

VALIDATE

VALIDATE

Performs checks for negative and inconsistent epochs across table shards for single-node configurations.

If VALIDATE detects epoch-related issues, it returns a report similar to the following:

heavysql> validate;
Result

Negative epoch value found for table "my_table". Epoch: -1.
Epoch values for table "my_table_2" are inconsistent:
Table Id  Epoch     
========= ========= 
4         1         
5         2

If no issues are detected, it reports as follows:

Instance OK

VALIDATE CLUSTER

VALIDATE CLUSTER [WITH (REPAIR_TYPE = ['NONE' | 'REMOVE'])];

Perform checks and report discovered issues on a running HEAVY.AI cluster. Compare metadata between the aggregator and leaves to verify that the logical components between the processes are identical.

VALIDATE CLUSTER also detects and reports issues related to table epochs. It reports when epochs are negative or when table epochs across leaf nodes or shards are inconsistent.

Examples

If VALIDATE CLUSTER detects issues, it returns a report similar to the following:

mapd@thing3 ~]$ /mnt/gluster/dist_mapd/mapd-sw2/bin/mapdql -p HyperInteractive
User admin connected to database heavyai
heavysql> validate cluster;
Result
 Node          Table Count 
 ===========   =========== 
 Aggregator     1116
 Leaf 0         1114
 Leaf 1         1114
No matching table on Leaf 0 for Table cities_dtl_POINTS table id 56
No matching table on Leaf 1 for Table cities_dtl_POINTS table id 56
No matching table on Leaf 0 for Table cities_dtl table id 80
No matching table on Leaf 1 for Table cities_dtl table id 80
Table details don't match on Leaf 0 for Table view_geo table id 95
Table details don't match on Leaf 1 for Table view_geo table id 95

If no issues are detected, it will report as follows:

Cluster OK

You can include the WITH(REPAIR_TYPE) argument. (REPAIR_TYPE='NONE') is the same as running the command with no argument. (REPAIR_TYPE='REMOVE') removes any leaf objects that have issues. For example:

VALIDATE CLUSTER WITH (REPAIR_TYPE = 'REMOVE');

Epoch Issue Example

This example output from the VALIDATE CLUSTER command on a distributed setup shows epoch-related issues:

heavysql> validate cluster;
Result

Negative epoch value found for table "my_table". Epoch: -16777216.
Epoch values for table "my_table_2" are inconsistent:
Node      Table Id  Epoch     
========= ========= ========= 
Leaf 0    4         1         
Leaf 1    4         2

Release Notes

Release notes for currently supported releases

Use of HEAVY.AI is subject to the terms of the HEAVY.AI End User License Agreement (EULA).

The latest release of HEAVY.AI is 7.2.4.

Currently Supported Releases

7.2.3 | 7.2.2 | 7.2.1 | 7.2.0 | 7.1.2 | 7.1.1 | 7.1.0 | 7.0.2 | 7.0.1 | 7.0.0 | 6.4.4 | 6.4.3 | 6.4.2 | 6.4.1 | 6.4.0 | 6.2.7 | 6.2.5 | 6.2.4 | 6.2.1 | 6.2.0

For release notes for releases that are no longer supported, as well as links to documentation for those releases, see Archived Release Notes.

As with any software upgrade, it is important to back up your data before you upgrade HEAVY.AI. In addition, we recommend testing new releases before deploying in a production environment.

For assistance during the upgrade process, contact HEAVY.AI Support by logging a request through the HEAVY.AI Support Portal.

Release 7.x.x - Important Information

IMPORTANT - In HeavyDB Release 7.x.x, the “render groups” mechanism, part of the previous implementation of polygon rendering, has been removed. When you upgrade to HeavyDB Release 7.x.x, all existing tables that have a POLYGON or MULTIPOLYGON geo column are automatically migrated to remove a hidden column containing "render groups" metadata.

This operation is performed on all tables in all catalogs at first startup, and the results are recorded in the INFO log.

Once a table has been migrated in this manner, it is not backwards-compatible with earlier versions of HeavyDB. If you revert to an earlier version, the table may appear to have missing columns and behavior will be undefined. Attempting to query or render the POLYGON or MULTIPOLYGON data with the earlier version may fail or cause a server crash.

As always, HEAVY.AI strongly recommends that all databases be backed up, or at the very least, dumps are made of tables with POLYGON or MULTIPOLYGON columns using the existing HeavyDB version, before upgrading to HeavyDB Release 7.x.x.

Dumps of POLYGON and MULTIPOLYGON tables made with earlier versions can still be restored into HeavyDB Release 7.x.x. The superfluous metadata is automatically discarded. However, dumps of POLYGON and MULTIPOLYGON tables made with HeavyDB Release 7.x.x are not backwards-compatible with earlier versions.

This applies only to tables with POLYGON or MULTIPOLYGON columns. Tables that contain other geo column types (POINT, LINESTRING, etc.), or only non-geo column types, do not require migration and remain backwards-compatible with earlier relea

For Ubuntu installations, install libncurses5 with the following command:

sudo apt install libncurses5

Release 7.2.4 - March 20, 2024

HeavyDB - Fixed Issues

Adds a new option for enabling or disabling the use of virtual addressing when accessing an S3 compatible endpoint for import or HeavyConnect.
Improves logging related to system locks.

Heavy Immerse - Fixed Issues

Fixes issue with SAML authentication.

Release 7.2.3 - February 5, 2024

HeavyDB - New Features and Improvements

Improves performance of foreign tables that are backed by Parquet files in AWS S3.
Improves logging related to GPU memory allocations and data transfers.

HeavyDB - Fixed Issues

Fixes a crash that could occur for certain query patterns with intermediate geometry projections.
Fixes a crash that could occur for certain query patterns containing IN operators with string function operands.
Fixes a crash that could occur for equi join queries that use functions as operands.
Fixes an intermittent error that could occur in distributed configurations when executing count distinct queries.
Fixes an issue where certain query patterns with LIMIT and OFFSET clauses could return wrong results.
Fixes a crash that could occur for certain query patterns with left joins on Common Table Expressions.
Fixes a crash that could occur for certain queries with window functions containing repeated window frames.

Heavy Render - Fixed Issues

Fix several crashes that could occur during out-of-gpu memory error recovery

Heavy Immerse - Fixed Issues

Fixed dashboard load error when switching tabs.
Fixed table reference in size measure of a client-side join data source for linemap chart.
Fixed client-side join name reference.

Release 7.2.2 - December 15, 2023

HeavyDB - New Features and Improvements

Adds support for output/result set buffer allocations via the "cpu-buffer-mem-bytes" configured CPU memory buffer pool. This feature can be enabled using the "use-cpu-mem-pool-for-output-buffers" server configuration parameter.
Adds a "ndv-group-estimator-multiplier" server configuration parameter that determines how the number of unique groups are estimated for specific query patterns.
Adds "default-cpu-slab-size" and "default-gpu-slab-size" server configuration parameters that are used to determine the default slab allocation size. The default size was previously based on the "max-cpu-slab-size" and "max-gpu-slab-size" configuration parameters.
Improves memory utilization when querying the "dashboards" system table.
Improves memory utilization in certain cases where queries are retried on CPU.
Improves error messages that are returned for some unsupported correlated subquery use cases.

HeavyDB - Fixed Issues

Fixes an issue where allocations could go beyond the configured "cpu-buffer-mem-bytes" value when fetching table chunks.
Fixes a crash that could occur when executing concurrent sort queries.
Fixes a crash that could occur when invalid geometry literals are passed to ST functions.

Heavy Immerse - Fixed Issues

Fix for rendering a gauge chart using a parameterized source (join sources, custom sources).

Release 7.2.1 - December 4, 2023

HeavyDB - New Features and Improvements

Improves instrumentation around Parquet import and HeavyConnect.

HeavyDB - Fixed Issues

Fixes a crash that could occur for join queries that result in many bounding box overlaps.
Fixes a crash that could occur in certain cases for queries containing an IN operator with a subquery parameter.
Fixes an issue where the ST_POINTN function could return wrong results when called with negative indexes.
Fixes an issue where a hang could occur while parsing a complex query.

Heavy Render - Fixed Issues

Fixed error when setting render-mem-bytes greater than 4gb.

Heavy Immerse - Fixed Issues

Clamp contour interval size on the Contour Chart to prevent a modulo operation error.
Filter outlier values in the Contour Chart that skew color range.
Fixed sample ratio query ordering to address a pointmap rendering issue.
Fixed layer naming in the Hide Layer menu.

Release 7.2.0 - November 16, 2023

HeavyDB - New Features and Improvements

Adds support for URL_ENCODE, URL_DECODE, REGEXP_COUNT, and HASH string functions.
Enables log based system tables by default.
Adds support for log based system tables auto refresh behind a flag (Beta).
Improves the pre-flight query row count estimation process for projection queries without filters.
Improves the performance of the LIKE operator.

HeavyDB - Fixed Issues

General

Fixes errors that could occur when the REPLACE clause is applied to SQL DDL commands that do not support it.
Fixes an issue where the HeavyDB startup script could ignore command line arguments in certain cases.
Fixes a crash that could occur when requests were made to the detect_column_types API for Parquet files containing list columns.
Fixes a crash that could occur in heavysql when the \detect command is executed for Parquet files containing string list columns.
Fixes a crash that could occur when attempting to cast to text column types in SELECT queries.
Fixes a crash that could occur in certain cases where window functions were called with literal arguments.
Fixes a crash that could occur when executing the ENCODE_TEXT function on NULL values.
Fixes an issue where queries involving temporary tables could return wrong results due to incorrect cache invalidation.

Geo

Fixes an issue where the ST_Distance function could return wrong results when at least one of its arguments is NULL.
Fixes an issue where the ST_Point function could return wrong results when the "y" argument is NULL.
Fixes an issue where the ST_NPoints function could return wrong results for NULL geometries.
Fixes a crash that could occur when the ST_PointN function is called with out-of-bounds index values.
Fixes an issue where the ST_Intersects and ST_Contains functions could incorrectly result in loop joins based on table order.
Fixes an issue where the ST_Transform function could return wrong results for NULL geometries.
Fixes an error that could occur for tables with polygon columns created from the output of user-defined table functions.

Heavy Immerse - New Features and Improvements

[Beta] Geo Joins - Immerse now supports “contains” and “intersects” conditions for common geometry combinations when creating a join datasource in the no code join editor.
Join datasource crossfilter support: Charts that use single table data sources will now crossfilter and be crossfiltered by charts that use join data sources.
Layer Drawer - In layered map charts immerse now has a quick to access Layer Drawer, which allows for layer toggling, reordering, renaming, opacity, zoom visibility controls.
Zoom to filters - Map charts in immerse now support “zoom to filters” functionality, either on an individual chart layer (via the Layer Drawer) or on the whole chart.
Image support in map rollovers - URLs pointing to images will automatically be rendered as a scaled image, with clickthrough support to the full size image.

Heavy Immerse - Fixed Issues

Choropleth/Line Map join datasource support - Significantly improves performance in Choropleth and Line Map charts when using join data sources. Auto aggregates measures on geometry.
Fixes issue where sql editor will horizontally scroll with long query strings

Release 7.1.2 - October 4, 2023

HeavyDB - New Features and Improvements

Improves how memory is allocated for the APPROX_MEDIAN aggregate function.

HeavyDB - Fixed Issues

Fixes a crash that could occur when the DISTINCT qualifier is specified for aggregate functions that do not support the distinct operation.
Fixes an issue where wrong results could be returned for queries with window functions that return null values.
Fixes a crash that could occur in certain cases where queries have multiple aggregate functions.
Fixes a crash that could occur when tables are created with invalid options.
Fixes a potential data race that could occur when logging cache sizes.

Release 7.1.1 - September 15, 2023

HeavyDB - New Features and Improvements

Adds an EXPLAIN CALCITE DETAILED command that displays more details about referenced columns in the query plan.
Improved logging around system memory utilization for each query.
Adds an option to SQLImporter for disabling logging of connection strings.
Adds a "gpu-code-cache-max-size-in-bytes" server configuration parameter for limiting the amount of memory that can be used by the GPU code cache.
Improves column name representation in Parquet validation error messages.

HeavyDB - Fixed Issues

Fixes a parser error that could occur for queries containing a NOT ILIKE clause.
Fixes a multiplication overflow error that could occur when retrying queries on CPU.
Fixes an issue where table dumps do not preserve quoted column names.
Fixes a "cannot start a transaction within a transaction" error that could occur in certain cases.
Fixes a crash that could occur for certain query patterns involving division by COUNT aggregation functions.
Removes a warning that is displayed on server startup when HeavyIQ is not configured.
Removes spurious warnings for CURSOR type checks when there are both cursor and scalar overloads for a user-defined table function.

Heavy Render - New Features and Improvements

Adds hit testing support for custom measures that reference multiple tables.

Heavy Immerse - Fixed Issues

Fixes SAML authentication regression in 7.1.0
Fixes chart export regression in 7.1.0

Release 7.1.0 - August 22, 2023

HeavyDB - New Features and Improvements

Geospatial

Exposes new geo overlaps function ST_INTERSECTSBOX for very fast bounding box intersection detections.
Adds support for the max_reject COPY FROM option when importing raster files. This ensures that imports from large multi-file raster datasets continue after minor errors, but provides adjustable notification upon major ones.
Adds a new ST_AsBinary (also aliased as ST_AsWKB) function that returns the Well-Known Binary (WKB) representation of geometry values. This highly-efficient format is used by postGIS newer versions of Geopandas.
Adds a new ST_AsText (also aliased AS ST_AsWKT) function that returns the Well-Known Text (WKT) representation of geometry values. This is less efficient than WKB but compatible even with nonspatial databases.
Adds support for loading geometry values using the load_table_binary_arrow Thrift API.
New version of HeavyAI python library with direct Geopandas support.
New version of rbc-project with geo column support allowing extensions which input or output any geometric type.

Core SQL

New JAROWINKLER_SIMILARITY string operator for fuzzy matching between string columns and values. This is a case insensitive measure including edit transitions and (slightly) sensitive to white space.
New LEVENSHTEIN_DISTANCE string operator for fuzzy matching between string columns and values. This is case insensitive and represents the number of edits needed to make two strings identical. An “edit” is defined by either an insertion of a character, a deletion of a character, or a replacement of a character.
Extends the ALTER COLUMN TYPE command to support string dictionary encoding size reduction.
Improves the error message returned when out of bound values are inserted into FLOAT and DOUBLE columns.
Adds a "watchdog-max-projected-rows-per-device" server configuration parameter and query hint that determines the maximum number of rows that can be projected by each GPU and CPU device.
Adds a "preflight-count-query-threshold" server configuration parameter and query hint that determines the threshold at which the preflight count query optimization should be executed.
Optimizes memory utilization for projection queries on instances with multiple GPUs.

Predictive Modeling with HeavyML

Support for PCA models and PCA_PROJECT operator.
Support SHOW MODEL FEATURE DETAILS to show per-feature info for models, including regression coefficients and variable importance scores, if applicable.
Support for TRAIN_FRACTION option to specify proportion of the input data to a CREATE MODEL statement that should be trained on.
Support creation of models with only categorical predictors.
Enable categorical and numeric predictors to be specified in any order for CREATE MODEL statements and subsequent inference operations.
Enable Torch table functions (requires client to specify libtorch.so).
Add tf_torch_raster_object_detect for raster object detections (requires client to specify libtorch.so and provide trained model in torchscript format).

Extensions Framework

Allow Array literals as arguments to scalar UDFs
Support table function (UDTF) output row sizes up to 16 trillion rows
Adds support for Column<TextEncodingNone> and ColumnList<TextEncodingNone> table function inputs and outputs.

Performance Optimizations

SQL projections now are sized per GPU/CPU core instead of globally, meaning that projections are more memory efficient as a function of the number of GPUs/CPU threads used for a query. In particular, this means that various forms of in-situ rendering, for example, non-grouped pointmaps, renders can scale to N GPUs more points or use N GPUs less memory, depending on the configuration.
Better parallelize construction of metadata for subquery results for improved performance
Enables result set caching for queries with LIMIT clauses.
Enables the bounding box intersection optimization for certain spatial join operators and geometry types by default.

HeavyDB - Fixed Issues

Fix potential crash when concatenating strings with the output of a UDF.
Fixes an issue where deleted rows with malformed data can prevent ALTER COLUMN TYPE command execution.
Fixes an error that could occur when parsing odbcinst.ini configuration files containing only one installed driver entry.
Fixes a table data corruption issue that could occur when the server crashes multiple times while executing write queries.
Fixes a crash that could occur when attempting to do a union of a string dictionary encoded text column and a none encoded text column.
Fixes a crash that could occur when the output of a table function is used as an argument to the strtok_to_array function.
Fixes a crash that could occur for queries involving projections of both geometry columns and geometry function expressions.
Fixes an issue where wrong results could be returned when the output of the DATE_TRUNC function is used as an argument to the count distinct function.
Fixes an issue where an error occurs if the COUNT_IF function is used in an arithmetic expression.
Fixes a crash that could occur when the WIDTH_BUCKET function is called with decimal columns.
Fixes an issue where the WIDTH_BUCKET function could return wrong results when called with decimal values close to the upper and lower boundary values.
Fixes a crash that could occur for queries with redundant projection steps in the query plan.

Heavy Render - Fixed Issues

Fixes a crash that could occur on multi-gpu systems while handling an out of GPU memory error.

Heavy Immerse - New Features and Improvements

Zoom to filters, setting map bounding box to extent of current filter set.
Image preview in map chart popups where image URLs are present.

Heavy Immerse - Fixed Issues

Fixed error thrown by choropleth chart on polygon hover.

Release 7.0.2 - June 28, 2023

HeavyDB - New Features and Improvements

Adds support for nested window function expressions.
Adds support for exception propagation from table functions.

HeavyDB - Fixed Issues

Fixes a crash that could occur when accessing 8-bit or 16-bit string dictionary encoded text columns on ODBC backed foreign tables.
Fixes unexpected GPU execution and memory allocations that could occur when executing sort queries with the CPU mode query hint.
Fixes an issue that could occur when inserting empty strings for geometry columns.
Fixes an issue that could occur when out of bounds fragment sizes are specified on table creation.
Fixes an issue where system dashboards could contain unexpected cached data.
Fixes a crash that could occur when executing aggregate functions over the result of join operations on scalar subqueries.
Fixes a server hang that could occur when GPU code compilation errors occur for user-defined table functions.
Fixes a data race that could occur when logging query plan cache size.

Heavy Render - New Features and Improvements

Add support for rendering 1D “terrain” cross-section overlays.
Rewrite 2D cross-section mesh generation as a table function.
Further improvements to system state logging when a render out of memory error occurs, and move it to the ERROR log for guaranteed visibility.
Enable auto-clear-render-mem by default for any render-vega call taking < 10 seconds.

Heavy Render - Fixed Issues

Render requests with 0 width or height could lead to a CHECK failure in encodePNG. Invalid image sizes now throw a non-fatal error during vega parsing.

Heavy Immerse - New Features and Improvements

Visualize terrain at the base of atmospheric cross sections in the Cross Section chart with the new Base Terrain chart layer type.

Heavy Immerse - Fixed Issues

Fixed local timezone issue with Chart Animation using cross filter replay.

Release 7.0.1 - June 8, 2023

HeavyDB - New Features and Improvements

Improves instrumentation around CPU and GPU memory utilization and certain crash scenarios.

HeavyDB - Fixed Issues

Fixes a crash that could occur for GPU executed join queries on dictionary encoded text columns with NULL values.

Heavy Render - New Features and Improvements

Improve instrumentation and logging related to gpu memory utilization, particularly with polygon rendering, as well as command timeout issues

Heavy Render - Fixed Issues

Fix a potential segfault when a Vulkan device lost error occurs

Release 7.0.0 - May 1, 2023

HeavyDB - New Features and Improvements

IMPORTANT - In HeavyDB Release 7.0, the “render groups” mechanism, part of the previous implementation of polygon rendering, has been removed. When you upgrade to HeavyDB Release 7.0, all existing tables that have a POLYGON or MULTIPOLYGON geo column are automatically migrated to remove a hidden column containing "render groups" metadata.

This operation is performed on all tables in all catalogs at first startup, and the results are recorded in the INFO log.

Dumps of POLYGON and MULTIPOLYGON tables made with earlier versions can still be restored into HeavyDB Release 7.0. The superfluous metadata is automatically discarded. However, dumps of POLYGON and MULTIPOLYGON tables made with HeavyDB Release 7.0 are not backwards-compatible with earlier versions.

For Ubuntu installations, install libncurses5 with the following command:

sudo apt install libncurses5

Adds new Executor Resource Manager enabling parallel CPU and CPU-GPU query execution, and support for CPU execution on data inputs larger than fit in memory.
Adds HeavyML, a suite of machine learning capabilities accessible directly in SQL, including support for linear regression, random forest, gradient boosted trees, and decision tree regression models, and KMeans and DBScan clustering methods. (BETA)
Adds HeavyConnect support for MULTIPOINT and MULTILINESTRING columns.
Adds ALTER COLUMN TYPE support for text columns.
Adds a REASSIGN ALL OWNED command that allows for object ownership change across all databases.
Adds an option for validating POLYGON and MULTIPOLYGON columns when importing using the COPY FROM command or when using HeavyConnect.
Adds support for CONDITIONAL_CHANGE_EVENT window function.
Adds support for automatic casting of table function CURSOR arguments.
Adds support for Column<GeoMultiPolygon>, Column<GeoMultiLineString>, and Column<GeoMultiPoint> table function inputs and outputs.
Adds support for none encoded text column, geometry column, and array column projections from the right table in left join queries.
Adds support for literal text scalar subqueries.
Adds support for ST_X and ST_Y function output cast to text.
Improves concurrent execution of DDL and SHOW commands.
Improves error messaging for when the storage directory is missing.
Optimizes memory utilization for auto-vacuuming after delete queries.

HeavyDB - Fixed Issues

Fixes an issue where the root user could be deleted in certain cases.
Fixes an issue where staging directories for S3 import could remain when imports failed.
Fixes a crash that could occur when accessing the "tables" system table on instances containing tables with many columns.
Fixes a crash that could occur when accessing CSV and regex parsed file foreign tables that previously errored out during cache recovery.
Fixes an issue where dumping table foreign tables would produce an empty table.
Fixes an intermittent crash that could occur when accessing CSV and regex parsed file foreign tables that are backed by large files.
Fixes a "Ran out of slots in the query output buffer" exception that could occur when using stale cached cardinality values.
Fixes an issue where user defined table functions are erroneously categorized as ambiguous.
Fixes an error that could occur when a group by clause includes an alias that matches a column name.
Fixes a crash that could occur on GPUs with the Pascal architecture when executing join queries with case expression projections.
Fixes a crash that could occur when using the LAG_IN_FRAME window function.
Fixes a crash that could occur when projecting geospatial columns from the tf_raster_contour_polygons table function.
Fixes an issue that could occur when calling window functions on encoded date columns.
Fixes a crash that could occur when the coalesce function is called with geospatial or array columns.
Fixes a crash that could occur when projecting case expressions with geospatial or array columns.
Fixes a crash that could occur due to rounding error when using the WIDTH_BUCKET function.
Fixes a crash that could occur in certain cases where left join queries are executed on GPU.
Fixes a crash that could occur for queries with joins on encoded date columns.
Fixes a crash that could occur when using the SAMPLE function on a geospatial column.
Fixes a crash that could occur for table functions with cursor arguments that specify no field type.
Fixes an issue where automatic casting does not work correctly for table function calls with ColumnList input arguments.
Fixes an issue where table function argument types are not correctly inferred when arithmetic operations are applied.
Fixes an intermittent crash that could occur for join queries due to a race condition when changing hash table layouts.
Fixes an out of CPU memory error that could occur when executing a query with a count distinct function call on a high cardinality column.
Fixes a crash that could occur when running a HeavyDB instance in read-only mode after previously executing write queries on tables.
Fixes an issue where the auto-vacuuming process does not immediately evict chunks that were pulled in for vacuuming.
Fixes a crash that could occur in certain cases when HeavyConnect is used with Parquet files containing null string values.
Fixes potentially inaccurate calculation of vertical attenuation from antenna patterns in HeavyRF.

Heavy Render - New Features and Improvements

Add support for rendering a 1d cross-section as a line
Package the Vulkan loader libVulkan1 alongside heavydb

Heavy Render - Fixed Issues

Fix a device lost error that could occur with complex polygon renders

Heavy Immerse - New Features and Improvements

Data source Joins as a new custom data source type. (BETA)
Adds improved query performance defaults for the Contour Chart.
Adds access to new control panel to users with role "immerse_control_panel", even if the user is not a superuser.
Adds custom naming of map layers.
Adds custom map layer limit option using flag “ui/max_map_layers” which can be set explicitly (defaults to 8) or to -1 to remove the limit.

Heavy Immerse - Fixed Issues

Renames role from “immerse_trial_mode” to “immerse_export_disabled” and renames corresponding flag from “ui/enable_trial_mode” to “ui/user_export_disabled”.
Various minor UI fixes and polishing.
Fixes an issue where changing parameter value causes Choropleth popup to lose selected popup columns.
Fixes an issue where changing parameter value causes Pointmap to lose selected popup columns.
Fixes an issue where building a Skew-T chart results in a blank browser page.
Fixes an issue where Skew-T chart did not display wind barbs.
Fixes an issue with default date and time formatting.
Fixes an issue where setting flag "ui/enable_map_exports" to false unexpectedly disabled table chart export.
Fixes an issue with date filter presets.
Fixes an issue where filters "Does Not Contain" or "Does not equal" did not work on Crosslinked Columns.
Fixes an issue where charts were not redrawing to show the current bounding box filter set by the Linemap chart.

Release 6.4.4 - May 2, 2023

HeavyDB - New Features and Improvements

Adds support for literal text scalar subqueries.

HeavyDB - Fixed Issues

Fixes a crash that could occur due to rounding error when using the WIDTH_BUCKET function.
Fixes a crash that could occur for queries with joins on encoded date columns.
Fixes a crash that could occur when running a HeavyDB instance in read-only mode after previously executing write queries on tables.
Fixes an issue where the auto-vacuuming process does not immediately evict chunks that were pulled in for vacuuming.

Heavy Immerse - Fixed Issues

Fixed issue where Skew-T chart would not render when nulls were used in selected data.
Fixed issue where wind barbs were not visible on Skew-T chart.

Release 6.4.3 - February 27, 2023

Heavy Immerse - New Features and Improvements

Added feature flag ui/session_create_timeout with a default value of 10000 (10 seconds) for modifying login request timeout.

Release 6.4.2 - February 15, 2023

HeavyDB - New Features and Improvements

Adds the HeavyDB server configuration parameter enable-foreign-table-scheduled-refresh for enabling or disabling automated foreign table scheduled refreshes..

HeavyDB - Fixed Issues

Fixes a crash that could occur when S3 CSV-backed foreign tables with append refreshes are refreshed multiple times.
Fixes a crash that could occur when foreign tables with geospatial columns are refreshed after cache evictions.
Fixes a crash that could occur when querying foreign tables backed by Parquet files with empty row groups.
Fixes an error that could occur when select queries used in ODBC foreign tables reference case sensitive column names.
Fixes a crash that could occur when CSV backed foreign tables with geospatial columns are refreshed without updates to the underlying CSV files.
Fixes a crash that could occur in heavysql when executing the \detect command with geospatial files.
Fixes a casting error that could occur when executing left join queries.
Fixes a crash that could occur when accessing the disk cache on HeavyDB servers with the read-only configuration parameter enabled.
Fixes an error that could occur when executing queries that project geospatial columns.
Fixes a crash that could occur when executing the EXTRACT function with the ISODOW date_part parameter on GPUs.
Fixes an error that could occur when importing CSV or Parquet files with text columns containing more than 32,767 characters into HeavyDB NONE ENCODED text columns.

Heavy Render - Fixed Issues

Fixes a Vulkan Device Lost error that could occur when rendering complex polygon data with thousands of polygons in a single pixel.

Release 6.4.1 - January 30, 2023

HeavyDB - New Features and Improvements

Optimizes result set buffer allocations for CPU group by queries.
Enables trimming of white spaces in quoted fields during CSV file imports, when both the trim_spaces and quoted options are set.

HeavyDB - Fixed Issues

Fixes an error that could occur when importing CSV files with quoted fields that are surrounded by white spaces.
Fixes a crash that could occur when tables are reordered for range join queries.
Fixes a crash that could occur for join queries with intermediate projections.
Fixes a crash that could occur for queries with geospatial join predicate functions that use literal parameters.
Fixes an issue where queries could intermittently and incorrectly return error responses.
Fixes an issue where queries could return incorrect results when filter push-down through joins is enabled.
Fixes a crash that could occur for queries with join predicates that compare string dictionary encoded and nonencoded text columns.
Fixes an issue where hash table optimizations could ignore the max-cacheable-hashtable-size-bytes and hashtable-cache-total-bytes server configuration parameters.
Fixes an issue where sharded table join queries that are executed on multiple GPUs could return incorrect results.
Fixes a crash that could occur when sharded table join queries are executed on multiple GPUs with the from-table-reordering server configuration parameter enabled.

Heavy Immerse - New Features and Improvements

Multilayer support for Contour and Windbarb charts.
Enable Contour charts by default (feature flag: ui/enable_contour_chart).
Support custom SQL measures in Contour charts.
Restrict export from Heavy Immerse by enabling trial mode (feature flag: ui/enable_trial_mode). Trial mode enables a super user to restrict export capabilities for users who have the immerse_trial_mode role.

Heavy Immerse - Fixed Issues

Allow MULTILINESTRING to be used in selectors for Linemap charts.
Allow MULTILINESTRING to be used in Immerse SQL Editor.

Release 6.4.0 - December 16, 2022

This release features general availability of data connectors for PostGreSQL, beta Immerse connectors for Snowflake and Redshift, and SQL support for Google BigQuery and Hive (beta). These managed data connections let you use HEAVY.AI as an acceleration platform wherever your source data may live. Scheduling and automated caching ensure that fast analytics are always running on the latest available data.

Immerse features four new chart types: Contour, Cross-section, Wind barb, and Skew-t. While especially useful for atmospheric and geotechnical data visualization, Contour and Cross-section also have more general application.

Major improvements for time series analysis have been added. This includes an Immerse user interface for time series, and a large number of SQL window function additions and performance enhancements.

The release also includes two major architectural improvements:

The ability to perform cross-database queries, both in SQL and in Immerse, increasing flexibility across the board. For example, you can now easily build an Immerse dashboard showing system usage combined with business data. You might also make a read-only database of data shared across a set of users.
Render queries no longer block other GPU queries. In many use cases, renders can be significantly slower than other common queries. This should result in significant performance gains, particularly in map-heavy dashboards.

HeavyDB - New Features and Improvements

Core SQL

Adds support for cross database SELECT, UPDATE, and DELETE queries.
Support for MODE SQL aggregate.
Add support for strtok_to_array.
Support for ST_NumGeometries().
Support ST_TRANSFORM applied to literal geo types.
Enhanced query tracing ensures all child operations for a query_id are properly logged with that ID.

Data Linking and Import

Adds support for BigQuery and Hive HeavyConnect and import.
Adds support for table restore from S3 archive files.
Improves integer column type detection in Snowflake import/HeavyConnect data preview.
Adds HeavyConnect and import support for Parquet required scalar fields.
Improves import status error message when an invalid request is made.

Table Function Enhancements

Support POINT, LINESTRING, and POLYGON input and output types in table functions.
Support default values for scalar table function arguments.
Add tf_raster_contour table function to generate contours given x, y, and z arguments. This function is exposed in Immerse, but has additional capabilities available in SQL, such as supporting floating point contour intervals.
Return file path and file name from tf_point_cloud_metadata table function.
Previous length limit of 32K characters per values for none-encoded text columns has been lifted, and now none-encoded text values can be up to 2^31 - 1 characters (approximately 2.1billion characters).
Support array column outputs from table functions.
Add TEXT ENCODING DICT and Array<TEXT ENCODING DICT> type support for runtime functions/UDFs.
Allow transient TEXT ENCODING DICT column inputs into table functions.

Window Function Enhancements

Support COUNT_IF function.
Support SUM_IF function.
Support NTH_VALUE window function.
Support NTH_VALUE_IN_FRAME window function.
Support FIRST_VALUE_IN_FRAME and LAST_VALUE_IN_FRAME window functions.
Support CONDITIONAL_TRUE_EVENT.
Support ForwardFill and BackwardFill window functions to fill in missing (null) values based on previous non-null values in window.

HeavyDB - Fixed Issues

Fixes an issue where databases with duplicate names but different capitalization could be created.
Fixes an issue where raster imports could fail due to inconsistent band names.
Fixes an issue that could occur when DUMP/RESTORE commands were executed concurrently.
Fixes an issue where certain session updates do not occur when licenses are updated.
Fixes an issue where import/HeavyConnect data preview could return unsupported decimal types.
Fixes an issue where import/HeavyConnect data preview for PostgreSQL queries involving variable length columns could result in an error.
Fixes an issue where NULL elements in array columns with the NOT NULL constraint were not projected correctly.
Fixes a crash that could occur in certain scenarios where UPDATE and DELETE queries contain subqueries.
Fixes an issue where ingesting ODBC unsigned SQL_BIGINT into HeavyDB BIGINT columns using HeavyConnect or import could result in storage of incorrect data.
Fixes a crash that could occur in distributed configurations, when switching databases and accessing log based system tables with rolled off logs.
Fixes an error that occurred when importing Parquet files that did not contain statistics metadata.
Ensure query hint is propagated to subqueries.
Fix crash that could occur when LAG_IN_FRAME or LEAD_IN_FRAME were missing order-by or frame clause.
Fix bug where LAST_VALUE window function could return wrong results.
Fix issue where “Cannot use fast path for COUNT DISTINCT” could be reported from a count distinct operation.
Various bug fixes for support of VALUES() clause.
Improve handling of generic input expressions for window aggregate functions.
Fix bug where COUNT(*) and COUNT(1) over window frame could cause crash.
Fix wrong coordinate used for origin_y_bin in tf_raster_graph_shortest_slope_weighted_path.
Speed up table function binding in cases with no ColumnList arguments.
Support arrays of transient encoded strings into table functions.

Heavy Render - New Features and Improvements

Render queries no longer block parallel execution queue for other queries.

Heavy Immerse - New Features and Improvements

The Immerse PostgreSQL connector is now generally available, and is joined by public betas of Redshift and Snowflake.
New chart types:
- Contour chart. Contours can be applied to any geo point data, but are especially useful when applied to smoothly-varying pressure and elevation data. They can help reveal general patterns even in noisy primary data. Contours can be based on any point data, including that from regular raster grids like a temperature surface, or from sparse points like LiDAR data.
- Cross-section chart. As the name suggests, this allows a new view on 2.5D or 3D datasets, where a selected data dimension is plotted on the vertical axis for a slice of geographic data. In addition to looking in profile at parts of the atmosphere in weather modeling, this can also be used to look at geological sections below terrain.
- Representing vector force fields takes a step forward with the Wind barb plot. Wind barbs are multidimensional symbols which convey at a glance both strength and direction.
- Skew-T is a highly specialized multidimensional chart used primarily by meteorologists. Skew-Ts are heavily used in weather modeling and can help predict, for example, where thunderstorms or dry lightning are likely to occur.
Initial support for window functions in Immerse, enabling time lag analysis in charts. For example, you can now plot month-over-month or quarter-over-quarter sales or web traffic volume.
For categorical data, in addition to supporting aggregations based on the number of unique values, MODE is now supported. This supports the creation of groups based on the most-common value.

Release 6.2.7 - November 1, 2022

HeavyDB - Fixed Issues

Fixed an issue where a restarted server can potentially deadlock if the first two queries are executed at the same time and use different executors.

Release 6.2.5 - October 26, 2022

HeavyDB - Fixed Issues

Fixed an issue where COUNT DISTINCT or APPROX_COUNT_DISTINCT, when run on a CASE statement that outputs literal strings, could cause a crash.

Release 6.2.4 - October 12, 2022

HeavyDB - Fixed Issues

Fixes a crash when using COUNT() or COUNT(1) with the window function, i.e., COUNT(*) OVER (PARTITION BY x).
Fixes an incorrect result when using a date column as a partition key, like SUM(x) OVER (PARTITION BY DATE_COL).
Improves the performance of window functions when a literal expression is used as one of the input expressions of window functions like LAG(x, 1).
Improves query execution preparation phase by preventing redundant processing of the same nodes, especially when a complex input query is evaluated.
Fixes geometry type checking for range join operator that could cause a crash in some cases.
Resolves a query that may return an incorrect result when it has many projection expressions (for example, more than 50 8-byte output expressions) when using a window function expression.
Fixes an issue where the Resultset recycler ignores the server configuration size metrics.
Fixes a race condition where multiple catalogs could be created on initialization, resulting in possible deadlocks, server hangs, increased memory pressure, and slow performance.

Release 6.2.1 - September 27, 2022

HeavyDB - Fixed Issues

Fixes a crash encountered during some SQL queries when the read-only setting was enabled.
Fixes an issue in tf_raster_graph_shortest_slope_weighted_path table function that would lead some inputs to be incorrectly rejected.

6.2.0 - September 23, 2022

In Release 6.2.0, Heavy Immerse adds animation and a control panel system. HeavyConnect now includes connectors for Redshift, Snowflake, and PostGIS. The SQL system is extended with support for casting and time-based window functions. GeoSQL gets direct LiDAR import, multipoints, and multilinestrings, as well as graph network algorithms. Other enhancements include performance improvements and reduced memory requirements across the product.

HeavyDB - New Features and Improvements

SQL Improvements

TRY_CAST support for string to numeric, timestamp, date, and time casts.
Implicit and explicit CAST support for numeric, timestamp, date, and time to TEXT type.
CAST support from Timestamp(0|3|6|9) types to Time(0) type.
Concat (||) operator now supports multiple nonliteral inputs.
JSON_VALUE operator to extract fields from JSON string columns.
BASE64_ENCODE and BASE64_DECODE operators for BASE64 encoding/decoding of string columns.
POSITION operator to extract index of search string from strings.
Add hash-based count distinct operator to better handle case of sparse columns.

Geospatial

Support MULTILINESTRING OGC geospatial type.
Support MULTIPOINT OGC geospatial type.
Support ST_NumGeometries.
Support ST_ConvexHull and ST_ConcaveHull.
Improved table reordering to maximize invocation of accelerated geo joins.
Support ST_POINT, ST_TRANSFORM and ST_SETSRID as expressions for probing columns in point-to-point distance joins.
Support accelerated overlaps hash join for ST_DWITHIN clause comparing two POINT columns.
Support for POLYGON to MULTIPOLYGON promotion in SQLImporter.

Window Functions

RANGE window function FRAME support for Time, Date, and Timestamp types.
Support LEAD_IN_FRAME / LAG_IN_FRAME window functions that compute LEAD / LAG in reference to a window frame.

Extension Functions

Add TextEncodingNone support for scalar UDF and extension functions.
Support array inputs and outputs to table functions.
Support literal interval types for UDTFs.
Add support for table functions range annotations for literal inputs

Performance and Control

Make max CPU threads configurable via a startup flag.
Support array types for Arrow/select_ipc endpoints.
Add support for query hint to control dynamic watchdog.
Add query hint to control Cuda block and grid size for query.
Adds an echo all option to heavysql that prints all executed commands and queries.
Improved decimal precision error messages during table creation.

HeavyConnect

Add support for file roll offs to HeavyConnect local and S3 file use cases.
Add HeavyConnect support for non-AWS S3-compatible endpoints.

Advanced Analytics

LiDAR

Add tf_point_cloud_metadata table function to read metadata from one or more LiDAR/point cloud files, optionally filtered by a bounding box.
Add tf_load_point_cloud table function to load data from one or more LiDAR/point cloud files, optionally filtered by bounding box and optionally cached in memory for subsequent queries.

Graph and Path Functions

Add tf_graph_shortest_path table function to compute shortest edge-weighted path between two points in a graph constructed from an input edge list
Add tf_graph_shortest_paths_distances table function to compute the shortest edge-weighted distances between a starting point and all other points in a graph constructed from an input edge list.
Add tf_grid_graph_shortest_slope_weighted_path table function to compute the shortest slope-weighted path between two points along rasterized data.

Enhanced Spatial Aggregations

Support configurable aggregation types for tf_geo_rasterize and tf_geo_rasterize_slope table functions, allowing for AVG, MIN, MAX, SUM, and COUNT aggregations.
Support two-pass gaussian blur aggregation post-processing for tf_geo_rasterize and tf_geo_rasterize_slope table functions.

RF Propagation Extension Improvements

Add dynamic ray splitting to tf_rf_prop_max_signal table function for improved performance and terrain coverage.
Add variant of tf_rf_prop_max_signal table function that takes per-RF source/tower transmission power (watts) and frequency (MHz).
Add variant of generate_series table function that generates series of timestamps between a start and end timestamp at specified time intervals.

Fixed Issues

ST_Centroid now automatically picks up SRID of underlying geometry.
Fixed a crash that occurred when ST_DISTANCE had an ST_POINT input for its hash table probe column.
Fixed an issue where a query hint would not propagate to a subquery.
Improved overloaded table function type deduction eliminates type mismatches when table function outputs are used downstream.
Properly handle cases of RF sources outside of terrain bounding box for tf_rf_prop_max_signal.
Fixed an issue where specification of unsupported GEOMETRY column type during table creation could lead to a crash.
Fixed a crash that could occur due to execution of concurrent create and drop table commands.
Fixed a crash that could occur when accessing the Dashboards system table.
Fixed a crash that could occur as a result of type mismatches in ITAS queries.
Fixed an issue that could occur due to band name sanitization during raster imports.
Fixed a memory leak that could occur when dropping temporary tables.
Fixed a crash that could occur due to concurrent execution of a select query and long-running write query on the same table.

Heavy Render - New Features and Improvements

Disables render group assignment by default.
Supports rendering of MULTILINESTRING geometries.
Memory footprint required for compositing renders on multi-GPU systems is significantly reduced. Any multi-GPU system will see improvements, but is most noticeable on systems with 4 or more GPUs. For example, rendering a 1400 x 1400 image results in ~450mb of memory saved when using 8 GPUs for a query. Multi-gpu system configurations should be able to set the res-gpu-mem configuration flag value lower as a result, freeing memory for other subsystems.
Adds INFO logging of peak render memory usage for the lifetime of the server process. The render memory logged is peak render query output buffer size (controlled with the render-mem-bytes configuration flag) and peak render buffer usage (controlled with the res-gpu-mem configuration flag). These peaks are logged in the INFO log on server shutdown, when GPU memory is cleared via clear_gpu_memory endpoint, or when a new peak is reached. These logged peaks can be useful to adjust the render-mem-bytes and res-gpu-mem configuration flags to improve memory utilization by avoiding reserving memory that might go unused. Examples of the log messages:
- When a new peak render-mem-bytes is reached: New peak render buffer usage (render-mem-bytes):37206200 of 1000000000
- When a new peak res-gpu-mem is reached: New peak render memory usage (res-gpu-mem): 166033024
- Peaks logged on server shutdown or on clear_gpu_memory: Render memory peak utilization:
  Query result buffer (render-mem-bytes): 37206200 of 1000000000 Images and buffers (res-gpu_mem): 660330240 Total allocated: 1660330240

Heavy Render - Fixed Issues

Fixed an issue the occurred when trying to hit-test a multiline SQL expression.

Heavy Immerse - New Features and Improvements

Dashboard and chart image export
Crossfilter replay
Improved popup support in the base 3D chart
New Multilayer CPU rendered Geo charts: Pointmap, Linemap, and Choropleth (Beta)
Control Panel (Beta)
Redshift, Snowflake, and PostGIS HeavyConnect support (Beta)
Skew-T chart (Beta)
Support for limiting the number of charts in a dashboard through the ui/limit_charts_per_dashboard feature flag. The default value is 0 (no limit).

Heavy Immerse - Fixed Issues

Fixed duplicate column names importer error.
Various bug fixes and user-interface improvements.

Functions and Operators

Functions and Operators (DML)

Basic Mathematical Operators

Operator

Description

+numeric

Returns numeric

–numeric

Returns negative value of numeric

numeric1 + numeric2

Sum of numeric1 and numeric2

numeric1 – numeric2

Difference of numeric1 and numeric2

numeric1 * numeric2

Product of numeric1 and numeric2

numeric1 / numeric2

Quotient (numeric1 divided by numeric2)

Mathematical Operator Precedence

Parenthesization
Multiplication and division
Addition and subtraction

Comparison Operators

Operator

Description

=

Equals

<>

Not equals

>

Greater than

>=

Greater than or equal to

<

Less than

<=

Less than or equal to

BETWEEN x AND y

Is a value within a range

NOT BETWEEN x AND y

Is a value not within a range

IS NULL

Is a value that is null

IS NOT NULL

Is a value that is not null

NULLIF(x, y)

Compare expressions x and y. If different, return x. If they are the same, return null. For example, if a dataset uses ‘NA’ for null values, you can use this statement to return null using SELECT NULLIF(field_name,'NA').

IS TRUE

True if a value resolves to TRUE.

IS NOT TRUE

True if a value resolves to FALSE.

Mathematical Functions

Function

Description

ABS(x)

Returns the absolute value of x

CEIL(x)

Returns the smallest integer not less than the argument

DEGREES(x)

Converts radians to degrees

EXP(x)

Returns the value of e to the power of x

FLOOR(x)

Returns the largest integer not greater than the argument

LN(x)

Returns the natural logarithm of x

LOG(x)

Returns the natural logarithm of x

LOG10(x)

Returns the base-10 logarithm of the specified float expression x

MOD(x,y)

Returns the remainder of int x divided by int y

PI()

Returns the value of pi

POWER(x,y)

Returns the value of x raised to the power of y

RADIANS(x)

Converts degrees to radians

ROUND(x)

Rounds x to the nearest integer value, but does not change the data type. For example, the double value 4.1 rounds to the double value 4.

ROUND_TO_DIGIT (x,y)

Rounds x to y decimal places

SIGN(x)

Returns the sign of x as -1, 0, 1 if x is negative, zero, or positive

SQRT(x)

Returns the square root of x.

TRUNCATE(x,y)

Truncates x to y decimal places

WIDTH_BUCKET(target,lower-boundary,upper-boundary,bucket-count)

Define equal-width intervals (buckets) in a range between the lower boundary and the upper boundary, and returns the bucket number to which the target expression is assigned.

target - A constant, column variable, or general expression for which a bucket number is returned.
lower-boundary - Lower boundary for the range of values to be partitioned equally.
upper-boundary - Upper boundary for the range of values to be partitioned equally.
partition_count - Number of equal-width buckets in the range defined by the lower and upper boundaries.

Expressions can be constants, column variables, or general expressions.

Example Create 10 age buckets of equal size, with lower bound 0 and upper bound 100 ([0,10], [10,20]... [90,100]), and classify the

age of a customer accordingly:

SELECT WIDTH_BUCKET(age, 0, 100, 10) FROM customer;

For example, a customer of age 34 is assigned to bucket 3 ([30,40]) and the function returns the value 3.

Trigonometric Functions

Function

Description

ACOS(x)

Returns the arc cosine of x

ASIN(x)

Returns the arc sine of x

ATAN(x)

Returns the arc tangent of x

ATAN2(y,x)

Returns the arc tangent of (x, y) in the range (-π,π]. Equal to ATAN(y/x) for x > 0.

COS(x)

Returns the cosine of x

COT(x)

Returns the cotangent of x

SIN(x)

Returns the sine of x

TAN(x)

Returns the tangent of x

Geometric Functions

Function

Description

DISTANCE_IN_METERS(fromLon, fromLat, toLon, toLat)

Calculates distance in meters between two WGS84 positions.

CONV_4326_900913_X(x)

Converts WGS84 latitude to WGS84 Web Mercator x coordinate.

CONV_4326_900913_Y(y)

Converts WGS84 longitude to WGS84 Web Mercator y coordinate.

String Functions

Function

Description

BASE64_DECODE(str)

Decodes a BASE64-encoded string.

BASE64_ENCODE(str)

Encodes a string to a BASE64-encoded string.

CHAR_LENGTH(str)

Returns the number of characters in a string. Only works with unencoded fields (ENCODING set to none).

str1 || str2 [ || str3... ]

Returns the string that results from concatenating the strings specified. Note that numeric, date, timestamp, and time types will be implicitly casted to strings as necessary, so explicit casts of non-string types to string types is not required for inputs to the concatenation operator. Note that concatenating a variable string with a string literal, i.e. county_name || ' County' is significantly more performant than concatenating two or more variable strings, i.e. county_name || ', ' || state_name. Hence for for multi-variable string concatenation, it is recommended to use an update statement to materialize the concatenated output rather than performing it inline when such operations are expected to be routinely repeated.

ENCODE_TEXT(none_encoded_str)

Converts a none-encoded string to a transient dictionary-encoded string to allow for operations like group-by on top. When the watchdog is enabled, the number of strings that can be casted using this operator is capped by the value set with the watchdog-none-encoded-string-translation-limit flag (1,000,000 by default).

HASH(str)

Deterministically Hashes a string input to a BIGINT output using a pseudo-random function. Can be useful for bucketing string values or deterministcally coloring by string values for a high-cardinality TEXT column. Note that currently HASH only accepts TEXT inputs, but in the future may also accept other data types. It should also be noted that NULL values always hash to NULL outputs.

INITCAP(str)

Returns the string with initial caps after any of the defined delimiter characters, with the remainder of the characters lowercased. Valid delimiter characters are !, ?, @, ", ^, #, $, &, ~, _, ,, ., :, ;, +, -, *, %, /, |, \, [, ], (, ), {, }, <, >.

JAROWINKLER_SIMILARITY( str1, str2 )

Computes the Jaro-Winkler similarity score between two input strings. The output will be an integer between 0 and 100, with 0 representing completely dissimilar strings, and 100 representing exactly matching strings.

JSON_VALUE(json_str, path)

Returns the string of a field given by path instr. Paths start with the $ character, with sub-fields split by . and array members indexed by [], with array indices starting at 0. For example, JSON_VALUE('{"name": "Brenda", "scores": [89, 98, 94]}', '$.scores[1]') would yield a TEXT return field of '98'. Note that currentlyLAX parsing mode (any unmatched path returns null rather than errors) is the default, and STRICT parsing mode is not supported.

KEY_FOR_STRING(str)

Returns the dictionary key of a dictionary-encoded string column.

LCASE(str)

Returns the string in all lower case. Only ASCII character set is currently supported. Same as LOWER.

LEFT(str, num)

Returns the left-most number (num) of characters in the string (str).

LENGTH(str)

Returns the length of a string in bytes. Only works with unencoded fields (ENCODING set to none).

LEVENSHTEIN_DISTANCE( str1, str2 )

Computes the edit distance, or number of single-character insertions, deletions, or substitutions, that must be made to make the first string equal the second. It returns an integer greater than or equal to 0, with 0 meaning the strings are equal. The higher the return value, the more the two strings can be thought of as dissimilar.

LOWER(str)

Returns the string in all lower case. Only ASCII character set is currently supported. Same as LCASE.

LPAD(str, len, [lpad_str ])

Left-pads the string with the string defined in lpad_str to a total length of len. If the optional lpad_str is not specified, the space character is used to pad. If the length of str is greater than len, then characters from the end of str are truncated to the length of len. Characters are added from lpad_str successively until the target length len is met. If lpad_str concatenated with str is not long enough to equal the target len, lpad_str is repeated, partially if necessary, until the target length is met.

LTRIM(str, chars)

Removes any leading characters specified in chars from the string. Alias for TRIM.

OVERLAY(strPLACING replacement_strFROM start [FORlen])

Replaces in str the number of characters defined in len with characters defined in replacement_str at the location start. Regardless of the length of replacement_str, len characters are removed from str unless start + replacement_str is greater than the length of str, in which case all characters from start to the end of str are replaced. Ifstart is negative, it specifies the number of characters from the end of str.

POSITION ( search_str IN str [FROM start_position])

Returns the position of the first character in search_str if found in str, optionally starting the search at start_position. If search_str is not found, 0 is returned. If search_str or str are null, null is returned.

REGEXP_COUNT(str, pattern [, position, [flags]])

Returns the number of times that the provided pattern occurs in the search string str. position specifies the starting position in str for which the search for pattern will start (all matches before position will be ignored. If position is negative, the search will start that many characters from the end of the string str. Use the following optional flags to control the matching behavior: c - Case-sensitive matching. i - Case-insensitive matching.

REGEXP_REPLACE(str, pattern [, new_str, position, occurrence, [flags]])

Replace one or all matches of a substring in string str that matches pattern , which is a regular expression in POSIX regex syntax.

new_str (optional) is the string that replaces the string matching the pattern. If new_str is empty or not supplied, all found matches are removed.

The occurrence integer argument (optional) specifies the single match occurrence of the pattern to replace, starting from the beginning of str; 0 (replace all) is the default. Use a negative occurrence argument to signify the nth-to-last occurrence to be replaced.

pattern uses .

Use a positive position argument to indicate the number of characters from the beginning of str. Use a negative position argument to indicate the number of characters from the end of str.

Back-references/capture groups can be used to capture and replace specific sub-expressions.

Use the following optional flags to control the matching behavior: c - Case-sensitive matching. i - Case-insensitive matching.

If not specified, REGEXP_REPLACE defaults to case sensitive search.

REGEXP_SUBSTR(str, pattern [, position, occurrence, flags, group_num])

Search string str for pattern, which is a , and return the matching substring.

Use position to set the character position to begin searching. Use occurrence to specify the occurrence of the pattern to match.

Use a positive position argument to indicate the number of characters from the beginning of str. Use a negative position argument to indicate the number of characters from the end of str.

The occurrence integer argument (optional) specifies the single match occurrence of the pattern to replace, with 0 being mapped to the first (1) occurrence. Use a negative occurrence argument to signify the nth-to-last group in pattern is returned.

Use optional flags to control the matching behavior: c - Case-sensitive matching.

e - Extract submatches. i - Case-insensitive matching.

The c and i flags cannot be used together; e can be used with either. If neither c nor i are specified, or if pattern is not provided, REGEXP_SUBSTR defaults to case-sensitive search.

If the e flag is used, REGEXP_SUBSTR returns the capture group group_num of pattern matched in str. If the e flag is used, but no capture groups are provided in pattern, REGEXP_SUBSTR returns the entire matching pattern, regardless of group_num. If the e flag is used but no group_num is provided, a value of 1 for group_num is assumed, so the first capture group is returned.

REPEAT(str, num)

Repeats the string the number of times defined in num.

REPLACE(str, from_str, new_str)

Replaces all occurrences of substring from_str within a string, with a new substring new_str.

REVERSE(str)

Reverses the string.

RIGHT(str, num)

Returns the right-most number (num) of characters in the string (str).

RPAD(str, len, rpad_str)

Right-pads the string with the string defined in rpad_str to a total length of len. If the optional rpad_str is not specified, the space character is used to pad. If the length of str is greater than len, then characters from the beginning of str are truncated to the length of len. Characters are added from rpad_str successively until the target length len is met. If rpad_str concatenated with str is not long enough to equal the target len, rpad_str is repeated, partially if necessary, until the target length is met.

RTRIM(str)

Removes any trailing spaces from the string.

SPLIT_PART(str, delim, field_num)

Split the string based on a delimiter delim and return the field identified by field_num. Fields are numbered from left to right.

STRTOK_TO_ARRAY(str, [delim])

Tokenizes the string str using optional delimiter(s) delim and returns an array of tokens. An empty array is returned if no tokens are produced in tokenization. NULL is returned if either parameter is a NULL.

SUBSTR(str, start, [len])

Alias for SUBSTRING.

SUBSTRING(str FROM start [ FOR len])

Returns a substring of str starting at index start for len characters.

The start position is 1-based (that is, the first character of str is at index 1, not 0). However, start 0 aliases to start 1.

If start is negative, it is considered to be |start| characters from the end of the string.

If len is not specified, then the substring from start to the end of str is returned.

If len is not specified, then the substring from start to the end of str is returned.

If start + len is greater than the length of str, then the characters in str from start to the end of the string are returned.

TRIM([BOTH | LEADING | TRAILING] [trim_str FROM str])

Removes characters defined in trim_str from the beginning, end, or both of str. If trim_str is not specified, the space character is the default. If the trim location is not specified, defined characters are trimmed from both the beginning and end of str.

TRY_CAST( str AS type)

Attempts to cast/convert a string type to any valid numeric, timestamp, date, or time type. If the conversion cannot be performed, null is returned. Note that TRY_CAST is not valid for non-string input types.

UCASE(str)

Returns the string in uppercase format. Only ASCII character set is currently supported. Same as UPPER.

UPPER(str)

Returns the string in uppercase format. Only ASCII character set is currently supported. Same as UCASE.

URL_DECODE( str )

Decode a url-encoded string. This is the inverse of the URL_ENCODE function.

URL_ENCODE( str )

Url-encode a string. Alphanumeric and the 4 characters: _-.~ are untranslated. The space character is translated to +. All other characters are translated into a 3-character sequence %XX where XX is the 2-digit hexadecimal ASCII value of the character.

Pattern-Matching Functions

Name

Example

Description

str LIKE pattern

'ab' LIKE 'ab'

Returns true if the string matches the pattern (case-sensitive)

str NOT LIKE pattern

'ab' NOT LIKE 'cd'

Returns true if the string does not match the pattern

str ILIKE pattern

'AB' ILIKE 'ab'

Returns true if the string matches the pattern (case-insensitive). Supported only when the right side is a string literal; for example, colors.name ILIKE 'b%

str REGEXP POSIX pattern

'^[a-z]+r$'

Lowercase string ending with r

REGEXP_LIKE ( str , POSIX pattern )

'^[hc]at'

cat or hat

Usage Notes

The following wildcard characters are supported by LIKE and ILIKE:

% matches any number of characters, including zero characters.
_ matches exactly one character.

Date/Time Functions

Function

Description

CURRENT_DATE

CURRENT_DATE()

Returns the current date in the GMT time zone.

Example:

SELECT CURRENT_DATE();

CURRENT_TIME

CURRENT_TIME()

Returns the current time of day in the GMT time zone.

Example:

SELECT CURRENT_TIME();

CURRENT_TIMESTAMP

CURRENT_TIMESTAMP()

Return the current timestamp in the GMT time zone. Same as NOW().

Example:

SELECT CURRENT_TIMESTAMP();

DATEADD('date_part', interval, date | timestamp)

Returns a date after a specified time/date interval has been added.

Example:

SELECT DATEADD('MINUTE', 6000, dep_timestamp) Arrival_Estimate FROM flights_2008_10k LIMIT 10;

DATEDIFF('date_part', date, date)

Returns the difference between two dates, calculated to the lowest level of the date_part you specify. For example, if you set the date_part as DAY, only the year, month, and day are used to calculate the result. Other fields, such as hour and minute, are ignored.

Example:

SELECT DATEDIFF('YEAR', plane_issue_date, now()) Years_In_Service FROM flights_2008_10k LIMIT 10;

DATEPART('interval', date | timestamp)

Returns a specified part of a given date or timestamp as an integer value. Note that 'interval' must be enclosed in single quotes.

Example:

SELECT DATEPART('YEAR', plane_issue_date) Year_Issued FROM flights_2008_10k LIMIT 10;

DATE_TRUNC(date_part, timestamp)

Truncates the timestamp to the specified date_part. DATE_TRUNC(week,...) starts on Monday (ISO), which is different than EXTRACT(dow,...), which starts on Sunday.

Example:

SELECT DATE_TRUNC(MINUTE, arr_timestamp) Arrival FROM flights_2008_10k LIMIT 10;

EXTRACT(date_part FROM timestamp)

Returns the specified date_part from timestamp.

Example:

SELECT EXTRACT(HOUR FROM arr_timestamp) Arrival_Hour FROM flights_2008_10k LIMIT 10;

INTERVAL 'count' date_part

Adds or Subtracts count date_part units from a timestamp. Note that 'count' is enclosed in single quotes.

Example:

SELECT arr_timestamp + INTERVAL '10' YEAR FROM flights_2008_10k LIMIT 10;

NOW()

Return the current timestamp in the GMT time zone. Same as CURRENT_TIMESTAMP().

Example:

NOW();

TIMESTAMPADD(date_part, count, timestamp | date)

Adds an interval of count date_part to timestamp or date and returns signed date_part units in the provided timestamp or date form.

Example:

SELECT TIMESTAMPADD(DAY, 14, arr_timestamp) Fortnight FROM flights_2008_10k LIMIT 10;

TIMESTAMPDIFF(date_part, timestamp1, timestamp2)

Subtracts timestamp1 from timestamp2 and returns the result in signed date_part units.

Example:

SELECT TIMESTAMPDIFF(MINUTE, arr_timestamp, dep_timestamp) Flight_Time FROM flights_2008_10k LIMIT 10;

Supported Types

Supported date_part types:

DATE_TRUNC [YEAR, QUARTER, MONTH, DAY, HOUR, MINUTE, SECOND, MILLISECOND, 
            MICROSECOND, NANOSECOND, MILLENNIUM, CENTURY, DECADE, WEEK, 
            WEEK_SUNDAY, QUARTERDAY]
EXTRACT    [YEAR, QUARTER, MONTH, DAY, HOUR, MINUTE, SECOND, MILLISECOND, 
            MICROSECOND, NANOSECOND, DOW, ISODOW, DOY, EPOCH, QUARTERDAY, 
            WEEK, WEEK_SUNDAY, DATEEPOCH]
DATEDIFF   [YEAR, QUARTER, MONTH, DAY, HOUR, MINUTE, SECOND, MILLISECOND, 
            MICROSECOND, NANOSECOND, WEEK]

Supported interval types:

DATEADD       [DECADE, YEAR, QUARTER, MONTH, WEEK, WEEKDAY, DAY, 
               HOUR, MINUTE, SECOND, MILLISECOND, MICROSECOND, NANOSECOND]
TIMESTAMPADD  [YEAR, QUARTER, MONTH, WEEKDAY, DAY, HOUR, MINUTE,
               SECOND, MILLISECOND, MICROSECOND, NANOSECOND]
DATEPART      [YEAR, QUARTER, MONTH, DAYOFYEAR, QUARTERDAY, WEEKDAY, DAY, HOUR,
               MINUTE, SECOND, MILLISECOND, MICROSECOND, NANOSECOND]

Accepted Date, Time, and Timestamp Formats

Datatype

Formats

Examples

DATE

YYYY-MM-DD

2013-10-31

DATE

MM/DD/YYYY

10/31/2013

DATE

DD-MON-YY

31-Oct-13

DATE

DD/Mon/YYYY

31/Oct/2013

EPOCH

1383262225

TIME

HH:MM

23:49

TIME

HHMMSS

234901

TIME

HH:MM:SS

23:49:01

TIMESTAMP

DATE TIME

31-Oct-13 23:49:01

TIMESTAMP

DATETTIME

31-Oct-13T23:49:01

TIMESTAMP

DATE:TIME

11/31/2013:234901

TIMESTAMP

DATE TIME ZONE

31-Oct-13 11:30:25 -0800

TIMESTAMP

DATE HH.MM.SS PM

31-Oct-13 11.30.25pm

TIMESTAMP

DATE HH:MM:SS PM

31-Oct-13 11:30:25pm

TIMESTAMP

1383262225

Usage Notes

For two-digit years, years 69-99 are assumed to be previous century (for example, 1969), and 0-68 are assumed to be current century (for example, 2016).
For four-digit years, negative years (BC) are not supported.
Hours are expressed in 24-hour format.
When time components are separated by colons, you can write them as one or two digits.
Months are case insensitive. You can spell them out or abbreviate to three characters.
For timestamps, decimal seconds are ignored. Time zone offsets are written as +/-HHMM.
For timestamps, a numeric string is converted to +/- seconds since January 1, 1970. Supported timestamps range from -30610224000 (January 1, 1000) through 29379456000 (December 31, 2900).
On output, dates are formatted as YYYY-MM-DD. Times are formatted as HH:MM:SS.
Linux EPOCH values range from -30610224000 (1/1/1000) through 185542587100800 (1/1/5885487). Complete range in years: +/-5,883,517 around epoch.

Statistical and Aggregate Functions

Both double-precision (standard) and single-precision floating point statistical functions are provided. Single-precision functions run faster on GPUs but might cause overflow errors.

Double-precision FP Function

Single-precision FP Function

Description

AVG(x)

Returns the average value of x

COUNT()

Returns the count of the number of rows returned

COUNT(DISTINCT x)

Returns the count of distinct values of x

APPROX_COUNT_DISTINCT(x, e)

Returns the approximate count of distinct values of x with defined expected error rate e, where e is an integer from 1 to 100. If no value is set for e, the approximate count is calculated using the system-widehll-precision-bits configuration parameter.

APPROX_MEDIAN(x)

Returns the approximate median of x. Two server configuration parameters affect memory usage:

<code></code><code></code>
<code></code>

Accuracy of APPROX_MEDIAN depends on the distribution of data; see .

APPROX_PERCENTILE(x,y)

Returns the approximate quantile of x, where y is the value between 0 and 1.

For example, y=0 returns MIN(x), y=1 returns MAX(x), and y=0.5 returns APPROX_MEDIAN(x).

MAX(x)

Returns the maximum value of x

MIN(x)

Returns the minimum value of x

SINGLE_VALUE

Returns the input value if there is only one distinct value in the input; otherwise, the query fails.

SUM(x)

Returns the sum of the values of x

SAMPLE(x)

Returns one sample value from aggregated column x. For example, the following query returns population grouped by city, along with one value from the state column for each group:

Note: This was previously LAST_SAMPLE, which is now deprecated.

CORRELATION(x, y)

CORRELATION_FLOAT(x, y)

Alias of CORR. Returns the coefficient of correlation of a set of number pairs.

CORR(x, y)

CORR_FLOAT(x, y)

Returns the coefficient of correlation of a set of number pairs.

COUNT_IF(conditional_expr)

Returns the number of rows satisfying the given condition_expr.

COVAR_POP(x, y)

COVAR_POP_FLOAT(x, y)

Returns the population covariance of a set of number pairs.

COVAR_SAMP(x, y)

COVAR_SAMP_FLOAT(x, y)

Returns the sample covariance of a set of number pairs.

STDDEV(x)

STDDEV_FLOAT(x)

Alias of STDDEV_SAMP. Returns sample standard deviation of the value.

STDDEV_POP(x)

STDDEV_POP_FLOAT(x)

Returns the population standard the standard deviation of the value.

STDDEV_SAMP(x)

STDDEV_SAMP_FLOAT(x)

Returns the sample standard deviation of the value.

SUM_IF(conditional_expr)

Returns the sum of all expression values satisfying the given condition_expr.

VARIANCE(x)

VARIANCE_FLOAT(x)

Alias of VAR_SAMP. Returns the sample variance of the value.

VAR_POP(x)

VAR_POP_FLOAT(x)

Returns the population variance sample variance of the value.

VAR_SAMP(x)

VAR_SAMP_FLOAT(x)

Returns the sample variance of the value.

Usage Notes

COUNT(DISTINCT x), especially when used in conjunction with GROUP BY, can require a very large amount of memory to keep track of all distinct values in large tables with large cardinalities. To avoid this large overhead, use APPROX_COUNT_DISTINCT.
APPROX_COUNT_DISTINCT(x, e) gives an approximate count of the value x, based on an expected error rate defined in e. The error rate is an integer value from 1 to 100. The lower the value of e, the higher the precision, and the higher the memory cost. Select a value for e based on the level of precision required. On large tables with large cardinalities, consider using APPROX_COUNT_DISTINCT when possible to preserve memory. When data cardinalities permit, OmniSci uses the precise implementation of COUNT(DISTINCT x) for APPROX_COUNT_DISTINCT. Set the default error rate using the -hll-precision-bits configuration parameter.
The accuracy of APPROX_MEDIAN (x) upon the distribution of data. For example:
- For 100,000,000 integers (1, 2, 3, ... 100M) in random order, APPROX_MEDIAN can provide a highly accurate answer 5+ significant digits.
- For 100,000,001 integers, where 50,000,000 have value of 0 and 50,000,001 have value of 1, APPROX_MEDIAN returns a value close to 0.5, even though the median is 1.
Currently, OmniSci does not support grouping by non-dictionary-encoded strings. However, with the SAMPLE aggregate function, you can select non-dictionary-encoded strings that are presumed to be unique in a group. For example:
```
SELECT user_name, SAMPLE(user_decription) FROM tweets GROUP BY user_name;
```
If the aggregated column (user_description in the example above) is not unique within a group, SAMPLE selects a value that might be nondeterministic because of the parallel nature of OmniSci query execution.

Miscellaneous Functions

Function

Description

SAMPLE_RATIO(x)

Returns a Boolean value, with the probability of True being returned for a row equal to the input argument. The input argument is a numeric value between 0.0 and 1.0. Negative input values (return False), input values greater than 1.0 returns True, and null input values return False.

The result of the function is deterministic per row; that is, all calls of the operator for a given row return the same result. The sample ratio is probabilistic, but is generally within a thousandth of a percentile of the actual range when the underlying dataset is millions of records or larger.

The following example filters approximately 50% of the rows from t and returns a count that is approximately half the number of rows in t:

SELECT COUNT(*) FROM t WHERE SAMPLE_RATIO(0.5)

User-Defined Functions

You can create your own C++ functions and use them in your SQL queries.

User-defined Functions (UDFs) require clang++ version 9. You can verify the version installed using the command clang++ --version.
UDFs currently allow any authenticated user to register and execute a runtime function. By default, runtime UDFs are globally disabled but can be enabled with the runtime flag enable-runtime-udf.

Create your function and save it in a .cpp file; for example, /var/lib/omnisci/udf_myFunction.cpp.
Add the UDF configuration flag to omnisci.conf. For example:
```
udf = "/var/lib/omnisci/udf_myFunction.cpp"
```
Use your function in a SQL query. For example:
```
SELECT udf_myFunction FROM myTable
```

Sample User-Defined Function

This function, udf_diff.cpp, returns the difference of two values from a table.

#include <cstdint>
#if defined(__CUDA_ARCH__) && defined(__CUDACC__) && defined(__clang__)
#define DEVICE __device__
#define NEVER_INLINE
#define ALWAYS_INLINE
#else
#define DEVICE
#define NEVER_INLINE __attribute__((noinline))
#define ALWAYS_INLINE __attribute__((always_inline))
#endif
#define EXTENSION_NOINLINE extern "C" NEVER_INLINE DEVICE
EXTENSION_NOINLINE int32_t udf_diff(const int32_t x, const int32_t y) { return x - y; }

Code Commentary

Include the standard integer library, which supports the following datatypes:

bool
int8_t (cstdint), char
int16_t (cstdint), short
int32_t (cstdint), int
int64_t (cstdint), size_t
float
double
void

#include <cstdint>

The next four lines are boilerplate code that allows OmniSci to determine whether the server is running with GPUs. OmniSci chooses whether it should compile the function inline to achieve the best possible performance.

#include <cstdint>
#if defined(__CUDA_ARCH__) && defined(__CUDACC__) && defined(__clang__)
#define DEVICE __device__
#define NEVER_INLINE
#define ALWAYS_INLINE
#else
#define DEVICE
#define NEVER_INLINE __attribute__((noinline))
#define ALWAYS_INLINE __attribute__((always_inline))
#endif
#define EXTENSION_NOINLINE extern "C" NEVER_INLINE DEVICE

The next line is the actual user-defined function, which returns the difference between INTEGER values x and y.

EXTENSION_NOINLINE int32_t udf_diff(const int32_t x, const int32_t y) { return x - y; }

To run the udf_diff function, add this line to your /var/lib/omnisci/omnisci.conf file (in this example, the .cpp file is stored at /var/lib/omnisci/udf_diff.cpp):

udf = "/var/lib/omnisci/udf_diff.cpp"

Restart the OmniSci server.

Use your command from an OmniSci SQL client to query, for example, a table named myTable that contains the INTEGER columns myInt1 and myInt2.

SELECT udf_diff(myInt1, myInt2) FROM myTable LIMIT 1;

OmniSci returns the difference as an INTEGER value.

Roles and Privileges

HEAVY.AI supports data security using a set of database object access privileges granted to users or roles.

Users and Privileges

When you create a database, the admin superuser is created by default. The admin superuser is granted all privileges on all database objects. Superusers can create new users that, by default, have no database object privileges.

Superusers can grant users selective access privileges on multiple database objects using two mechanisms: role-based privileges and user-based privileges.

Role-based Privileges

Grant roles access privileges on database objects.
Grant roles to users.
Grant roles to other roles.

User-based Privileges

When a user has privilege requirements that differ from role privileges, you can grant privileges directly to the user. These mechanisms provide data security for many users and classes of users to access the database.

You have the following options for granting privileges:

Each object privilege can be granted to one or many roles, or to one or many users.
A role and/or user can be granted privileges on one or many objects.
A role can be granted to one or many users or other roles.
A user can be granted one or many roles.

This supports the following many-to-many relationships:

Objects and roles
Objects and users
Roles and users

These relationships provide flexibility and convenience when granting/revoking privileges to and from users.

Granting object privileges to roles and users, and granting roles to users, has a cumulative effect. The result of several grant commands is a combination of all individual grant commands. This applies to all database object types and to privileges inherited by objects. For example, object privileges granted to the object of database type are propagated to all table-type objects of that database object.

Who Can Grant Object Privileges?

Only a superuser or an object owner can grant privileges for on object.

A superuser has all privileges on all database objects.
A non-superuser user has only those privileges on a database object that are granted by a superuser.
A non-superuser user has ALL privileges on a table created by that user.

Roles and Privileges Persistence

Roles can be created and dropped at any time.
Object privileges and roles can be granted or revoked at any time, and the action takes effect immediately.
Privilege state is persistent and restored if the HEAVY.AI session is interrupted.

Database Object Privileges

There are five database object types, each with its own privileges.

ACCESS - Connect to the database. The ACCESS privilege is a prerequisite for all other privileges at the database level. Without the ACCESS privilege, a user or role cannot perform tasks on any other database objects.

ALL - Allow all privileges on this database except issuing grants and dropping the database.

SELECT, INSERT, TRUNCATE, UPDATE, DELETE - Allow these operations on any table in the database.

ALTER SERVER - Alter servers in the current database.

CREATE SERVER - Create servers in the current database.

CREATE TABLE - Create a table in the current database. (Also CREATE.)

CREATE VIEW - Create a view for the current database.

CREATE DASHBOARD - Create a dashboard for the current database.

DELETE DASHBOARD - Delete a dashboard for this database.

DROP SERVER - Drop servers from the current database.

DROP - Drop a table from the database.

DROP VIEW - Drop a view for this database.

EDIT DASHBOARD - Edit a dashboard for this database.

SELECT VIEW - Select a view for this database.

SERVER USAGE - Use servers (through foreign tables) in the current database.

VIEW DASHBOARD - View a dashboard for this database.

VIEW SQL EDITOR - Access the SQL Editor in Immerse for this database.

Users with SELECT privilege on views do not require SELECT privilege on underlying tables referenced by the view to retrieve the data queried by the view. View queries work without error whether or not users have direct access to referenced tables. This also applies to views that query tables in other databases.

To create views, users must have SELECT privilege on queried tables in addition to the CREATE VIEW privilege.

SELECT, INSERT, TRUNCATE, UPDATE, DELETE - Allow these SQL statements on this table.

DROP - Drop this table.

To create views, users must have SELECT privilege on queried tables in addition to the CREATE VIEW privilege.

SELECT - Select from this view. Users do not need privileges on objects referenced by this view.

DROP - Drop this view.

To create views, users must have SELECT privilege on queried tables in addition to the CREATE VIEW privilege.

VIEW - View this dashboard.

EDIT - Edit this dashboard.

DELETE - Delete this dashboard.

DROP - Drop this server from the current database.

ALTER - Alter this server in the current database.

USAGE - Use this server (through foreign tables) in the current database.

Privileges granted on a database-type object are inherited by all tables of that database.

Privilege Commands

SQL

Description

Create role.

Drop role.

Grant role to user or to another role.

Revoke role from user or from another role.

Grant role privilege(s) on a database table to a role or user.

Revoke role privilege(s) on database table from a role or user.

Grant role privilege(s) on a database view to a role or user.

Revoke role privilege(s) on database view from a role or user.

Grant role privilege(s) on database to a role or user.

Revoke role privilege(s) on database from a role or user.

Grant role privilege(s) on server to a role or user.

Revoke role privilege(s) on server from a role or user.

Grant role privilege(s) on dashboard to a role or user.

Revoke role privilege(s) on dashboard from a role or user.

Example

The following example shows a valid sequence for granting access privileges to non-superuser user1 by granting a role to user1 and by directly granting a privilege. This example presumes that table1 and user1 already exist, and that user1 has ACCESS privileges on the database where table1 exists.

Create the r_select role.
Grant the SELECT privilege on table1 to the r_select role. Any user granted the r_select role gains the SELECT privilege.
```
GRANT SELECT ON TABLE table1 TO r_select;
```
Grant the r_select role to user1, giving user1 the SELECT privilege on table1.
Directly grant user1 the INSERT privilege on table1.
```
GRANT INSERT ON TABLE table1 TO user1;
```

See Example Roles and Privileges Session for a more complete example.

CREATE ROLE

Create a role. Roles are granted to users for role-based database object access.

This clause requires superuser privilege and <roleName> must not exist.

Synopsis

CREATE ROLE <roleName>;

Parameters

<roleName>

Name of the role to create.

Example

Create a payroll department role called payrollDept.

CREATE ROLE payrollDept;

DROP ROLE

Remove a role.

This clause requires superuser privilege and <roleName> must exist.

Synopsis

DROP ROLE [IF EXISTS] <roleName>;

Parameters

<roleName>

Name of the role to drop.

Example

Remove the payrollDept role.

DROP ROLE payrollDept;

GRANT

Grant role privileges to users and to other roles.

The ACCESS privilege is a prerequisite for all other privileges at the database level. Without the ACCESS privilege, a user or role cannot perform tasks on any other database objects.

This clause requires superuser privilege. The specified <roleNames> and <userNames> must exist.

Synopsis

GRANT <roleNames> TO <userNames>, <roleNames>;

Parameters

<roleNames>

Names of roles to grant to users and other roles. Use commas to separate multiple role names.

<userNames>

Names of users. Use commas to separate multiple user names.

Examples

Assign payrollDept role privileges to user dennis.

GRANT payrollDept TO dennis;

Grant payrollDept and accountsPayableDept role privileges to users dennis and mike and role hrDept.

GRANT payrollDept, accountsPayableDept TO dennis, mike, hrDept;

REVOKE

Remove role privilege from users or from other roles. This removes database object access privileges granted with the role.

This clause requires superuser privilege. The specified <roleNames> and <userNames> must exist.

Synopsis

REVOKE <roleNames> FROM <userNames>, <roleNames>;

Parameters

<roleNames>

Names of roles to remove from users and other roles. Use commas to separate multiple role names.

<userName>

Names of the users. Use commas to separate multiple user names.

Example

Remove payrollDept role privileges from user dennis.

REVOKE payrollDept FROM dennis;

Revoke payrollDept and accountsPayableDept role privileges from users dennis and fred and role hrDept.

REVOKE payrollDept, accountsPayableDept FROM dennis, fred, hrDept;

GRANT ON TABLE

Define the privilege(s) a role or user has on the specified table. You can specify any combination of the INSERT, SELECT, DELETE, UPDATE, DROP, or TRUNCATE privilege or specify all privileges.

The ACCESS privilege is a prerequisite for all other privileges at the database level. Without the ACCESS privilege, a user or role cannot perform tasks on any other database objects.

This clause requires superuser privilege, or <tableName> must have been created by the user invoking this command. The specified <tableName> and users or roles defined in <entityList> must exist.

Synopsis

GRANT <privilegeList> ON TABLE <tableName> TO <entityList>;

Parameters

<privilegeList>

Parameter Value

Descriptions

ALL

Grant all possible access privileges on <tableName> to <entityList>.

ALTER TABLE

Grant ALTER TABLE privilege on <tableName> to <entityList>.

DELETE

Grant DELETE privilege on <tableName> to <entityList>.

DROP

Grant DROP privilege on <tableName> to <entityList>.

INSERT

Grant INSERT privilege on <tableName> to <entityList>.

SELECT

Grant SELECT privilege on <tableName> to <entityList>.

TRUNCATE

Grant TRUNCATE privilege on <tableName> to <entityList>.

UPDATE

Grant UPDATE privilege on <tableName> to <entityList>.

<tableName>

Name of the database table.

<entityList>

Name of entity or entities to be granted the privilege(s).

Parameter Value

Descriptions

role

Name of role.

user

Name of user.

Examples

Permit all privileges on the employees table for the payrollDept role.

GRANT ALL ON TABLE employees TO payrollDept;

Permit SELECT-only privilege on the employees table for user chris.

GRANT SELECT ON TABLE employees TO chris;

Permit INSERT-only privilege on the employees table for the hrdept and accountsPayableDept roles.

GRANT INSERT ON TABLE employees TO hrDept, accountsPayableDept;

Permit INSERT, SELECT, and TRUNCATE privileges on the employees table for the role hrDept and for users dennis and mike.

GRANT INSERT, SELECT, TRUNCATE ON TABLE employees TO hrDept, dennis, mike;

REVOKE ON TABLE

Remove the privilege(s) a role or user has on the specified table. You can remove any combination of the INSERT, SELECT, DELETE, UPDATE, or TRUNCATE privileges, or remove all privileges.

This clause requires superuser privilege or <tableName> must have been created by the user invoking this command. The specified <tableName> and users or roles in <entityList> must exist.

Synopsis

REVOKE <privilegeList> ON TABLE <tableName> FROM <entityList>;

Parameters

<privilegeList>

Parameter Value

Descriptions

ALL

Remove all access privilege for <entityList> on <tableName>.

ALTER TABLE

Remove ALTER TABLE privilege for <entityList> on <tableName>.

DELETE

Remove DELETE privilege for <entityList> on <tableName>.

DROP

Remove DROP privilege for <entityList> on <tableName>.

INSERT

Remove INSERT privilege for <entityList> on <tableName>.

SELECT

Remove SELECT privilege for <entityList> on <tableName>.

TRUNCATE

Remove TRUNCATE privilege for <entityList> on <tableName>.

UPDATE

Remove UPDATE privilege for <entityList> on <tableName>.

<tableName>

Name of the database table.

<entityList>

Name of entities to be denied the privilege(s).

Parameter Value

Descriptions

role

Name of role.

user

Name of user.

Example

Prohibit SELECT and INSERT operations on the employees table for the nonemployee role.

REVOKE ALL ON TABLE employees FROM nonemployee;

Prohibit SELECT operations on the directors table for the employee role.

REVOKE SELECT ON TABLE directors FROM employee;

Prohibit INSERT operations on the directors table for role employee and user laura.

REVOKE INSERT ON TABLE directors FROM employee, laura;

Prohibit INSERT, SELECT, and TRUNCATE privileges on the employees table for the role nonemployee and for users dennis and mike.

REVOKE INSERT, SELECT, TRUNCATE ON TABLE employees FROM nonemployee, dennis, mike;

GRANT ON VIEW

Define the privileges a role or user has on the specified view. You can specify any combination of the SELECT, INSERT, or DROP privileges, or specify all privileges.

This clause requires superuser privileges, or <viewName> must have been created by the user invoking this command. The specified <viewName> and users or roles in <entityList> must exist.

Synopsis

GRANT <privilegeList> ON VIEW <viewName> TO <entityList>;

Parameters

<privilegeList>

Parameter Value

Descriptions

ALL

Grant all possible access privileges on <viewName> to <entityList>.

DROP

Grant DROP privilege on <viewName> to <entityList>.

INSERT

Grant INSERT privilege on <viewName> to <entityList>.

SELECT

Grant SELECT privilege on <viewName> to <entityList>.

<viewName>

Name of the database view.

<entityList>

Name of entities to be granted the privileges.

Parameter Value

Descriptions

role

Name of role.

user

Name of user.

Examples

Permit SELECT, INSERT, and DROP privileges on the employees view for the payrollDept role.

GRANT ALL ON VIEW employees TO payrollDept;

Permit SELECT-only privilege on the employees view for the employee role and user venkat.

GRANT SELECT ON VIEW employees TO employee, venkat;

Permit INSERT and DROP privileges on the employees view for the hrDept and acctPayableDept roles and users simon and dmitri.

GRANT INSERT, DROP ON VIEW employees TO hrDept, acctPayableDept, simon, dmitri;

REVOKE ON VIEW

Remove the privileges a role or user has on the specified view. You can remove any combination of the INSERT, DROP, or SELECT privileges, or remove all privileges.

This clause requires superuser privilege, or <viewName> must have been created by the user invoking this command. The specified <viewName> and users or roles in <entityList> must exist.

Synopsis

REVOKE <privilegeList> ON VIEW <viewName> FROM <entityList>;

Parameters

<privilegeList>

Parameter Value

Descriptions

ALL

Remove all access privilege for <entityList> on <viewName>.

DROP

Remove DROP privilege for <entityList> on <viewName>.

INSERT

Remove INSERT privilege for <entityList> on <viewName>.

SELECT

Remove SELECT privilege for <entityList> on <viewName>.

<viewName>

Name of the database view.

<entityList>

Name of entity to be denied the privilege(s).

Parameter Value

Descriptions

role

Name of role.

user

Name of user.

Example

Prohibit SELECT, DROP, and INSERT operations on the employees view for the nonemployee role.

REVOKE ALL ON VIEW employees FROM nonemployee;

Prohibit SELECT operations on the directors view for the employee role.

REVOKE SELECT ON VIEW directors FROM employee;

Prohibit INSERT and DROP operations on the directors view for the employee and manager role and for users ashish and lindsey.

REVOKE INSERT, DROP ON VIEW directors FROM employee, manager, ashish, lindsey;

GRANT ON DATABASE

Define the valid privileges a role or user has on the specified database. You can specify any combination of privileges, or specify all privileges.

The ACCESS privilege is a prerequisite for all other privileges at the database level. Without the ACCESS privilege, a user or role cannot perform tasks on any other database objects.

This clause requires superuser privileges.

Synopsis

GRANT <privilegeList> ON DATABASE <dbName> TO <entityList>;

Parameters

<privilegeList>

Parameter Value

Descriptions

ACCESS

Grant ACCESS (connection) privilege on <dbName> to <entityList>.

ALL

Grant all possible access privileges on <dbName> to <entityList>.

ALTER TABLE

Grant ALTER TABLE privilege on <dbName> to <entityList>.

ALTER SERVER

Grant ALTER SERVER privilege on <dbName> to <entityList>.

CREATE SERVER

Grant CREATE SERVER privilege on <dbName> to <entityList>;

CREATE TABLE

Grant CREATE TABLE privilege on <dbName> to <entityList>. Previously CREATE.

CREATE VIEW

Grant CREATE VIEW privilege on <dbName> to <entityList>.

CREATE DASHBOARD

Grant CREATE DASHBOARD privilege on <dbName> to <entityList>.

CREATE

Grant CREATE privilege on <dbName> to <entityList>.

DELETE

Grant DELETE privilege on <dbName> to <entityList>.

DELETE DASHBOARD

Grant DELETE DASHBOARD privilege on <dbName> to <entityList>.

DROP

Grant DROP privilege on <dbName> to <entityList>.

DROP SERVER

Grant DROP privilege on <dbName> to <entityList>.

DROP VIEW

Grant DROP VIEW privilege on <dbName> to <entityList>.

EDIT DASHBOARD

Grant EDIT DASHBOARD privilege on <dbName> to <entityList>.

INSERT

Grant INSERT privilege on <dbName> to <entityList>.

SELECT

Grant SELECT privilege on <dbName> to <entityList>.

SELECT VIEW

Grant SELECT VIEW privilege on <dbName> to <entityList>.

SERVER USAGE

Grant SERVER USAGE privilege on <dbName> to <entityList>.

TRUNCATE

Grant TRUNCATE privilege on <dbName> to <entityList>.

UPDATE

Grant UPDATE privilege on <dbName> to <entityList>.

VIEW DASHBOARD

Grant VIEW DASHBOARD privilege on <dbName> to <entityList>.

VIEW SQL EDITOR

Grant VIEW SQL EDITOR privilege in Immerse on <dbName> to <entityList>.

<dbName>

Name of the database, which must exist, created by CREATE DATABASE.

<entityList>

Name of the entity to be granted the privilege.

Parameter Value

Descriptions

role

Name of role, which must exist.

user

Name of user, which must exist. See .

Examples

Permit all operations on the companydb database for the payrollDept role and user david.

GRANT ALL ON DATABASE companydb TO payrollDept, david;

Permit SELECT-only operations on the companydb database for the employee role.

GRANT ACCESS, SELECT ON DATABASE companydb TO employee;

Permit INSERT, UPDATE, and DROP operations on the companydb database for the hrdept and manager role and for users irene and stephen.

GRANT ACCESS, INSERT, UPDATE, DROP ON DATABASE companydb TO hrdept, manager, irene, stephen;

REVOKE ON DATABASE

Remove the operations a role or user can perform on the specified database. You can specify privileges individually or specify all privileges.

This clause requires superuser privilege or the user must own the database object. The specified <dbName> and roles or users in <entityList> must exist.

Synopsis

REVOKE <privilegeList> ON DATABASE <dbName> FROM <entityList>;

Parameters

<privilegeList>

Parameter Value

Descriptions

ACCESS

Remove ACCESS (connection) privilege on <dbName> from <entityList>.

ALL

Remove all possible privileges on <dbName> from <entityList>.

ALTER SERVER

Remove ALTER SERVER privilege on <dbName> from <entityList>

ALTER TABLE

Remove ALTER TABLE privilege on <dbName> from <entityList>.

CREATE TABLE

Remove CREATE TABLE privilege on <dbName> from <entityList>. Previously CREATE.

CREATE VIEW

Remove CREATE VIEW privilege on <dbName> from <entityList>.

CREATE DASHBOARD

Remove CREATE DASHBOARD privilege on <dbName> from <entityList>.

CREATE

Remove CREATE privilege on <dbName> from <entityList>.

CREATE SERVER

Remove CREATE SERVER privilege on <dbName> from <entityList>.

DELETE

Remove DELETE privilege on <dbName> from <entityList>.

DELETE DASHBOARD

Remove DELETE DASHBOARD privilege on <dbName> from <entityList>.

DROP

Remove DROP privilege on <dbName> from <entityList>.

DROP SERVER

Remove DROP SERVER privilege on <dbName> from <entityList>.

DROP VIEW

Remove DROP VIEW privilege on <dbName> from <entityList>.

EDIT DASHBOARD

Remove EDIT DASHBOARD privilege on <dbName> from <entityList>.

INSERT

Remove INSERT privilege on <dbName> from <entityList>.

SELECT

Remove SELECT privilege on <dbName> from <entityList>.

SELECT VIEW

Remove SELECT VIEW privilege on <dbName> from <entityList>.

SERVER USAGE

Remove SERVER USAGE privilege on <dbName> from <entityList>.

TRUNCATE

Remove TRUNCATE privilege on <dbName> from <entityList>.

UPDATE

Remove UPDATE privilege on <dbName> from <entityList>.

VIEW DASHBOARD

Remove VIEW DASHBOARD privilege on <dbName> from <entityList>.

VIEW SQL EDITOR

Remove VIEW SQL EDITOR privilege in Immerse on <dbName> from <entityList>.

<dbName>

Name of the database.

<entityList>

Parameter Value

Descriptions

role

Name of role.

user

Name of user.

Example

Prohibit all operations on the employees database for the nonemployee role.

REVOKE ALL ON DATABASE employees FROM nonemployee;

Prohibit SELECT operations on the directors database for the employee role and for user monica.

REVOKE SELECT ON DATABASE directors FROM employee;

Prohibit INSERT, DROP, CREATE, and DELETE operations on the directors database for employee role and for users max and alex.

REVOKE INSERT, DROP, CREATE, DELETE ON DATABASE directors FROM employee;

GRANT ON SERVER

Define the valid privileges a role or user has for working with servers. You can specify any combination of privileges or specify all privileges.

This clause requires superuser privileges, or <serverName> must have been created by the user invoking the command.

Synopsis

GRANT <privilegeList> ON SERVER <serverName> TO <entityList>;

Parameters

<privilegeList>

Parameter Value

Descriptions

DROP

Grant DROP privileges on <serverName> on current database to <entityList>.

ALTER

Grant ALTER privilege on <serverName> on current database to <entityList>.

USAGE

Grant USAGE privilege (through foreign tables) on <serverName> on current database to <entityList>.

<serverName>

Name of the server, which must exist on the current database, created by CREATE SERVER ON DATABASE.

<entityList>

Parameter Value

Descriptions

role

Name of role, which must exist.

user

Name of user, which must exist. See .

Examples

Grant DROP privilege on server parquet_s3_server to user fred:

GRANT DROP ON SERVER parquet_s3_server TO fred

Grant ALTER privilege on server parquet_s3_server to role payrollDept:

GRANT ALTER ON SERVER parquet_s3_server TO payrollDept;

Grant USAGE and ALTER privileges on server parquet_s3_server to role payrollDept and user jamie:

GRANT USAGE, ALTER ON SERVER parquet_s3_server TO payrollDept, jamie;

REVOKE ON SERVER

Remove privileges a role or user has for working with servers. You can specify any combination of privileges or specify all privileges.

This clause requires superuser privileges, or <serverName> must have been created by the user invoking the command.

Synopsis

REVOKE <privilegeList> ON SERVER <serverName> FROM <entityList>;

Parameters

<privilegeList>

Parameter Value

Descriptions

DROP

Remove DROP privileges on <serverName> on current database for <entityList>.

ALTER

Remove ALTER privilege on <serverName> on current database for <entityList>.

USAGE

Remove USAGE privilege (through foreign tables) on <serverName> on current database for <entityList>.

<serverName>

Name of the server, which must exist on the current database, created by CREATE SERVER ON DATABASE.

<entityList>

Parameter Value

Descriptions

role

Name of role, which must exist.

user

Name of user, which must exist. See .

Examples

Revoke DROP privilege on server parquet_s3_server for user inga:

REVOKE DROP ON SERVER parquet_s3_server FROM inga

Grant ALTER privilege on server parquet_s3_server for role payrollDept:

REVOKE ALTER ON SERVER parquet_s3_server FROM payrollDept;

Grant USAGE and ALTER privileges on server parquet_s3_server for role payrollDept and user marvin:

REVOKE USAGE, ALTER ON SERVER parquet_s3_server FROM payrollDept, marvin;

GRANT ON DASHBOARD

Define the valid privileges a role or user has for working with dashboards. You can specify any combination of privileges or specify all privileges.

This clause requires superuser privileges.

Synopsis

GRANT <privilegeList> [ON DASHBOARD <dashboardId>] TO <entityList>;

Parameters

<privilegeList>

Parameter Value

Descriptions

ALL

Grant all possible access privileges on <dashboardId> to <entityList>.

CREATE

Grant CREATE privilege to <entityList>.

DELETE

Grant DELETE privilege on <dashboardId> to <entityList>.

EDIT

Grant EDIT privilege on <dashboardId> to <entityList>.

VIEW

Grant VIEW privilege on <dashboardId> to <entityList>.

<dashboardId>

ID of the dashboard, which must exist, created by CREATE DASHBOARD. To show a list of all dashboards and IDs in heavysql, run the \dash command when logged in as superuser.

<entityList>

Parameter Value

Descriptions

role

Name of role, which must exist.

user

Name of user, which must exist. See .

Examples

Permit all privileges on the dashboard ID 740 for the payrollDept role.

GRANT ALL ON DASHBOARD 740 TO payrollDept;

Permit VIEW-only privilege on dashboard 730 for the hrDept role and user dennis.

GRANT VIEW ON DASHBOARD 730 TO hrDept, dennis;

Permit EDIT and DELETE privileges on dashboard 740 for the hrDept and accountsPayableDept roles and for user pavan.

GRANT EDIT, DELETE ON DASHBOARD 740 TO hrdept, accountsPayableDept, pavan;

REVOKE ON DASHBOARD

Remove privileges a role or user has for working with dashboards. You can specify any combination of privileges, or all privileges.

This clause requires superuser privileges.

Synopsis

REVOKE <privilegeList> [ON DASHBOARD <dashboardId>] FROM <entityList>;

Parameters

<privilegeList>

Parameter Value

Descriptions

ALL

Revoke all possible access privileges on <dashboardId> for <entityList>.

CREATE

Revoke CREATE privilege for <entityList>.

DELETE

Revoke DELETE privilege on <dashboardId> for <entityList>.

EDIT

Revoke EDIT privilege on <dashboardId> for <entityList>.

VIEW

Revoke VIEW privilege on <dashboardId> for <entityList>.

<dashboardId>

ID of the dashboard, which must exist, created by CREATE DASHBOARD.

<entityList>

Parameter Value

Descriptions

role

Name of role, which must exist.

user

Name of user, which must exist. See .

Revoke DELETE privileges on dashboard 740 for the payrollDept role.

REVOKE DELETE ON DASHBOARD 740 FROM payrollDept;

Revoke all privileges on dashboard 730 for hrDept role and users dennis and mike.

REVOKE ALL ON DASHBOARD 730 FROM hrDept, dennis, mike;

Revoke EDIT and DELETE of dashboard 740 for the hrDept and accountsPayableDept roles and for users dante and jonathan.

REVOKE EDIT, DELETE ON DASHBOARD 740 FROM hrdept, accountsPayableDept, dante, jonathan;

Common Privilege Levels for Non-Superusers

The following privilege levels are typically recommended for non-superusers in Immerse. Privileges assigned for users in your organization may vary depending on access requirements.

Privilege

Command Syntax to Grant Privilege

Access a database

GRANT ACCESS ON DATABASE <dbName> TO <entityList>;

Create a table

GRANT CREATE TABLE ON DATABASE <dbName> TO <entityList>;

Select a table

GRANT SELECT ON TABLE <tableName> TO <entityList>;

View a dashboard

GRANT VIEW ON DASHBOARD <dashboardId> TO <entityList>;

Create a dashboard

GRANT CREATE DASHBOARD ON DATABASE <dbName> TO <entityList>;

Edit a dashboard

GRANT EDIT ON DASHBOARD TO ;

Delete a dashboard

GRANT DELETE DASHBOARD ON DATABASE <dbName> TO <entityList>;

Example: Roles and Privileges

These examples assume that tables table1 through table4 are created as needed:

create table table1 (id smallint);
create table table2 (id smallint);
create table table3 (id smallint);
create table table4 (id smallint);

The following examples show how to work with users, roles, tables, and dashboards.

Create User Accounts

create user marketingDeptEmployee1 (password = 'md1');
create user marketingDeptEmployee2 (password = 'md2');
create user marketingDeptManagerEmployee3 (password = 'md3');

create user salesDeptEmployee1 (password = 'sd1');
create user salesDeptEmployee2 (password = 'sd2');
create user salesDeptEmployee3 (password = 'sd3');
create user salesDeptEmployee4 (password = 'sd4');
create user salesDeptManagerEmployee5 (password = 'sd5');

Grant Access to Users on Database

grant access on database heavyai to marketingDeptEmployee1, marketingDeptEmployee2, marketingDeptManagerEmployee3;
grant access on database heavyai to salesDeptEmployee1, salesDeptEmployee2, salesDeptEmployee3, salesDeptEmployee4, salesDeptManagerEmployee5;

Create Marketing Department Roles

create role marketingDeptRole1;
create role marketingDeptRole2;

Grant Marketing Department Roles to Marketing Department Employees

grant marketingDeptRole1 to marketingDeptEmployee1, marketingDeptManagerEmployee3;
grant marketingDeptRole2 to marketingDeptEmployee2, marketingDeptManagerEmployee3;

Grant Privelege to Marketing Department Roles

grant select on table table1 to marketingDeptRole1;
grant select on table table2 to marketingDeptRole1;
grant select on table table2 to marketingDeptRole2;

Create Sales Department Roles

create role salesDeptRole1;
create role salesDeptRole2;
create role salesDeptRole3;

Grant Sales Department Roles to Sales Department Employees

grant salesDeptRole1 to salesDeptEmployee1;
grant salesDeptRole2 to salesDeptEmployee2, salesDeptEmployee3;
grant salesDeptRole3 to salesDeptEmployee4;

Grant Privilege to Sales Department Roles

grant select on table table1 to salesDeptRole1;
grant select on table table3 to salesDeptRole1, salesDeptRole2;
grant select on table table4 to salesDeptRole3;

Grant All Sales Roles to Sales Department Manager and Marketing Department Manager

grant salesDeptRole1, salesDeptRole2, salesDeptRole3 to salesDeptManagerEmployee5, marketingDeptManagerEmployee3;

Grant View on Dashboards

Use the \dash command to list all dashboards and their unique IDs in HEAVY.AI:

heavysql> \dash 
Dashboard ID | Dashboard Name    | Owner 
1            | Marketing_Summary | heavyai

Here, the Marketing_Summary dashboard uses table2 as a data source. The role marketingDeptRole2 has select privileges on that table. Grant view access on the Marketing_Summary dashboard to marketingDeptRole2:

grant view on dashboard 1 to marketingDeptRole2;

Relationships Between Users, Roles, and Tables

The following table shows the roles and privileges for each user created in the previous example.

User

Roles Granted

Table Privileges

salesDeptEmployee1

salesDeptRole1

SELECT on Tables 1, 3

salesDeptEmployee2

salesDeptRole2

SELECT on Table 3

salesDeptEmployee3

salesDeptRole2

SELECT on Table 3

salesDeptEmployee4

salesDeptRole3

SELECT on Table 4

salesDeptManagerEmployee5

salesDeptRole1, salesDeptRole2, salesDeptRole3

SELECT on Tables 1, 3, 4

marketingDeptEmployee1

marketingDeptRole1

SELECT on Tables 1, 2

marketingDeptEmployee2

marketingDeptRole2

SELECT on Table 2

marketingDeptManagerEmployee3

marketingDeptRole1, marketingDeptRole2, salesDeptRole1, salesDeptRole2, salesDeptRole3

SELECT on Tables 1, 2, 3, 4

Commands to Report Roles and Privileges

Use the following commands to list current roles and assigned privileges. If you have superuser access, you can see privileges for all users. Otherwise, you can see only those roles and privileges for which you have access.

Results for users, roles, privileges, and object privileges are returned in creation order.

\dash

Lists all dashboards and dashboard IDs in HEAVY.AI. Requires superuser privileges. Dashboard privileges are assigned by dashboard ID because dashboard names may not be unique.

Example

heavysql> \dash database heavyai 
Dashboard ID | Dashboard Name    | Owner 
1            | Marketing_Summary | heavyai

heavysql> \dash database heavyai Dashboard ID | Dashboard Name | Owner 1 | Marketing_Summary | heavyai

\object_privileges objectType `_objectName`_

Reports all privileges granted to the specified object for all roles and users. If the specified objectName does not exist, no results are reported. Used for databases and tables only.

Example

heavysql> \object_privileges database heavyai 
marketingDeptEmployee1 privileges: login-access 
marketingDeptEmployee2 privileges: login-access marketingDeptManagerEmployee3 privileges: login-access
salesDeptEmployee1 privileges: login-access 
salesDeptEmployee2 privileges: login-access 
salesDeptEmployee3 privileges: login-access 
salesDeptEmployee4 privileges: login-access 
salesDeptManagerEmployee5 privileges: login-access

\privileges roleName | userName

Reports all object privileges granted to the specified role or user. The roleName or userName specified must exist.

Example

heavysql> \privileges salesDeptRole1 
table1 (table): select 
table3 (table): select
heavysql> \privileges salesDeptManagerEmployee5 
mapd (database): login-access

heavysql> \privileges marketingdeptrole2 
table2 (table): select
Marketing_Summary (dashboard): view

\role_list userName

Reports all roles granted to the given user. The userName specified must exist.

Example

heavysql> \role_list salesDeptManagerEmployee5
salesDeptRole3 
salesDeptRole2 
salesDeptRole1

\roles

Reports all roles.

Example

heavysql> \roles
marketingDeptRole1 
marketingDeptRole2 
salesDeptRole1 
salesDeptRole2 
salesDeptRole3

\u

Lists all users.

Example

heavysql> \u 
heavyai 
marketingDeptEmployee1 
marketingDeptEmployee2 
salesDeptEmployee1 
salesDeptEmployee2 
salesDeptEmployee3 
salesDeptEmployee4 
salesDeptManagerEmployee5 
marketingDeptManagerEmployee3

Example: Data Security

The following example demonstrates field-level security using two views:

view_users_limited, in which users only see three of seven fields: userid, First_Name, and Department.
view_users_full, users see all seven fields.

Source Data

Create Views

create view view_users_limited as select userid, First_Name, Department from users;
create view view_users_full as select userid, First_Name, Department, Address, City, State, Zip from users;

Create Users

create user readonly1 (password = 'rr1');
create user readonly2 (password = 'rr2');

Grant Access to Users on Database

grant access on database heavyai to readonly1, readonly2;

Create Roles

create role limited_readonly;
create role full_readonly;

Grant Roles to Users

grant limited_readonly to readonly1;
grant full_readonly to readonly2;

Grant Privilege to View Roles

grant select on view view_users_limited to limited_readonly;
grant select on view view_users_full TO full_readonly;

Verify Views

User readonly1 sees no tables, only the specific view granted, and only the three specific columns returned in the view:

heavysql> \t
heavysql> \v
view_users_limited
heavysql> select * from view_users_limited;
userid|First_Name|Department
1|Todd|C Suite
2|Don|Sales
3|Mike|Customer Success

User readonly2 sees no tables, only the specific view granted, and all seven columns returned in the view:

heavysql> \t
heavysql> \v
view_users_full
heavysql> select * from view_users_full;
userid|First_Name|Department|Address|City|State|Zip
1|Todd|C Suite|1 Front Street|San Francisco|CA|94111
2|Don|Sales|1 5th Avenue|New York|NY|10001
3|Mike|Customer Succes|100 Main Street|Reston|VA|20191

Loading Data with SQL

This topic describes several ways to load data to HEAVY.AI using SQL commands.

If there is a potential for duplicate entries, and you want to avoid loading duplicate rows, see How can I avoid creating duplicate rows? on the Troubleshooting page.
If a source file uses a reserved word, HEAVY.AI automatically adds an underscore at the end of the reserved word. For example, year is converted to year_.

COPY FROM

CSV/TSV Import

Use the following syntax for CSV and TSV files:

COPY <table> FROM '<file pattern>' [WITH (<property> = value, ...)];

<file pattern> must be local on the server. The file pattern can contain wildcards if you want to load multiple files. In addition to CSV, TSV, and TXT files, you can import compressed files in TAR, ZIP, 7-ZIP, RAR, GZIP, BZIP2, or TGZ format.

COPY FROM appends data from the source into the target table. It does not truncate the table or overwrite existing data.

You can import client-side files (\copy command in heavysql) but it is significantly slower. For large files, HEAVY.AI recommends that you first scp the file to the server, and then issue the COPY command.

HEAVYAI supports Latin-1 ASCII format and UTF-8. If you want to load data with another encoding (for example, UTF-16), convert the data to UTF-8 before loading it to HEAVY.AI.

Available properties in the optional WITH clause are described in the following table.

Parameter

Description

Default Value

array_delimiter

A single-character string for the delimiter between input values contained within an array.

, (comma)

array_marker

A two-character string consisting of the start and end characters surrounding an array.

{ }(curly brackets). For example, data to be inserted into a table with a string array in the second column (for example, BOOLEAN, STRING[], INTEGER) can be written as true,{value1,value2,value3},3

buffer_size

Size of the input file buffer, in bytes.

8388608

delimiter

A single-character string for the delimiter between input fields; most commonly:

, for CSV files
\t for tab-delimited files

Other delimiters include | ,~, ^, and;.

Note: OmniSci does not use file extensions to determine the delimiter.

',' (CSV file)

escape

A single-character string for escaping quotes.

'"' (double quote)

geo

Import geo data. Deprecated and scheduled for removal in a future release.

'false'

header

Either 'true' or 'false', indicating whether the input file has a header line in Line 1 that should be skipped.

'true'

line_delimiter

A single-character string for terminating each line.

'\n'

lonlat

In OmniSci, POINT fields require longitude before latitude. Use this parameter based on the order of longitude and latitude in your source data.

'true'

max_reject

Number of records that the COPY statement allows to be rejected before terminating the COPY command. Records can be rejected for a number of reasons, including invalid content in a field, or an incorrect number of columns. The details of the rejected records are reported in the ERROR log. COPY returns a message identifying how many records are rejected. The records that are not rejected are inserted into the table, even if the COPY stops because the max_reject count is reached.

Note: If you run the COPY command from OmniSci Immerse, the COPY command does not return messages to Immerse once the SQL is verified. Immerse does not show messages about data loading, or about data-quality issues that result in max_reject triggers.

100,000

nulls

A string pattern indicating that a field is NULL.

An empty string, 'NA', or \N

parquet

Import data in Parquet format. Parquet files can be compressed using Snappy. Other archives such as .gz or .zip must be unarchived before you import the data. Deprecated and scheduled for removal in a future release.

'false'

plain_text

Indicates that the input file is plain text so that it bypasses the libarchive decompression utility.

CSV, TSV, and TXT are handled as plain text.

quote

A single-character string for quoting a field.

" (double quote). All characters inside quotes are imported “as is,” except for line delimiters.

quoted

Either 'true' or 'false', indicating whether the input file contains quoted fields.

'true'

source_srid

When importing into GEOMETRY(*, 4326) columns, specifies the SRID of the incoming geometries, all of which are transformed on the fly. For example, to import from a file that contains EPSG:2263 (NAD83 / New York Long Island) geometries, run the COPY command and include WITH (source_srid=2263). Data targeted at non-4326 geometry columns is not affected.

source_type='<type>'

Type can be one of the following:

delimited_file - Import as CSV.

geo_file - Import as Geo file. Use for shapefiles, GeoJSON, and other geo files. Equivalent to deprecated geo='true'.

raster_file - Import as a raster file.

parquet_file - Import as a Parquet file. Equivalent to deprecated parquet='true'.

delimited_file

threads

Number of threads for performing the data import.

Number of CPU cores on the system

trim_spaces

Indicate whether to trim side spaces ('true') or not ('false').

'false'

By default, the CSV parser assumes one row per line. To import a file with multiple lines in a single field, specify threads = 1 in the WITH clause.

Examples

COPY tweets FROM '/tmp/tweets.csv' WITH (nulls = 'NA'); 
COPY tweets FROM '/tmp/tweets.tsv' WITH (delimiter = '\t', quoted = 'false'); 
COPY tweets FROM '/tmp/*' WITH (header='false'); 
COPY trips FROM '/mnt/trip/trip.parquet/part-00000-0284f745-1595-4743-b5c4-3aa0262e4de3-c000.snappy.parquet' with (parquet='true');

Geo Import

You can use COPY FROM to import geo files. You can create the table based on the source file and then load the data:

COPY FROM 'source' WITH (source_type='geo_file', ...);

You can also append data to an existing, predefined table:

COPY tableName FROM 'source' WITH (source_type='geo_file', ...);

Use the following syntax, depending on the file source.

Local server

COPY [tableName] FROM '/filepath' WITH (source_type='geo_file', ...);

Web site

COPY [tableName] FROM '[http _https_]://_website/filepath_' WITH (source_type='geo_file', ...);

Amazon S3

COPY [tableName] FROM 's3://bucket/filepath' WITH (source_type='geo_file', s3_region='region', s3_access_key='accesskey', s3_secret_key='secretkey', ... );

If you are using COPY FROM to load to an existing table, the field type must match the metadata of the source file. If it does not, COPY FROM throws an error and does not load the data.
COPY FROM appends data from the source into the target table. It does not truncate the table or overwrite existing data.
Supported DATE formats when using COPY FROM include mm/dd/yyyy, dd-mmm-yy, yyyy-mm-dd, and dd/mmm/yyyy.
COPY FROM fails for records with latitude or longitude values that have more than 4 decimal places.

The following WITH options are available for geo file imports from all sources.

geo_coords_type

Coordinate type used; must be geography.

N/A

geo_coords_encoding

Coordinates encoding; can be geoint(32) or none.

geoint(32)

geo_coords_srid

Coordinates spatial reference; must be 4326 (WGS84 longitude/latitude).

N/A

geo_explode_collections

Explodes MULTIPOLYGON, MULTILINESTRING, or MULTIPOINT geo data into multiple rows in a POLYGON, LINESTRING, or POINT column, with all other columns duplicated.

When importing from a WKT CSV with a MULTIPOLYGON column, the table must have been manually created with a POLYGON column.

When importing from a geo file, the table is automatically created with the correct type of column.

When the input column contains a mixture of MULTI and single geo, the MULTI geo are exploded, but the singles are imported normally. For example, a column containing five two-polygon MULTIPOLYGON rows and five POLYGON rows imports as a POLYGON column of fifteen rows.

false

geo_validate_geometry

Boolean. If enabled, the importer passes any incoming POLYGON or MULTIPOLYGON data through a validation process. If the geo is considered invalid by OGC (PostGIS) standards (for example, self-intersecting polygons), then the row or feature that contains it is rejected.

This option is available only if the optional is installed; otherwise invoking the option throws an error.

Currently, a manually created geo table can have only one geo column. If it has more than one, import is not performed.

Any GDAL-supported file type can be imported. If it is not supported, GDAL throws an error.

An ESRI file geodatabase can have multiple layers, and importing it results in the creation of one table for each layer in the file. This behavior differs from that of importing shapefiles, GeoJSON, or KML files, which results in a single table. For more information, see Importing an ESRI File Geodatabase.

The first compatible file in the bundle is loaded; subfolders are traversed until a compatible file is found. The rest of the contents in the bundle are ignored. If the bundle contains multiple filesets, unpack the file manually and specify it for import.

For more information about importing specific geo file formats, see Importing Geospatial Files.

CSV files containing WKT strings are not considered geo files and should not be parsed with the source_type='geo' option. When importing WKT strings from CSV files, you must create the table first. The geo column type and encoding are specified as part of the DDL. For example, for a polygon with no encoding, try the following:

ggpoly GEOMETRY(POLYGON, 4326) ENCODING COMPRESSED(32)

Raster Import

You can use COPY FROM to import raster files supported by GDAL as one row per pixel, where a pixel may consist of one or more data bands, with optional corresponding pixel or world-space coordinate columns. This allows the data to be rendered as a point/symbol cloud that approximates a 2D image.

COPY FROM 'source' WITH (source_type='raster_file', ...);

Use the same syntax that you would for geo files, depending on the file source.

The following WITH options are available for raster file imports from all sources.

Parameter

Description

Default Value

raster_import_bands='<bandname>[,<bandname>,...]'

Allows specification of one or more band names to selectively import; useful in the context of large raster files where not all the bands are relevant. Bands are imported in the order provided, regardless of order in the file. You can rename bands using <bandname>=<newname>[,<bandname>=<newname,...>]Names must be those discovered by the , including any suffixes for de-duplication.

An empty string, indicating to import all bands from all datasets found in the file.

raster_point_transform='<transform>'

Specifies the processing for floating-point coordinate values: auto - Transform based on raster file type (world for geo, none for non-geo).

none - No affine or world-space conversion. Values will be equivalent to the integer pixel coordinates.

file - File-space affine transform only. Values will be in the file's coordinate system, if any (e.g. geospatial).

world - World-space geospatial transform. Values will be projected to WGS84 lon/lat (if the file has a geospatial SRID).

auto

raster_point_type='<type>'

Specifies the required type for the additional pixel coordinate columns: auto - Create columns based on raster file type (double for geo, int or smallint for non-geo, dependent on size).

none - Do not create pixel coordinate columns.

smallint or int - Create integer columns of names raster_x and raster_y and fill with the raw pixel coordinates from the file.

float or double - Create floating-point columns of names raster_x and raster_y (or raster_lon and raster_lat) and fill with file-space or world-space projected coordinates.

point - Create a POINT column of name raster_point and fill with file-space or world-space projected coordinates.

auto

Illegal combinations of raster_point_type and raster_point_transform are rejected. For example, world transform can only be performed on raster files that have a geospatial coordinate system in their metadata, and cannot be performed if <type> is an integer format (which cannot represent world-space coordinate values).

Any GDAL-supported file type can be imported. If it is not supported, GDAL throws an error.

HDF5 and possibly other GDAL drivers may not be thread-safe, so use WITH (threads=1) when importing.

Archive file import (.zip, .tar, .tar.gz) is not currently supported for raster files.

Band and Column Names

The following raster file formats contain the metadata required to derive sensible names for the bands, which are then used for their corresponding columns:

GRIB2 - geospatial/meteorological format
OME TIFF - an OpenMicroscopy format

The band names from the file are sanitized (illegal characters and spaces removed) and de-duplicated (addition of a suffix in cases where the same band name is repeated within the file or across datasets).

For other formats, the columns are named band_1_1, band_1_2 , and so on.

The sanitized and de-duplicated names must be used for the raster_import_bands option.

Band and Column Data Types

Raster files can have bands in the following data types:

Signed or unsigned 8-, 16-, or 32-bit integer
32- or 64-bit floating point
Complex number formats (not supported)

Signed data is stored in the directly corresponding column type, as follows:

int8 -> TINYINT int16 -> SMALLINT int32 -> INT float32 -> FLOAT float64 -> DOUBLE

Unsigned integer column types are not currently supported, so any data of those types is converted to the next size up signed column type:

uint8 -> SMALLINT uint16 -> INT uint32 -> BIGINT

Column types cannot currently be overridden.

ODBC Import

ODBC import is currently a beta feature.

You can use COPY FROM to import data from a Relational Database Management System (RDMS) or data warehouse using the Open Database Connectivity (ODBC) interface.

COPY <table_name> FROM '<select_query>' WITH (source_type = 'odbc', ...);

The following WITH options are available for ODBC import.

data_source_name

Data source name (DSN) configured in the odbc.ini file. Only one of data_source_name or connection_string can be specified.

connection_string

A set of semicolon-separated key=value pairs that define the connection parameters for an RDMS. For example: Driver=DriverName;Database=DatabaseName;Servername=HostName;Port=1234

Only one of data_source_name or connection_string can be specified.

sql_order_by

Comma-separated list of column names that provide a unique ordering for the result set returned by the specified SQL SELECT statement.

username

Username on the RDMS. Applies only when data_source_name is used.

password

Password credential for the RDMS. This option only applies when data_source_name is used.

credential_string

A set of semicolon separated “key=value” pairs, which define the access credential parameters for an RDMS. For example:

Username=username;Password=password

Applies only when connection_string is used.

Examples

Using a data source name:

COPY example_table
  FROM 'SELECT * FROM remote_postgres_table WHERE event_timestamp > ''2020-01-01'';'
  WITH 
    (source_type = 'odbc', 
     sql_order_by = 'event_timestamp',
     data_source_name = 'postgres_db_1',
     username = 'my_username',
     password = 'my_password');

Using a connection string:

COPY example_table
  FROM 'SELECT * FROM remote_postgres_table WHERE event_timestamp > ''2020-01-01'';'
  WITH 
    (source_type = 'odbc',
     sql_order_by = 'event_timestamp',
     connection_string = 'Driver=PostgreSQL;Database=my_postgres_db;Servername=my_postgres.example.com;Port=1234',
     credential_string = 'Username=my_username;Password=my_password');

For information about using ODBC HeavyConnect, see ODBC Data Wrapper Reference.

Globbing, Filtering, and Sorting Parquet and CSV Files

These examples assume the following folder and file structure:

Globbing

Local Parquet/CSV files can now be globbed by specifying either a path name with a wildcard or a folder name.

Globbing a folder recursively returns all files under the specified folder. For example,

COPY table_1 FROM ".../subdir";

returns file_3, file_4, file_5.

Globbing with a wildcard returns any file paths matching the expanded file path. So

COPY table_1 FROM ".../subdir/file*"; returns file_3, file_4.

Does not apply to S3 cases, because file paths specified for S3 always use prefix matching.

Filtering

Use file filtering to filter out unwanted files that have been globbed. To use filtering, specify the REGEX_PATH_FILTER option. Files not matching this pattern are not included on import. Consistent across local and S3 use cases.

The following regex expression:

COPY table_1 from ".../" WITH (REGEX_PATH_FILTER=".*file_[4-5]");

returns file_4, file_5.

Sorting

Use the FILE_SORT_ORDER_BY option to specify the order in which files are imported.

FILE_SORT_ORDER_BY Options

pathname (default)
date_modified
regex *
regex_date *
regex_number *

*FILE_SORT_REGEX option required

Using FILE_SORT_ORDER_BY

COPY table_1 from ".../" WITH (FILE_SORT_ORDER_BY="date_modified");

Using FILE_SORT_ORDER_BY with FILE_SORT_REGEX

Regex sort keys are formed by the concatenation of all capture groups from the FILE_SORT_REGEX expression. Regex sort keys are strings but can be converted to dates or FLOAT64 with the appropriate FILE_SORT_ORDER_BY option. File paths that do not match the provided capture groups or that cannot be converted to the appropriate date or FLOAT64 are treated as NULLs and sorted to the front in a deterministic order.

Multiple Capture Groups:

FILE_SORT_REGEX=".*/data_(.*)_(.*)_" /root/dir/unmatchedFile → <NULL> /root/dir/data_andrew_54321_ → andrew54321 /root/dir2/data_brian_Josef_ → brianJosef

Dates:

FILE_SORT_REGEX=".*data_(.*) /root/data_222 → <NULL> (invalid date conversion) /root/data_2020-12-31 → 2020-12-31 /root/dir/data_2021-01-01 → 2021-01-01

Import:

COPY table_1 from ".../" WITH (FILE_SORT_ORDER_BY="regex", FILE_SORT_REGEX=".*file_(.)");

Geo and Raster File Globbing

Limited filename globbing is supported for both geo and raster import. For example, to import a sequence of same-format GeoTIFF files into a single table, you can run the following:

COPY table FROM '/path/path/something_*.tiff' WITH (source_type='raster_file')

The files are imported in alphanumeric sort order, per regular glob rules, and all appended to the same table. This may fail if the files are not all of the same format (band count, names, and types).

For non-geo/raster files (CSV and Parquet), you can provide just the path to the directory OR a wildcard; for example:

/path/to/directory/ /path/to/directory /path/to/directory/*

For geo/raster files, a wildcard is required, as shown in the last example.

SQLImporter

SQLImporter is a Java utility run at the command line. It runs a SELECT statement on another database through JDBC and loads the result set into HeavyDB.

Usage

java -cp [HEAVY.AI utility jar file]:[3rd party JDBC driver]
SQLImporter
-u <userid>; -p <password>; [(--binary|--http|--https [--insecure])]
-s <heavyai server host> -db <omnsci db> --port <heavyai server port>
[-d <other database JDBC drive class>] -c <other database JDBC connection string>
-su <other database user> -sp <other database user password> -ss <other database sql statement>
-t <HEAVY.AI target table> -b <transfer buffer size> -f <table fragment size>
[-tr] [-nprg] [-adtf] [-nlj] -i <init commands file>

Flags

-r                                     Row load limit 
-h,--help                              Help message
-r <arg>;                              Row load limit 
-h,--help                              Help message 
-u,--user <arg>;                       HEAVY.AI user 
-p,--passwd <arg>;                     HEAVY.AI password 
--binary                               Use binary transport to connect to HEAVY.AI 
--http                                 Use http transport to connect to HEAVY.AI 
--https                                Use https transport to connect to HEAVY.AI 
-s,--server <arg>;                     HEAVY.AI Server 
-db,--database <arg>;                  HEAVY.AI Database 
--port <arg>;                          HEAVY.AI Port 
--ca-trust-store <arg>;                CA certificate trust store 
--ca-trust-store-passwd <arg>;         CA certificate trust store password 
--insecure <arg>;                      Insecure TLS - Do not validate server HEAVY.AI 
                                       server certificates 
-d,--driver <arg>;                     JDBC driver class 
-c,--jdbcConnect <arg>;                JDBC connection string 
-su,--sourceUser <arg>;                Source user 
-sp,--sourcePasswd <arg>;              Source password 
-ss,--sqlStmt <arg>;                   SQL Select statement 
-t,--targetTable <arg>;                HEAVY.AI Target Table 
-b,--bufferSize <arg>;                 Transfer buffer size 
-f,--fragmentSize <arg>;               Table fragment size 
-tr,--truncate                         Truncate table if it exists 
-nprg,--noPolyRenderGroups             Disable render group assignment  
-adtf,--allowDoubleToFloat             Allow narrow casting
-nlj,--no-log-jdbc-connection-string   Omit JDBC connection string from logs   
-i,--initializeFile <arg>;             File containing init command for DB

HEAVY.AI recommends that you use a service account with read-only permissions when accessing data from a remote database.

In release 4.6 and higher, the user ID (-u) and password (-p) flags are required. If your password includes a special character, you must escape the character using a backslash (\).

If the table does not exist in HeavyDB, SQLImporter creates it. If the target table in HeavyDB does not match the SELECT statement metadata, SQLImporter fails.

If the truncate flag is used, SQLImporter truncates the table in HeavyDB before transferring the data. If the truncate flag is not used, SQLImporter appends the results of the SQL statement to the target table in HeavyDB.

The -i argument provides a path to an initialization file. Each line of the file is sent as a SQL statement to the remote database. You can use -i to set additional custom parameters before the data is loaded.

The SQLImporter string is case-sensitive. Incorrect case returns the following:

Error: Could not find or load main class com.mapd.utility.SQLimporter

PostgreSQL/PostGIS Support

You can migrate geo data types from a PostgreSQL database. The following table shows the correlation between PostgreSQL/PostGIS geo types and HEAVY.AI geo types.

point

lseg

linestring

polygon

multipolygon

Other PostgreSQL types, including circle, box, and path, are not supported.

HeavyDB Example

java -cp /opt/heavyai/bin/heavyai-utility-<db-version}>.jar 
com.mapd.utility.SQLImporter -u admin -p HyperInteractive -db heavyai --port 6274 
-t mytable -su admin -sp HyperInteractive -c "jdbc:heavyai:myhost:6274:heavyai" 
-ss "select * from mytable limit 1000000000"

By default, 100,000 records are selected from HeavyDB. To select a larger number of records, use the LIMIT statement.

Hive Example

java -cp /opt/heavyai/bin/heavyai-utility-<db-version>.jar:/hive-jdbc-1.2.1000.2.6.1.0-129-standalone.jar
com.mapd.utility.SQLImporter
-u user -p password
-db Heavyai_database_name --port 6274 -t Heavyai_table_name
-su source_user -sp source_password
-c "jdbc:hive2://server_address:port_number/database_name"
-ss "select * from source_table_name"

Google Big Query Example

java -cp /opt/heavyai/bin/heavyai-utility-<db-version>.jar:./GoogleBigQueryJDBC42.jar:
./google-oauth-client-1.22.0.jar:./google-http-client-jackson2-1.22.0.jar:./google-http-client-1.22.0.jar:./google-api-client-1.22.0.jar:
./google-api-services-bigquery-v2-rev355-1.22.0.jar 
com.mapd.utility.SQLImporter
-d com.simba.googlebigquery.jdbc42.Driver 
-u user -p password
-db Heavyai_database_name --port 6274 -t Heavyai_table_name
-su source_user -sp source_password 
-c "jdbc:bigquery://https://www.googleapis.com/bigquery/v2:443;ProjectId=project-id;OAuthType=0;
[email protected];OAuthPvtKeyPath=/home/simba/myproject.json;"
-ss "select * from schema.source_table_name"

PostgreSQL Example

java -cp /opt/heavyai/bin/heavyai-utility-<db-version>.jar:/tmp/postgresql-42.2.5.jar 
com.mapd.utility.SQLImporter 
-u user -p password
-db Heavyai_database_name --port 6274 -t Heavyai_table_name
-su source_user -sp source_password 
-c "jdbc:postgresql://127.0.0.1/postgres"
-ss "select * from schema_name.source_table_name"

SQLServer Example

java -cp /opt/heavyai/bin/heavyai-utility-<db-version>.jar:/path/sqljdbc4.jar
com.mapd.utility.SQLImporter
-d com.microsoft.sqlserver.jdbc.SQLServerDriver 
-u user -p password
-db Heavyai_database_name --port 6274 -t Heavyai_table_name
-su source_user -sp source_password 
-c "jdbc:sqlserver://server:port;DatabaseName=database_name"
-ss "select top 10 * from dbo.source_table_name"

MySQL Example

java -cp /opt/heavyai/bin/heavyai-utility-<db-version>.jar:mysql/mysql-connector-java-5.1.38-bin.jar
com.mapd.utility.SQLImporter 
-u user -p password
-db Heavyai_database_name --port 6274 -t Heavyai_table_name
-su source_user -sp source_password 
-c "jdbc:mysql://server:port/database_name"
-ss "select * from schema_name.source_table_name"

StreamInsert

Stream data into HeavyDB by attaching the StreamInsert program to the end of a data stream. The data stream can be another program printing to standard out, a Kafka endpoint, or any other real-time stream output. You can specify the appropriate batch size, according to the expected stream rates and your insert frequency. The target table must exist before you attempt to stream data into the table.

<data stream> | StreamInsert <table name> <database name> \
{-u|--user} <user> {-p|--passwd} <password> [{--host} <hostname>] \
[--port <port number>][--delim <delimiter>][--null <null string>] \
[--line <line delimiter>][--batch <batch size>][{-t|--transform} \
transformation ...][--retry_count <num_of_retries>] \
[--retry_wait <wait in secs>][--print_error][--print_transform]

Setting

Default

Description

<table_name>

n/a

Name of the target table in OmniSci

<database_name>

n/a

Name of the target database in OmniSci

-u

n/a

User name

-p

n/a

User password

--host

n/a

Name of OmniSci host

--delim

comma (,)

Field delimiter, in single quotes

--line

newline (\n)

Line delimiter, in single quotes

--batch

10000

Number of records in a batch

--retry_count

Number of attempts before job fails

--retry_wait

Wait time in seconds after server connection failure

--null

n/a

String that represents null values

--port

6274

Port number for OmniSciDB on localhost

`-t

--transform`

n/a

Regex transformation

--print_error

False

Print error messages

--print_transform

False

Print description of transform.

--help

n/a

List options

For more information on creating regex transformation statements, see RegEx Replace.

Example

cat file.tsv | /path/to/heavyai/SampleCode/StreamInsert stream_example \
heavyai --host localhost --port 6274 -u imauser -p imapassword \
--delim '\t' --batch 1000

Importing AWS S3 Files

You can use the SQL COPY FROM statement to import files stored on Amazon Web Services Simple Storage Service (AWS S3) into an HEAVY.AI table, in much the same way you would with local files. In the WITH clause, specify the S3 credentials and region information of the bucket accessed.

COPY <table> FROM '<S3_file_URL>' WITH ([[s3_access_key = '<key_name>',s3_secret_key = '<key_secret>',] | [s3_session_token - '<AWS_session_token']] s3_region = '<region>');

Access key and secret key, or session token if using temporary credentials, and region are required. For information about AWS S3 credentials, see https://docs.aws.amazon.com/general/latest/gr/aws-sec-cred-types.html#access-keys-and-secret-access-keys.

HEAVY.AI does not support the use of asterisks (*) in URL strings to import items. To import multiple files, pass in an S3 path instead of a file name, and COPY FROM imports all items in that path and any subpath.

Custom S3 Endpoints

HEAVY.AI supports custom S3 endpoints, which allows you to import data from S3-compatible services, such as Google Cloud Storage.

To use custom S3 endpoints, add s3_endpoint to the WITH clause of a COPY FROM statement; for example, to set the S3 endpoint to point to Google Cloud Services:

COPY trips FROM 's3://heavyai-importtest-data/trip-data/trip_data_9.gz' WITH (header='true', s3_endpoint='storage.googleapis.com');

For information about interoperability and setup for Google Cloud Services, see Cloud Storage Interoperability.

You can also configure custom S3 endpoints by passing the s3_endpoint field to Thrift import_table.

Examples

The following examples show failed and successful attempts to copy the table trips from AWS S3.

heavysql> COPY trips FROM 's3://heavyai-s3-no-access/trip_data_9.gz';
Exception: failed to list objects of s3 url 's3://heavyai-s3-no-access/trip_data_9.gz': AccessDenied: Access Denied

heavysql> COPY trips FROM 's3://heavyai-s3-no-access/trip_data_9.gz' with (s3_access_key='xxxxxxxxxx',s3_secret_key='yyyyyyyyy');
Exception: failed to list objects of s3 url 's3://heavyai-s3-no-access/trip_data_9.gz': AuthorizationHeaderMalformed: Unable to parse ExceptionName: AuthorizationHeaderMalformed Message: The authorization header is malformed; the region 'us-east-1' is wrong; expecting 'us-west-1'

heavysql> COPY trips FROM 's3://heavyai-testdata/trip.compressed/trip_data_9.csv' with (s3_access_key='xxxxxxxx',s3_secret_key='yyyyyyyy',s3_region='us-west-1');
Result
Loaded: 100 recs, Rejected: 0 recs in 0.361000 secs

The following example imports all the files in the trip.compressed directory.

heavysql> copy trips from 's3://heavyai-testdata/trip.compressed/' with (s3_access_key='xxxxxxxx',s3_secret_key='yyyyyyyy',s3_region='us-west-1');
Result
Loaded: 105200 recs, Rejected: 0 recs in 1.890000 secs

trips Table

The table trips is created with the following statement:

heavysql> \d trips
        CREATE TABLE trips (
        medallion TEXT ENCODING DICT(32),
        hack_license TEXT ENCODING DICT(32),
        vendor_id TEXT ENCODING DICT(32),
        rate_code_id SMALLINT,
        store_and_fwd_flag TEXT ENCODING DICT(32),
        pickup_datetime TIMESTAMP,
        dropoff_datetime TIMESTAMP,
        passenger_count SMALLINT,
        trip_time_in_secs INTEGER,
        trip_distance DECIMAL(14,2),
        pickup_longitude DECIMAL(14,2),
        pickup_latitude DECIMAL(14,2),
        dropoff_longitude DECIMAL(14,2),
        dropoff_latitude DECIMAL(14,2))
WITH (FRAGMENT_SIZE = 75000000);

Using Server Privileges to Access AWS S3

You can configure HEAVY.AI server to provide AWS credentials, which allows S3 Queries to be run without specifying AWS credentials. S3 Regions are not configured by the server, and will need to be passed in either as a client side environment variable or as an option with the request.

Example Commands

\detect: $ export AWS_REGION=us-west-1 heavysql > \detect <s3-bucket-uri
import_table: $ ./Heavyai-remote -h localhost:6274 import_table "'<session-id>'" "<table-name>" '<s3-bucket-uri>' 'TCopyParams(s3_region="'us-west-1'")'
COPY FROM: heavysql > COPY <table-name> FROM <s3-bucket-uri> WITH(s3_region='us-west-1');

Configuring AWS Credentials

Enable server privileges in the server configuration file heavy.conf allow-s3-server-privileges = true
For bare metal installations set the following environment variables and restart the HeavyDB service: AWS_ACCESS_KEY_ID=xxx AWS_SECRET_ACCESS_KEY=xxx AWS_SESSION_TOKEN=xxx (required only for AWS STS credentials)
For HeavyDB docker images, start a new container mounted with the configuration file using the option: -v <dirname-containing-heavy.conf>:/var/lib/heavyaiand set the following environment options: -e AWS_ACCESS_KEY_ID=xxx -e AWS_SECRET_ACCESS_KEY=xxx -e AWS_SESSION_TOKEN=xxx (required only for AWS STS credentials)

Enable server privileges in the server configuration file heavy.conf allow-s3-server-privileges = true
For bare metal installations Specify a shared AWS credentials file and profile with the following environment variables and restart the HeavyDB service. AWS_SHARED_CREDENTIALS_FILE=~/.aws/credentials AWS_PROFILE=default
For HeavyDB docker images, start a new container mounted with the configuration file and AWS shared credentials file using the following options: -v <dirname-containing-/heavy.conf>:/var/lib/heavyai -v <dirname-containing-/credentials>:/<container-credential-path>and set the following environment options: -e AWS_SHARED_CREDENTIALS_FILE=<container-credential-path> -e AWS_PROFILE=<active-profile>

Prerequisites

An IAM Policy that has sufficient access to the S3 bucket.
An IAM AWS Service Role of type Amazon EC2 , which is assigned the IAM Policy from (1).

Setting Up an EC2 Instance with Roles

For a new EC2 Instance:

AWS Management Console > Services > Compute > EC2 > Launch Instance.
Select desired Amazon Machine Image (AMI) > Select.
Select desired Instance Type > Next: Configure Instance Details.
IAM Role > Select desired IAM Role > Review and Launch.
Review other options > Launch.

For an existing EC2 Instance:

AWS Management Console > Services > Compute > EC2 > Instances.
Mark desired instance(s) > Actions > Security > Modify IAM Role.
Select desired IAM Role > Save.
Restart the EC2 Instance.

KafkaImporter

You can ingest data from an existing Kafka producer to an existing table in HEAVY.AI using KafkaImporter on the command line:

KafkaImporter <table_name> <database_name> {-u|--user <user_name> \
{-p|--passwd <user_password>} [{--host} <hostname>] \
[--port <HeavyDB_port>] [--http] [--https] [--skip-verify] \
[--ca-cert <path>] [--delim <delimiter>] [--batch <batch_size>] \
[{-t|--transform} transformation ...] [retry_count <retry_number>] \
[--retry_wait <delay_in_seconds>] --null <null_value_string> [--quoted true|false] \
[--line <line_delimiter>] --brokers=<broker_name:broker_port> \ 
--group-id=<kafka_group_id> --topic=<topic_type> [--print_error] [--print_transform]

KafkaImporter requires a functioning Kafka cluster. See the Kafka website and the Confluent schema registry documentation.

KafkaImporter Options

Setting

Default

Description

<table_name>

n/a

Name of the target table in OmniSci

<database_name>

n/a

Name of the target database in OmniSci

-u <username>

n/a

User name

-p <password>

n/a

User password

--host <hostname>

localhost

Name of OmniSci host

--port <port_number>

6274

Port number for OmniSciDB on localhost

--http

n/a

Use HTTP transport

--https

n/a

Use HTTPS transport

--skip-verify

n/a

Do not verify validity of SSL certificate

--ca-cert <path>

n/a

Path to the trusted server certificate; initiates an encrypted connection

--delim <delimiter>

comma (,)

Field delimiter, in single quotes

--line <delimiter>

newline (\n)

Line delimiter, in single quotes

--batch <batch_size>

10000

Number of records in a batch

--retry_count <retry_number>

Number of attempts before job fails

--retry_wait <seconds>

Wait time in seconds after server connection failure

--null <string>

n/a

String that represents null values

--quoted <boolean>

false

Whether the source contains quoted fields

`-t

--transform`

n/a

Regex transformation

--print_error

false

Print error messages

--print_transform

false

Print description of transform

--help

n/a

List options

--group-id <id>

n/a

Kafka group ID

--topic <topic>

n/a

The Kafka topic to be ingested

--brokers <broker_name:broker_port>

localhost:9092

One or more brokers

KafkaImporter Logging Options

Setting

Default

Description

--log-directory <directory>

mapd_log

Logging directory; can be relative to data directory or absolute

--log-file-name <filename>

n/a

Log filename relative to logging directory; has format KafkaImporter.{SEVERITY}.%Y%m%d-%H%M%S.log

--log-symlink <symlink>

n/a

Symlink to active log; has format KafkaImporter.{SEVERITY}

--log-severity <level>

INFO

Log-to-file severity level: INFO, WARNING, ERROR, or FATAL

--log-severity-clog <level>

ERROR

Log-to-console severity level: INFO, WARNING, ERROR, or FATAL

--log-channels

n/a

Log channel debug info

--log-auto-flush

n/a

Flush logging buffer to file after each message

--log-max-files <files_number>

100

Maximum number of log files to keep

--log-min-free-space <bytes>

20,971,520

Minimum number of bytes available on the device before oldest log files are deleted

--log-rotate-daily

Start new log files at midnight

--log-rotation-size <bytes>

10485760

Maximum file size, in bytes, before new log files are created

Configure KafkaImporter to use your target table. KafkaImporter listens to a pre-defined Kafka topic associated with your table. You must create the table before using the KafkaImporter utility. For example, you might have a table named customer_site_visit_events that listens to a topic named customer_site_visit_events_topic.

The data format must be a record-level format supported by HEAVY.AI.

KafkaImporter listens to the topic, validates records against the target schema, and ingests topic batches of your designated size to the target table. Rejected records use the existing reject reporting mechanism. You can start, shut down, and configure KafkaImporter independent of the HeavyDB engine. If KafkaImporter is running and the database shuts down, KafkaImporter shuts down as well. Reads from the topic are nondestructive.

KafkaImporter is not responsible for event ordering; a streaming platform outside HEAVY.AI (for example, Spark streaming, flink) should handle the stream processing. HEAVY.AI ingests the end-state stream of post-processed events.

KafkaImporter does not handle dynamic schema creation on first ingest, but must be configured with a specific target table (and its schema) as the basis. There is a 1:1 correspondence between target table and topic.

cat tweets.tsv | -./KafkaImporter tweets_small heavyai-u imauser-p imapassword--delim '\t'--batch 100000--retry_count 360--retry_wait 10--null null--port 9999--brokers=localhost:9092--group-id=testImport1--topic=tweet
cat tweets.tsv | ./KafkaImporter tweets_small heavyai
-u imauser
-p imapassword
--delim '\t'
--batch 100000
--retry_count 360
--retry_wait 10
--null null
--port 9999
--brokers=localhost:9092
--group-id=testImport1
--topic=tweet

StreamImporter

StreamImporter is an updated version of the StreamInsert utility used for streaming reads from delimited files into HeavyDB. StreamImporter uses a binary columnar load path, providing improved performance compared to StreamInsert.

You can ingest data from a data stream to an existing table in HEAVY.AI using StreamImporter on the command line.

StreamImporter <table_name> <database_name> {-u|--user <user_name> \
{-p|--passwd <user_password>} [{--host} <hostname>] [--port <HeavyDB_port>] \
[--http] [--https] [--skipverify] [--ca-cert <path>] [--delim <delimiter>] \
[--null <null string>] [--line <line delimiter>]  [--quoted <boolean>] \
 [--batch <batch_size>] [{-t|--transform} transformation ...] \
[retry_count <number_of_retries>] [--retry_wait <delay_in_seconds>]  \
[--print_error] [--print_transform]

StreamImporter Options

Setting

Default

Description

<table_name>

n/a

Name of the target table in OmniSci

<database_name>

n/a

Name of the target database in OmniSci

-u <username>

n/a

User name

-p <password>

n/a

User password

--host <hostname>

n/a

Name of OmniSci host

--port <port>

6274

Port number for OmniSciDB on localhost

--http

n/a

Use HTTP transport

--https

n/a

Use HTTPS transport

--skip-verify

n/a

Do not verify validity of SSL certificate

--ca-cert <path>

n/a

Path to the trusted server certificate; initiates an encrypted connection

--delim <delimiter>

comma (,)

Field delimiter, in single quotes

--null <string>

n/a

String that represents null values

--line <delimiter>

newline (\n)

Line delimiter, in single quotes

--quoted <boolean>

true

Either true or false, indicating whether the input file contains quoted fields.

--batch <number>

10000

Number of records in a batch

--retry_count <retry_number>

Number of attempts before job fails

--retry_wait <seconds>

Wait time in seconds after server connection failure

`-t

--transform`

n/a

Regex transformation

--print_error

false

Print error messages

--print_transform

false

Print description of transform

--help

n/a

List options

StreamImporter Logging Options

Setting

Default

Description

--log-directory <directory>

mapd_log

Logging directory; can be relative to data directory or absolute

--log-file-name <filename>

n/a

Log filename relative to logging directory; has format StreamImporter.{SEVERITY}.%Y%m%d-%H%M%S.log

--log-symlink <symlink>

n/a

Symlink to active log; has format StreamImporter.{SEVERITY}

--log-severity <level>

INFO

Log-to-file severity level: INFO, WARNING, ERROR, or FATAL

--log-severity-clog <level>

ERROR

Log-to-console severity level: INFO, WARNING, ERROR, or FATAL

--log-channels

n/a

Log channel debug info

--log-auto-flush

n/a

Flush logging buffer to file after each message

--log-max-files <files_number>

100

Maximum number of log files to keep

--log-min-free-space <bytes>

20,971,520

Minimum number of bytes available on the device before oldest log files are deleted

--log-rotate-daily

Start new log files at midnight

--log-rotation-size <bytes>

10485760

Maximum file size, in bytes, before new log files are created

Configure StreamImporter to use your target table. StreamImporter listens to a pre-defined data stream associated with your table. You must create the table before using the StreamImporter utility.

The data format must be a record-level format supported by HEAVY.AI.

StreamImporter listens to the stream, validates records against the target schema, and ingests batches of your designated size to the target table. Rejected records use the existing reject reporting mechanism. You can start, shut down, and configure StreamImporter independent of the HeavyDB engine. If StreamImporter is running but the database shuts down, StreamImporter shuts down as well. Reads from the stream are non-destructive.

StreamImporter is not responsible for event ordering - a first class streaming platform outside HEAVY.AI (for example, Spark streaming, flink) should handle the stream processing. HEAVY.AI ingests the end-state stream of post-processed events.

StreamImporter does not handle dynamic schema creation on first ingest, but must be configured with a specific target table (and its schema) as the basis.

There is a 1:1 correspondence between target table and a stream record.

cat tweets.tsv | ./StreamImporter tweets_small heavyai
-u imauser
-p imapassword
--delim '\t'
--batch 100000
--retry_count 360
--retry_wait 10
--null null
--port 9999

Importing Data from HDFS with Sqoop

You can consume a CSV or Parquet file residing in HDFS (Hadoop Distributed File System) into HeavyDB.

Copy the HEAVY.AI JDBC driver into the Apache Sqoop library, normally found at /usr/lib/sqoop/lib/.

Example

The following is a straightforward import command. For more information on options and parameters for using Apache Sqoop, see the user guide at sqoop.apache.org.

sqoop-export --table iAmATable \
--export-dir /user/cloudera/ \
--connect "jdbc:heavyai:000.000.000.0:6274:heavyai" \
--driver com.heavyai.jdbc.HeavyaiDriver \
--username imauser \
--password imapassword \
--direct \
--batch

The --connect parameter is the address of a valid JDBC port on your HEAVY.AI instance.

Troubleshooting: Avoiding Duplicate Rows

To detect duplication prior to loading data into HeavyDB, you can perform the following steps. For this example, the files are labeled A,B,C...Z.

Load file A into table MYTABLE.
Run the following query.
```
select count(t1.uniqueCol) as dups from MYTABLE t1 join MYTABLE t2 on t1.uCol = t2.uCol;
```
There should be no rows returned; if rows are returned, your first A file is not unique.
Load file B into table TEMPTABLE.
Run the following query.
```
select count(t1.uniqueCol) as dups from MYTABLE t1 join MYTABLE t2 on t1.uCol = t2.uCol;
```
There should be no rows returned if file B is unique. Fix B if the information is not unique using details from the selection.
Load the fixed B file into MYFILE.
Drop table TEMPTABLE.
Repeat steps 3-6 for the rest of the set for each file prior to loading the data to the real MYTABLE instance.