HEAVY.AI Docs v8.1.0
Distributed Configuration


When installing a distributed cluster, you must run initdb --skip-geo to avoid the automatic creation of the sample geospatial data table. Otherwise, metadata across the cluster falls out of synchronization and can put the server in an unusable state.
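For example, initializing the storage directory on a node might look like this (the storage path is illustrative and depends on your installation):

```
initdb --skip-geo /var/lib/heavyai
```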

HEAVY.AI supports distributed configuration, which allows single queries to span more than one physical host when the scale of the data is too large to fit on a single machine.

In addition to increased capacity, distributed configuration has other advantages:

  • Writes to the database can be distributed across the nodes, thereby speeding up import.

  • Reads from disk are accelerated.

  • Additional GPUs in a distributed cluster can significantly increase read performance in many usage scenarios. Performance scales linearly, or near linearly, with the number of GPUs, for simple queries requiring little communication between servers.

  • Multiple GPUs across the cluster query data on their local hosts. This allows processing of larger datasets, distributed across multiple servers.

HEAVY.AI Distributed Cluster Components

A HEAVY.AI distributed database consists of three components:

  • An aggregator, which is a specialized HeavyDB instance for managing the cluster

  • One or more leaf nodes, each being a complete HeavyDB instance for storing and querying data

  • A String Dictionary Server, which is a centralized repository for all dictionary-encoded items

Conceptually, a HEAVY.AI distributed database is horizontally sharded across n leaf nodes. Each leaf node holds one nth of the total dataset. Sharding currently is round-robin only. Queries and responses are orchestrated by a HEAVY.AI Aggregator server.

The HEAVY.AI Aggregator

Clients interact with the aggregator. The aggregator orchestrates execution of a query across the appropriate leaf nodes. The aggregator composes the steps of the query execution plan to send to each leaf node, and manages their results. The full query execution might require multiple iterations between the aggregator and leaf nodes before returning a result to the client.

A core feature of HeavyDB is back-end, GPU-based rendering for data-rich charts such as point maps. When running as a distributed cluster, back-end rendering is distributed across all leaf nodes, and the aggregator composes the final image.

String Dictionary Server

The String Dictionary Server manages and allocates IDs for dictionary-encoded fields, ensuring that these IDs are consistent across the entire cluster.

The server creates a new ID for each new encoded value. For queries returning results from encoded fields, the IDs are automatically converted to the original values by the aggregator. Leaf nodes use the string dictionary for processing joins on encoded columns.

For moderately sized configurations, the String Dictionary Server can share a host with a leaf node. For larger clusters, this service can be configured to run on a small, separate CPU-only server.

Replicated Tables

By default, a table is sharded so that each of the n leaf nodes holds 1/nth of the complete dataset. When you create a table that provides dimension information, you can improve performance by replicating its contents onto every leaf node using the PARTITIONS property. For example:

CREATE TABLE flights … WITH (PARTITIONS='REPLICATED')

This reduces the distribution overhead during query execution in cases where sharding is not possible or appropriate. This is most useful for relatively small, heavily used dimension tables.
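As a sketch, a small, hypothetical dimension table could be created replicated while the large fact tables stay sharded (table and column names are illustrative):

```sql
-- Hypothetical dimension table, replicated onto every leaf node
CREATE TABLE airports (
  airport_code TEXT ENCODING DICT(32),
  airport_name TEXT ENCODING DICT(32),
  city         TEXT ENCODING DICT(32))
WITH (PARTITIONS='REPLICATED');
```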

Data Loading

You can load data into a HEAVY.AI distributed cluster by running a COPY FROM statement on the aggregator, exactly as with single-node HEAVY.AI. The aggregator distributes the data evenly across the leaf nodes.
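For example, loading a CSV file through the aggregator (table name and file path are illustrative):

```sql
COPY flights FROM '/data/flights_2024.csv' WITH (header='true');
```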

Data Compression

Records transferred between systems in a HEAVY.AI cluster are compressed to improve performance. HEAVY.AI uses the LZ4_HC compressor by default. It is the fastest compressor, but has the lowest compression rate of the available algorithms. The time required to compress each buffer is directly proportional to the final compressed size of the data. A better compression rate will likely require more time to process.

You can specify another compressor on server startup using the runtime flag compressor. Compressor choices include:

  • blosclz

  • lz4

  • lz4hc

  • snappy

  • zlib

  • zstd

For more information on the compressors used with HEAVY.AI, see also:

  • Blosc synthetic benchmarks: http://blosc.org/pages/synthetic-benchmarks/

  • Squash compression benchmark: https://quixdb.github.io/squash-benchmark/

  • LZ4: https://lz4.github.io/lz4/

HEAVY.AI does not compress the payload until it reaches a certain size. The default size limit is 512MB. You can change the size using the runtime flag compression-limit-bytes.
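Putting these flags together, a heavy.conf fragment that selects a different compressor and lowers the compression threshold might look like this (the values shown are illustrative):

```
compressor = "zstd"
compression-limit-bytes = 134217728
```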

HEAVY.AI Distributed Cluster Example

This example uses four machines, each with one or more CPUs and GPUs.

Hostname | IP         | Role(s)
-------- | ---------- | ------------------------------
Node1    | 10.10.10.1 | Leaf, Aggregator
Node2    | 10.10.10.2 | Leaf, String Dictionary Server
Node3    | 10.10.10.3 | Leaf
Node4    | 10.10.10.4 | Leaf

Install the HEAVY.AI server on each node. For larger deployments, you can place the installation on a shared drive.

Set up the configuration file for the entire cluster. This file is the same for all nodes.

[
  {
    "host": "node1",
    "port": 16274,
    "role": "dbleaf"
  },
  {
    "host": "node2",
    "port": 16274,
    "role": "dbleaf"
  },
  {
    "host": "node3",
    "port": 16274,
    "role": "dbleaf"
  },
  {
    "host": "node4",
    "port": 16274,
    "role": "dbleaf"
  },
  {
    "host": "node2",
    "port": 6277,
    "role": "string"
  }
]

The cluster.conf file identifies the location of each leaf node and the location of the String Dictionary Server. Here, the role dbleaf denotes a leaf node, and string denotes the String Dictionary Server. Each entry also specifies the port that node listens on; these ports must match the ports configured on the individual servers.

Each leaf node requires a heavy.conf configuration file.

port = 16274
http-port = 16278
calcite-port = 16279
data = "<location>/heavyai-storage/nodeLocal/data"
read-only = false
string-servers = "<location>/heavyai-storage/cluster.conf"

The string-servers parameter identifies the file containing the cluster configuration, telling the leaf node where to find the String Dictionary Server.

The aggregator node requires a slightly different heavy.conf. The file is named heavy-agg.conf in this example.

heavy-agg.conf

port = 6274
http-port = 6278
calcite-port = 6279
data = "<location>/heavyai-storage/nodeLocalAggregator/data"
read-only = false
num-gpus = 1
cluster = "<location>/heavyai-storage/cluster.conf"

[web]
port = 6273
frontend = "<location>/prod/heavyai/frontend"

The parameter cluster tells the HeavyDB instance that it is an aggregator node, and where to find the rest of its cluster.

If your aggregator node shares a machine with a leaf node, there might be a conflict on calcite-port. Consider changing the aggregator's port number to one that is not in use.
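For example, if the aggregator ran on the same host as a leaf node, its heavy-agg.conf could move the Calcite service to an unused port (the value shown is illustrative):

```
calcite-port = 7279
```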

Implementing a HEAVY.AI Distributed Cluster

Contact HEAVY.AI support for assistance with HEAVY.AI Distributed Cluster implementation.
