Optimizing

Hardware

  • Even though HEAVY.AI is an “in-memory” database, it reads data from disk when it first starts up. A large database can take a long time to read from a hard disk, and import and execution performance also depend on disk throughput, so use disks with high-performance characteristics.

    As a starting point, HEAVY.AI recommends fast SSD drives on a good hardware controller in RAID 10 configuration. If you use a virtual machine such as Amazon Web Services, HEAVY.AI recommends you use Provisioned IOPS SSD disks in RAID configuration for storage.

  • Do not run unnecessary daemons. Ideally, only HEAVY.AI services run on your HEAVY.AI server.

  • For a production server, set the power profile to performance instead of power saving. This profile is typically controlled by the system BIOS and prevents the CPU from being throttled back. You must also change the corresponding Linux CPU frequency governor setting (see the example after this list).

  • A large amount of swap activity on the machine can indicate a memory shortage. Compare the amount of data the database is attempting to process in memory to the amount of memory available.

  • Because some work is always done on CPUs, speed is important. HEAVY.AI recommends you use systems that balance a high core count with high CPU speed.

  • Use the nvidia-smi -pm and nvidia-smi -ac commands to maximize GPU clock speeds:

sudo nvidia-smi -pm 1          # enable persistence mode
sudo nvidia-smi -ac 3004,875   # set application clocks: memory clock, graphics clock (MHz)
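
The application-clock values are GPU specific; the 3004,875 pair above is only an example. A minimal sketch, assuming an NVIDIA driver with nvidia-smi installed and the cpupower utility available, of how to list the supported clock pairs and switch the Linux CPU governor to performance:

# List the memory,graphics clock pairs the installed GPU supports
nvidia-smi -q -d SUPPORTED_CLOCKS

# Switch all CPU cores to the performance governor (cpupower is part of the linux-tools package)
sudo cpupower frequency-set -g performance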

Database Design

Review a representative sample of the data from which your table is to be created. This helps you determine the data types best suited to each column. Where possible, place data into columns with the smallest representation that fits the values and cardinality involved.

Look for these areas of potential optimization:

  • Can you apply fixed encoding to TIMESTAMP fields?

  • Can you apply fixed sizes, such as DICT(8) or DICT(16), to TEXT ENCODING DICT fields?

  • What kind of INTEGER is appropriate for the values involved?

  • Is DOUBLE required, or is FLOAT enough to store expected values?

  • Is ENCODING NONE set for high-cardinality TEXT fields?

  • Can the data be converted from its current form to a more denormalized form?

Using the smallest possible encoding increases the speed of all aspects of HEAVY.AI, from initial load to query execution.
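
For example, a minimal sketch of a table definition that applies these choices; the table and column names are hypothetical:

CREATE TABLE sensor_readings (
  reading_time TIMESTAMP ENCODING FIXED(32),  -- 32-bit fixed encoding instead of the default 64-bit
  device_type TEXT ENCODING DICT(16),         -- 16-bit dictionary for a low-cardinality text column
  error_code SMALLINT,                        -- smallest integer type that fits the expected values
  temperature FLOAT,                          -- FLOAT where DOUBLE precision is not required
  raw_message TEXT ENCODING NONE              -- no dictionary for a high-cardinality text column
);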

Loading Data

  • Loading flat files of 100 MB or larger is the most efficient way to import data into HEAVY.AI.

  • Consider increasing the block sizes of StreamInserter or SQLImporter to reduce the per-record overhead of loading or streaming data.

  • If you use a particular column on a regular basis to restrict queries to a table, load the table sorted on the data in that column. For example, if most queries have a DATE dimension, then load data in date order for best performance.

  • When using a large-cardinality column frequently for GROUP BY or as a JOIN column, you can improve performance by creating the table with a sort column; for example:

    CREATE TABLE x ... WITH (sort_column = 'y')

    Then, when ingesting using COPY FROM, increase the BUFFER_SIZE parameter, up to 128 MB instead of the default 8 MB, to provide a larger window for sorting the data.
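
    A minimal sketch of this pattern, assuming a hypothetical trips table sorted on a frequently grouped pickup_zone column and a CSV file at /data/trips.csv (buffer_size is specified in bytes):

    CREATE TABLE trips (
      pickup_zone TEXT ENCODING DICT(16),
      fare FLOAT
    ) WITH (sort_column = 'pickup_zone');

    -- 128 MB ingest buffer instead of the 8 MB default
    COPY trips FROM '/data/trips.csv' WITH (buffer_size = 134217728);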

Preloading Data

On startup, you can load data from a standard list of queries. You can customize the queries for each analyst and load into memory the data they commonly use first.

Creating the Query List

Create a file with a code block for each user. The keyword USER must be uppercase. Provide the user name and database name, followed by a series of SQL query statements enclosed in curly braces.

USER user_name db_name {
query_a1;
query_a2;
...
query_aN;
}

Add code blocks for all users who benefit from pre-loaded data.

...
USER user_name_2 db_name_2 {
query_b1;
query_b2;
...
query_bN;
}

Ideally, you should use a curated sampling of common queries extracted from log files.

To load a column into GPU memory, use an aggregate function on a specific column. For example, you can use AVG(columnName) for numerical columns, and COUNT(columnName) for non-nullable text columns.
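
For example, a hypothetical entry for an analyst who works mostly with a trips table on the heavyai database (the user, database, table, and column names are placeholders):

USER analyst1 heavyai {
SELECT AVG(trip_distance) FROM trips;
SELECT COUNT(dropoff_zone) FROM trips;
}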

Loading Data Using the Query List

Follow these steps to load the default queries on startup:

  1. Create your query list file and store it on your HEAVY.AI server.

  2. If necessary, stop your HEAVY.AI server.

  3. Restart your server with the following option:

    start_heavydb --db-query-list=<path-to-query-list-file>

On startup, your queries run automatically, preloading the data they reference and speeding up subsequent queries on that data.

You must have ACCESS and SELECT permissions to preload data.

Using Joins

HeavyDB supports relational joins across all scalar column types and with any Boolean binary operator. If you use joins in HEAVY.AI, consider the following to optimize query performance:

  • Equijoin queries (joins between two columns with an = predicate) are typically executed using an accelerated hash join framework, and therefore perform better than range joins. For example:

    JOIN table_2 ON table_1.column_1 = table_2.column_b

    performs better than:

    JOIN table_2 ON table_1.column_1 > table_2.column_b AND table_1.column_1 < table_2.column_b + 1000

  • For an equijoin query, HeavyDB runs more efficiently if the range of the join key space is small. For example, say you are joining on a column defined as BIGINT that contains approximately 1 million reference numbers ranging from 1000000000 to 9000000000. HEAVY.AI interprets that range as 8 billion potential values: every value between 1 billion and 9 billion.

    Instead, store these keys as TEXT ENCODING DICT. This maps each reference number to a unique identifier, so HEAVY.AI recognizes that only 1 million distinct values exist and optimizes the join accordingly (see the sketch after this list).

  • If join performance is slowing your Immerse dashboard, consider using CREATE TABLE AS SELECT to materialize a join expression as a new table, as shown in the sketch after this list. You can then use the new table in your Immerse charts.

  • In distributed mode, either sharding of both tables or replication of the inner table is required to execute joins.

  • Sharding can significantly improve performance of join operations in multi-GPU single-node setups if the join is executed on the shard key.
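
A minimal sketch of the key-encoding and materialized-join points above; the customers and orders tables, their columns, and the reference numbers are hypothetical:

-- Store sparse numeric reference keys as dictionary-encoded text so the join
-- key space matches the ~1 million distinct values rather than an 8-billion range
CREATE TABLE customers (
  customer_ref TEXT ENCODING DICT(32),
  region TEXT ENCODING DICT(8)
);

CREATE TABLE orders (
  customer_ref TEXT ENCODING DICT(32),
  amount FLOAT
);

-- Materialize the equijoin once so Immerse charts read from a plain table
CREATE TABLE orders_enriched AS
SELECT o.customer_ref, o.amount, c.region
FROM orders o JOIN customers c ON o.customer_ref = c.customer_ref;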

Parallel GPUs

Parsing, optimization, and parts of rendering can overlap between queries, but most query execution runs serially, one query at a time. In general, you get the most throughput on the GPU by letting each query use all of the GPU resources, and contention for buffer or cache memory is not a concern. If individual queries complete quickly, latency stays low even with many simultaneous queries.

For simple queries on relatively small datasets, consider executing queries on subsets of GPUs. Different GPU groups can execute at the same time, and this configuration gains throughput by parallelizing the per-query “fixed overheads” across HEAVY.AI servers on the same node.

You can implement this behavior by running multiple HEAVY.AI servers on the same node and mapping each to different sets of GPUs with the --start-gpu and --num-gpus flags (see Configuration file).
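
For example, on an 8-GPU node you might run two heavydb instances, each owning four GPUs. Only --num-gpus and --start-gpu come from this section; each server must also be given its own data directory and ports, which are omitted here:

# Server A owns GPUs 0-3
start_heavydb --num-gpus 4 --start-gpu 0

# Server B owns GPUs 4-7
start_heavydb --num-gpus 4 --start-gpu 4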

CUDA JIT Cache

When your device driver compiles PTX code for an application, it automatically caches a copy of the generated binary code to avoid repeating the compilation in later invocations of the application. The cache — referred to as the compute cache — is automatically invalidated when you upgrade your device driver so that applications can benefit from improvements in the just-in-time compiler built into the device driver.

You can use environment variables to control just-in-time compilation.

  • CUDA_CACHE_DISABLE (default: 0): Setting this to 1 disables caching; no binary code is added to or retrieved from the cache.

  • CUDA_CACHE_MAXSIZE (default: 256 MiB): Specifies the size of the compute cache in bytes; the maximum size is 4 GiB. Binary code that exceeds the cache size is not cached, and older binary code is evicted to make room for newer code as needed.

  • CUDA_CACHE_PATH (default on macOS: $HOME/Library/Application Support/NVIDIA/ComputeCache; default on Linux: ~/.nv/ComputeCache): Specifies the directory location of the compute cache files. When running HEAVY.AI in a Docker container, HEAVY.AI recommends setting this environment variable to $HEAVYAI_STORAGE/NVIDIA/ComputeCache.

  • CUDA_FORCE_PTX_JIT (default: 0): Setting this to 1 forces the device driver to ignore any binary code embedded in an application and to just-in-time compile the embedded PTX code instead. If a kernel does not have embedded PTX code, it fails to load. You can use this environment variable to confirm that an application binary contains PTX code and that just-in-time compilation works as expected, guaranteeing forward compatibility with future architectures.
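
For example, to move the cache into HEAVY.AI storage and enlarge it, you might export the following in the environment of the HEAVY.AI service (the 1 GiB size is only an illustration):

export CUDA_CACHE_PATH="$HEAVYAI_STORAGE/NVIDIA/ComputeCache"
export CUDA_CACHE_MAXSIZE=1073741824   # 1 GiB, specified in bytes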
