Optimizing
Variable
Default
Description
CUDA_CACHE_DISABLE
0
Setting to 1 disables caching (no binary code is added to or retrieved from the cache).
CUDA_CACHE_MAXSIZE
256 MiB
Specifies the size of the compute cache in bytes; the maximum size is 4 GiB. Binary code that exceeds the cache size is not cached. Older binary code is removed to make room for newer code as needed.
CUDA_CACHE_PATH
MacOS: $HOME/Library/Application Support/NVIDIA/ComputeCache
Linux: ~/.nv/ComputeCach
Specifies the directory location of computer cache files. When running OmniSci on a Docker instance, OmniSci recommends that you set this environment variable to $OMNISCI_STORAGE/NVIDIA/ComputeCache
.
CUDA_FORCE_PTX_JIT
0
Setting to 1 forces the device driver to ignore any binary code embedded in an application and to just-in-time compile embedded PTX code instead. If a kernel does not have embedded PTX code, it will fail to load. You can use this environment variable to confirm that an application binary contains PTX code and that just-in-time compilation works as expected to guarantee forward compatibility with future architectures.
Hardware
Even though HEAVY.AI is an “in-memory” database, when the database first starts up, it reads data from disk. A large database can take a long time to read from a hard disk. Import and execution performance rely on disks with high-performance characteristics to match the general nature of the database.
As a starting point, HEAVY.AI recommends fast SSD drives on a good hardware controller in RAID 10 configuration. If you use a virtual machine such as Amazon Web Services, HEAVY.AI recommends you use Provisioned IOPS SSD disks in RAID configuration for storage.
Do not run unnecessary daemons. Ideally, only HEAVY.AI services run on your HEAVY.AI server.
For a production server, set the performance setting to performance instead of power saving. The performance setting is typically controlled by the system BIOS and prevents throttling back of the CPU. You must also change the settings in the Linux power governor setup.
A large amount of swap activity on the machine can indicate a memory shortage. Compare the amount of data the database is attempting to process in memory to the amount of memory available.
Because some work is always done on CPUs, speed is important. HEAVY.AI recommends you use systems that balance a high core count with high CPU speed.
Use the
nvidia-smi -pm
andnvidia-smi -ac
commands to maximize GPU clock speeds:
Enable the NVIDIA persistence daemon
nvidia-persistenced
. See Persistence Daemon Usage.
Database Design
Review a representative sample of the data from which your table is to be created. This helps you determine the datatypes best suited to your columns. Where possible, place data into columns with the smallest representation that can fit the cardinality involved.
Look for these areas of potential optimization:
Can you apply fixed encoding to TIMESTAMP fields?
Can you apply fixed sizes to FIXED ENCODING DICT fields?
What kind of INTEGER is appropriate for the values involved?
Is DOUBLE required, or is FLOAT enough to store expected values?
Is ENCODING NONE set for high-cardinality TEXT fields?
Can the data be converted from its current form to a more denormalized form?
Using the smallest possible encoding increases the speed of all aspects of HEAVY.AI, from initial load to query execution.
Loading Data
Loading large flat files of 100 MB or larger is the most efficient way to import data to HEAVY.AI.
Consider increasing the block sizes of
StreamInserter
orSQLImporter
to reduce the overhead of records loaded or streamed.If you use a particular column on a regular basis to restrict queries to a table, load the table sorted on the data in that column. For example, if most queries have a DATE dimension, then load data in date order for best performance.
When using a large-cardinality column frequently for GROUP BY or as a JOIN column, you can improve performance by creating the table with a sort column; for example:
Then, when ingesting using COPY FROM, increase the BUFFER_SIZE parameter, up to 128 MB instead of the default 8 MB, to provide a larger window for sorting the data.
Preloading Data
On startup, you can load data from a standard list of queries. You can customize the queries for each analyst and load into memory the data they commonly use first.
Creating the Query List
Create a file with a code block for each user. The keyword USER must be uppercase. Provide the user name and database name, followed by a series of SQL query statements enclosed in curly braces.
Add code blocks for all users who benefit from pre-loaded data.
Ideally, you should use a curated sampling of common queries extracted from log files.
To load a column into GPU memory, use an aggregate function on a specific column. For example, you can use AVG(
columnName
)
for numerical columns, and COUNT(
columnName
)
for non-nullable text columns.
Loading Data Using the Query List
Follow these steps to load the default queries on startup:
Create your query list file and store it on your HEAVY.AI server.
If necessary, stop your HEAVY.AI server.
Restart your server with the following option:
start_heavydb --db-query-list
On startup, your queries are loaded automatically, speeding up defined query results.
You must have ACCESS and SELECT permissions to preload data.
Using Joins
HeavyDB supports relational joins across all scalar column types and with any Boolean binary operator. If you use joins in HEAVY.AI, consider the following to optimize query performance:
Equijoin queries (joins between two columns with an = predicate) are typically executed using an accelerated hash join framework, and therefore perform better than range joins. For example:
performs better than:
For an equijoin query, HeavyDB runs more efficiently if the range of the join key space is small. For example, say you are joining a column defined as BIGINT that contains approximately 1 million reference numbers ranging from 1000000000 to 9000000000. In this instance, HEAVY.AI interprets the range to be 8 billion potential values -- every potential value between 1 billion and 9 billion.
In this case, you want to store these keys as TEXT ENCODING DICT instead. This maps each reference number to a unique identity, and HEAVY.AI recognizes that only 1 million values exist, thereby optimizing the join.
If join performance is slowing your Immerse dashboard, consider using
CREATE TABLE AS SELECT
to materialize a join expression as a new table. You can use the new table in your Immerse charts.In distributed mode, either sharding of both tables or replication of the inner table is required to execute joins.
Sharding can significantly improve performance of join operations in multi-GPU single-node setups if the join is executed on the shard key.
Parallel GPUs
Parsing, optimization, and parts of rendering can overlap between queries, but most of the execution occurs single file. In general, you get the most throughput on the GPU by letting a query have all the resources. Contention is not a concern for buffer or cache memory. If queries are done very quickly, you get low latency, even with many simultaneous queries.
For simple queries on relatively small datasets, consider executing queries on subsets of GPUs. Different GPU groups can execute at the same time. This configuration benefits from parallelizing “fixed overheads” on each query between HEAVY.AI servers on the same node.
You can implement this behavior by running multiple HEAVY.AI servers on the same node and mapping each to different sets of GPUs with the --start-gpu
and --num-gpus
flags (see Configuration file).
CUDA JIT Cache
When your device driver compiles PTX code for an application, it automatically caches a copy of the generated binary code to avoid repeating the compilation in later invocations of the application. The cache — referred to as the compute cache — is automatically invalidated when you upgrade your device driver so that applications can benefit from improvements in the just-in-time compiler built into the device driver.
You can use environment variables to control just-in-time compilation.
Variable
Default
Description
CUDA_CACHE_DISABLE
0
Setting to 1 disables caching (no binary code is added to or retrieved from the cache).
CUDA_CACHE_MAXSIZE
256 MiB
Specifies the size of the compute cache in bytes; the maximum size is 4 GiB. Binary code that exceeds the cache size is not cached. Older binary code is removed to make room for newer code as needed.
CUDA_CACHE_PATH
MacOS: $HOME/Library/Application Support/NVIDIA/ComputeCache
Linux: ~/.nv/ComputeCach
Specifies the directory location of computer cache files. When running HEAVY.AI on a Docker instance, HEAVY.AI recommends that you set this environment variable to $HEAVYAI_STORAGE/NVIDIA/ComputeCache
.
CUDA_FORCE_PTX_JIT
0
Setting to 1 forces the device driver to ignore any binary code embedded in an application and to just-in-time compile embedded PTX code instead. If a kernel does not have embedded PTX code, it will fail to load. You can use this environment variable to confirm that an application binary contains PTX code and that just-in-time compilation works as expected to guarantee forward compatibility with future architectures.