1 of 1

HeavyIQ Model Overview (HeavyLM)

Intro

HeavyLM is the custom-trained Large Language Model (LLM) that is at the heart of the HeavyIQ Conversational Analytics module.

Below is a brief overview of the model’s capabilities as well as the training process we use to create it. You will also find a guide to deploying the model at the end of this document.

Model Overview

HeavyLM leverages Meta’s Llama 3.1 70B as the base model, applying over 70,000 training examples to make the model highly proficient in text-to-SQL as well as general analytics tasks such that it can be called for ETL-like workloads from the HeavyDB database. The training pipeline is such that the base model can be changed as needed, for example to move to more advanced models as they are released.

Base Model

As mentioned above, currently HeavyIQ is built on top of the Llama 3.1 70B open weights foundation model from Meta, a state-of-the-art open weights foundation model trained on over 16.4 trillion tokens, including extensive coding and multilingual data. Compared to its predecessor Llama 3, the context length (the number of tokens the model can fit inside its attention window and hence in a single prompt) was extended from 8,000 to 128,000 tokens, allowing significantly longer prompts and dialogues with the model than previously.

Fine-tuning

We conduct supervised fine tuning (SFT) of the base foundation model using over 70,000 fine tuning pairs to help instill in the model proficiency for data and analytics tasks, focused around but not limited to text-to-SQL.

The core text-to-SQL examples consist of both open training datasets like Spider and Bird, as well as many thousands of custom examples, many focused on core spatiotemporal workflows like geo-enrichment via geo-joins. Whether from open public or privately created datasets, we modify all SQL examples to confirm to HeavyDB specific SQL syntax as well, following best practices to maximize performance. All training examples are generated with live metadata from a HeavyDB database instance, with the format matching what will be seen at inference to maximize accuracy.

In addition to the core text-to-SQL training data, we train with significant auxiliary training examples to support the text-to-SQL use case. This includes training data for table selection given a prompt (“text-to-tables”), SQL error correction, using actual syntax errors generated by the database to teach the model how to fix improper SQL queries, and SQL accuracy judging, which can be used to improve inference time accuracy.

Finally, a significant number of training examples are given of a selection of tasks useful to call from HeavyDB SQL directly via the LLM_TRANSFORM operator. These tasks include classification (i.e. whether a username is likely a human or a bot), date string correction, sentiment analysis, named entity extraction, and generic fact recall (i.e. “What is the Capital of a given state?”). By training on these use cases, HeavyIQ becomes highly useful for fast in-database ETL (“Extract-Transform-Load”) workloads, where data can be enriched or corrected using the power of the LLM.

The training pipeline can either use full fine-tuning, or LoRA (Low Rank Adaptation), depending on compute and memory availability. It also optionally leverages DPO (Direct Preference Optimization), form of reinforcement learning that can deliver small additional accuracy gains.

Quantization

We provide the HeavyLM model in two sizes/compression levels, a version with the full 16-bit weights that requires at least 192GB of GPU VRAM to run, or a quantized 4-bit version of the same model that needs only 48GB. The smaller model requires much less hardware to run, and generally is faster in inference, with only small losses of accuracy (described below in the Benchmarks section). Hence we generally recommend our customers run the much smaller 4-bit version of the model, unless the small gains in accuracy from the full-sized model are seen as worth the tradeoff in terms of extra hardware required.

Benchmarks

On the standard Spider evaluation dataset, which the model is not trained on, the full 16-bit weight version of the model achieves a near state-of-the art 90.75% accuracy rating, while the 4-bit version of the model yields 88.95% accuracy. It should be noted that on hard problems, a larger difference may be noticed than this small accuracy difference suggests, due to the Spider benchmark consisting of a large number of easy or medium difficulty questions (with some questions that are ambiguous or otherwise difficult for any LLM or even human to ace).

Deployment

We recommend deploying HeavyLM using vLLM, a well-known and widely used open source inference engine that is highly optimized for high performance serving of large language models on Nvidia GPU hardware.

When calculating the amount of GPU memory (VRAM) necessary to deploy a model, there are a handful of key factors that come into play.

The number of parameters in the model. Here, all HeavyLM models use the Llama 3.1 70B model as a backbone, so the number of parameters for our calculations is 70 billion.
The quantization: the native un-compressed/un-quantized version of the model uses full 16-bit (2 bytes) precision per weight, which means that the full model requires 70 billion X 2 bytes = 140GB to just store the weights. On the other hand, the quantized (compressed) version of the model uses 4-bit (½ byte) weights, meaning that the compressed model requires only 70 billion X 0.5 bytes = 35GB to store the weights.
Context length: The context length defines the total number of input and output tokens the model can pay attention to (process) at once, basically equivalent to its window of memory. For use cases involving many tables with lots of columns, supporting longer context lengths becomes important. While Llama 3.1 natively supports a full 128K context length, we do not recommend running such long context lengths in production to the high amount of memory that would be required. Generally speaking, we generally recommend a context length of 16,384 tokens, although for many use cases 8,192 tokens will suffice. Note that the longer the context length, the more GPU memory (VRAM) the model will require to run.

The final thing to note is that vLLM requires that the model be either sharded such that there are an equal number of layers on each GPU. So this means that across 1,2,4 or 8 GPUs (i.e. 3,5,6, or 7 GPUs) will not work

Generally speaking, the quantized 4-bit model will fit on a single 48GB GPU with context length of 16,384 tokens (more memory available will allow for longer context lengths), while 192GB of VRAM is recommended for deployment of the un-quantized 16-bit model.

Docker Compose Deployment

This section explains the multi-service Docker Compose configuration that deploys several AI model services for text-to-sql inference, text embedding, and audio transcription for the HeavyIQ Conversational Analytics module. To serve the main HeavyLM model with vLLM, a Hugging Face token with granted read access to HeavyAI’s gated models is required. HeavyAI will approve user access on a case by case basis.d

Example Deployment with Docker Compose:

services:
  vllm:
    container_name: vllm
    image: vllm/vllm-openai:v0.6.2
      # image: vllm/vllm-openai:latest
    command:
      - --model=heavyai/heavyiq-llama-3-1-70b-combo-v61-5-judge-3584-tokens-q4a16
      - --trust-remote-code
      - --enable-prefix-caching
      - --dtype=bfloat16
      - --port=5000
      - --gpu-memory-utilization=0.87
      - --max-model-len=32768
      - --tensor-parallel-size=2 #can be used for parallel usage across GPU's
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              device_ids: ['0','1']
              capabilities: [gpu]
    runtime: nvidia
    volumes:
      - ~/.cache/huggingface:/root/.cache/huggingface
    environment:
      - HUGGING_FACE_HUB_TOKEN=YOUR_HUGGINGFACE_TOKEN_HERE
      - VLLM_ALLOW_DEPRECATED_BEAM_SEARCH=1
    ports:
      - "5000:5000"
    ipc: host


  embeddings:
    container_name: embeddings
    image: ghcr.io/huggingface/text-embeddings-inference:1.3.0
    runtime: nvidia
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              device_ids: ['1']
              capabilities: [gpu]
    ports:
      - "5001:80"
    volumes:
      - ./embeddings_data:/data
    command: [
      "--model-id", "Alibaba-NLP/gte-large-en-v1.5",
      "--revision", "main"
    ]


  vllm_embeddings:
    container_name: vllm_embeddings
    image: vllm/vllm-cpu-env
    command:
      - --model=intfloat/e5-mistral-7b-instruct
      - --port=8001
    volumes:
      - ~/.cache/huggingface:/root/.cache/huggingface
    environment:
      - HUGGING_FACE_HUB_TOKEN=YOUR_HUGGINGFACE_TOKEN_HERE
    ports:
      - "5003:8001"
    ipc: host

  whisper:
    container_name: whisper
    image: whisper_server
    ports:
      - "5002:8080"
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              device_ids: ['1']
              capabilities: [gpu]
    runtime: nvidia

*There are many alternative embedding models available on the Hugging Face Hub besides the Alibaba model. Other embedding models may be used, as long as the same model is used both for embedding guidance snippets and for retrieval of those snippets.