HEAVY.AI Docs v8.1.0

HeavyIQ LLM_TRANSFORM

The LLM_TRANSFORM operator lets you use the HeavyIQ Large Language Model directly from SQL to declaratively specify data transforms in natural language.

Note: The LLM_TRANSFORM operator requires HeavyIQ to be configured; otherwise, the operator returns an error.

LLM_TRANSFORM can be highly useful for use cases such as the following:

  • Data cleanup: for example, standardizing messy date strings into a consistent format that can be cast to a HeavyDB DATE type

  • Classification: for example, classifying users from Slack logs as humans or bots, or credit card purchases as food, travel, entertainment, and so on

  • Sentiment analysis: for example, determining whether the sentiment of tweets is positive, neutral, or negative

  • Entity extraction: for example, returning all persons, companies, or email addresses from a text string (see the sketch after this list)

  • Limited fact lookup: LLM_TRANSFORM can be used to add factual information, for example to return the county seat of a county, or the headquarters city of a company. Note that care should be taken to check for hallucinations, particularly for more esoteric facts.
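
As a sketch of the entity-extraction case, the query below asks the model to pull a company name out of free text. The support_tickets table and its ticket_text column are hypothetical, and the prompt wording is illustrative:

-- Hypothetical table and column; prompt wording is illustrative.
SELECT
  ticket_text,
  LLM_TRANSFORM(
    ticket_text,
    'Return the name of the company mentioned in this text, or NONE if no company is mentioned'
  ) AS company_name
FROM
  support_tickets;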

Syntax

LLM_TRANSFORM(
    string_variable_or_expression, 
    'transform_instruction', 
    ['output_choices_or_regex']
)

Parameters

  • string_variable_or_expression (mandatory): The input string variable or expression that the LLM will process. This can be any valid string data type or expression.

  • transform_instruction (mandatory): A string literal that specifies the transformation instruction for the LLM. This should be a clear and concise command or query directed to the LLM.

  • output_choices_or_regex (optional): A string literal that can either specify a set of pipe (|) delimited output choices or a regular expression (regex) bounded by forward slashes (/). This parameter constrains the possible outputs of the transformation. Note that for regexes, the ^ and $ anchors denoting the start and end of the string are not supported. Both forms are sketched below.
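
A quick sketch of the two forms the third argument can take; the cities and addresses tables and their columns are hypothetical:

-- Pipe-delimited output choices (hypothetical table and column):
SELECT LLM_TRANSFORM(city_name, 'Is this city in Europe? Answer yes or no', 'yes|no') AS in_europe FROM cities;

-- Regex bounded by forward slashes (hypothetical table and column):
SELECT LLM_TRANSFORM(address, 'Extract the 5-digit ZIP code from this address', '/\d{5}/') AS zip FROM addresses;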

Example with Output

heavysql> SELECT messy_date, TRY_CAST(LLM_TRANSFORM(messy_date, 'Extract the date of the event in YYYY-MM-DD format', '/\d{4}-\d{2}-\d{2}/') AS DATE) AS nice_date FROM date_strs;
messy_date|nice_date
November 3, 2008|2008-11-03
2008-11-04|2008-11-04
Nov 5, 2008|2008-11-05
06.11.2008|2008-11-06
7. November 2008|2008-11-07
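
To reproduce this example, a minimal setup might look like the following sketch; the values are taken from the output above:

-- Minimal sketch: create and populate the date_strs table used above.
CREATE TABLE date_strs (messy_date TEXT ENCODING DICT(32));
INSERT INTO date_strs VALUES
  ('November 3, 2008'), ('2008-11-04'), ('Nov 5, 2008'),
  ('06.11.2008'), ('7. November 2008');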

Additional Example Usage

Basic usage without output constraints

SELECT
  state_name,
  LLM_TRANSFORM(
    state_name,
    'Return the capital of this US state'
  ) AS state_capital
FROM
  usa_states;

Constrain output to a set of discrete choices (must be pipe-delimited)

SELECT
  LLM_TRANSFORM(
    user_name,
    'Return whether this user is a human or a bot',
    'human|bot'
  ) AS user_type,
  COUNT(*) AS num_users
FROM
  slack_logs
GROUP BY
  user_type
ORDER BY
  num_users DESC;

SELECT
  DATE_TRUNC(DAY, tweet_created) AS "day",
  LLM_TRANSFORM(
    text_,
    'Return the sentiment of this tweet as positive, neutral, or negative',
    'positive|neutral|negative'
  ) AS sentiment,
  COUNT(*) AS num_tweets
FROM
  tweets_airlines
WHERE
  text_ ILIKE '%delta%'
GROUP BY
  "day",
  sentiment
ORDER BY
  "day",
  sentiment;

Constrain output to a regex and cast to decimal (regex must be bounded by '/')

UPDATE
  weather
SET
  cleaned_temperature = TRY_CAST(
    LLM_TRANSFORM(
      temperature_str,
      'Return a formatted decimal string with one decimal point for this temperature, i.e. 81.2',
      '/-?\d+\.\d/'
    ) AS DECIMAL(4, 1)
  );
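
TRY_CAST returns NULL when the cast fails, so rows where the LLM output did not match the expected pattern can be inspected afterward. A minimal check, assuming the same weather table:

-- Find rows where the transform or cast did not yield a usable value:
SELECT temperature_str FROM weather WHERE cleaned_temperature IS NULL;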

Run LLM_TRANSFORM twice to generate a longitude and latitude pair, each constrained by a regex.

SELECT
  port_of_unlading,
  TRY_CAST(
    LLM_TRANSFORM(
      port_of_unlading,
      'Return the longitude of this port',
      '/-?\d{1,3}\.\d{4}/'
    ) AS DOUBLE
  ) AS longitude,
  TRY_CAST(
    LLM_TRANSFORM(
      port_of_unlading,
      'Return the latitude of this port',
      '/-?\d{1,2}\.\d{4}/'
    ) AS DOUBLE
  ) AS latitude,
  COUNT(*) AS num_containers
FROM
  bill_of_lading_dec_2014_apr_2018
GROUP BY
  port_of_unlading
ORDER BY
  num_containers DESC;

Restrictions and Performance Notes

  • By default, LLM_TRANSFORM is restricted to at most 1,000 unique string inputs per operator call. This limit can be changed at HeavyDB startup via the configuration flag --llm-transform-max-num-unique-value. For example, if a table has a column containing US state names with only 50 unique values, LLM_TRANSFORM is allowed to run, because 50 is below the default limit of 1,000 unique values.

  • The database runs LLM_TRANSFORM only once per unique input, allowing performant use of the operator on large tables with low-cardinality columns.

  • The database tries to push any filters down below an LLM_TRANSFORM call, so that LLM_TRANSFORM is called on the smallest possible set of inputs. For example, for a billion-row table with WHERE clauses that filter it to 20 rows, LLM_TRANSFORM is called only on the unique strings from those 20 rows. Filtering large tables can therefore keep LLM_TRANSFORM under the limit specified by --llm-transform-max-num-unique-value (by default, 1,000); see the sketch after this list. Note that filters are not yet pushed down if LLM_TRANSFORM itself is used as a filter.

  • Currently, embedded pipe characters (|) in output choice constraints and leading or trailing forward slashes (/) in output regex constraints are not allowed, because they cannot be escaped. This is planned to be addressed in a future release.
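
As a sketch of the filter pushdown described above (the deals table and its columns are hypothetical), the WHERE clause below is evaluated first, so LLM_TRANSFORM runs only on the unique company values from the matching rows:

-- Hypothetical table and columns; the filter is applied before
-- LLM_TRANSFORM, keeping the operator under the unique-input limit.
SELECT
  company,
  LLM_TRANSFORM(
    company,
    'Return the headquarters city of this company'
  ) AS hq_city
FROM
  deals
WHERE
  deal_date >= '2024-01-01';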
