HeavyIQ LLM_TRANSFORM
The LLM_TRANSFORM operator allows you to leverage the power of the HeavyIQ Large Language Model directly from SQL to declaratively specify data transforms using natural language.
Note: The LLM_TRANSFORM operator requires HeavyIQ to be configured, otherwise the operator will error.
LLM_TRANSFORM can be highly useful for use cases such as the following:
Data cleanup: for example, standardizing messy date strings into a standardized format that can be casted to a HeavyDB DATE type
Classification: for example, classifying users from Slack logs as humans or bots, or credit card purchases as food, travel, entertainment, etc
Sentiment analysis: for example, determining whether the sentiment of tweets is postive, neutral, or negative
Entity extraction: return all persons/companies/emails etc from a text string
Limited fact lookup: LLM_TRANSFORM can be used to add factual information, for example to return the county seat of a county, or the headquarters city of a company. Note that care should be taken to check for hallucinations, particularly for more esoteric facts.
Syntax
LLM_TRANSFORM(
    string_variable_or_expression, 
    'transform_instruction', 
    ['output_choices_or_regex']
)Parameters
string_variable_or_expression (mandatory): The input string variable or expression that the LLM will process. This can be any valid string data type or expression.
transform_instruction (mandatory): A string literal that specifies the transformation instruction for the LLM. This should be a clear and concise command or query directed to the LLM.
output_choices_or_regex (optional): A string literal that can either specify a set of pipe (|) delimited output choices or a regular expression (regex) bounded by forward slashes (/). This parameter constrains the possible outputs of the transformation. Note that for regexes,
^and$special characters to denote the start and end of the string are not supported.
Example with Output
heavysql> SELECT messy_date, TRY_CAST(LLM_TRANSFORM(messy_date, 'Extract the date of the event in YYYY-MM-DD format', '/\d{4}-\d{2}-\d{2}/') AS DATE) AS nice_date FROM date_strs;
messy_date|nice_date
November 3, 2008|2008-11-03
2008-11-04|2008-11-04
Nov 5, 2008|2008-11-05
06.11.2008|2008-11-06
7. November 2008|2008-11-07Additional Example Usage
Basic usage without output constraints
SELECT
  state_name,
  LLM_TRANSFORM(
    state_name,
    'Return the capital of this US state'
  ) AS state_capital
FROM
  usa_states;Constrain output to a set of discrete choices (must be pipe-delimited)
SELECT
  LLM_TRANSFORM(
    user_name,
    'Return whether this user is a human or a bot',
    'human|bot'
  ) AS user_type,
  COUNT(*) AS num_users
FROM
  slack_logs
GROUP BY
  user_type
ORDER BY
  num_users DESC;SELECT
  DATE_TRUNC(DAY, tweet_created) AS "day",
  LLM_TRANSFORM(
    text_,
    'Return the sentiment of this tweet as positive, neutral, or negative',
    'positive|neutral|negative'
  ) AS sentiment,
  COUNT(*) AS num_tweets
FROM
  tweets_airlines
WHERE
  text_ ILIKE '%delta%'
GROUP BY
  "day",
  sentiment
ORDER BY
  "day",
  sentiment;Constrain output to a regex and cast to decimal (regex must be bounded by '/')
UPDATE
  weather
SET
  cleaned_temperature = TRY_CAST(
    LLM_TRANSFORM(
      temperature_str,
      'Return a formatted decimal string with one decimal point for this temperature, i.e. 81.2',
      '/-?\d+.\d/'
    ) AS DECIMAL(4, 1)
  );Run LLM_TRANSFORM twice to generate a longitude and latitude pair, each constrained by a regex.
SELECT
  port_of_unlading,
  TRY_CAST(
    LLM_TRANSFORM(
      port_of_unlading,
      'Return the longitude of this port',
      '/-?\d{1,3}\.\d{4}/'
    ) AS DOUBLE
  ) AS longitude,
  TRY_CAST(
    LLM_TRANSFORM(
      port_of_unlading,
      'Return the latitude of this port',
      '/-?\d{1,2}\.\d{4}/'
    ) AS DOUBLE
  ) AS latitude,
  COUNT(*) AS num_containers
FROM
  bill_of_lading_dec_2014_apr_2018
GROUP BY
  port_of_unlading
ORDER BY
  num_containers DESC;Restrictions and Performance Notes
By default,
LLM_TRANSFORMis restricted to run on up to 1,000 unique string inputs per operator call. This limit can be changed at HeavyDB startup via the configuration flag--llm-transform-max-num-unique-value. Note that if a table had a column containing US state names with only 50 unique values,LLM_TRANSFORMwill be allowed to run since 50 is lower than the default 1,000 unique string limit.The database will only run
LLM_TRANSFORMonce per unique input to allow for performant use of the operator on large tables with low cardinality columns.The database will try to push down any filters below an
LLM_TRANSFORMcall, so thatLLM_TRANFORMneeds to be called on the minimal set of inputs possible. Hence, for a table with a billion rows that has other filters (WHERE clauses) that filter the table to 20 rows,LLM_TRANSFORMwould only be called on the unique strings from those 20 rows. Hence filtering large tables can be a technique to allowLLM_TRANSFORMto not hit the limit specified by--llm-transform-max-num-unique-value(by default 1,000). Note that we do not yet push down filters ifLLM_TRANSFORMitself is used as a filter.Currently embedded pipe characters (
|) for output choice constraints, or leading or trailing forward slashes (/) for output regex contraints are not allowed as they cannot be escaped, but this is planned to be addressed in a future release.
\
Last updated