HeavyIQ LLM_TRANSFORM
The LLM_TRANSFORM operator allows you to leverage the power of the HeavyIQ Large Language Model directly from SQL to declaratively specify data transforms using natural language.
Note: The LLM_TRANSFORM operator requires HeavyIQ to be configured, otherwise the operator will error.
LLM_TRANSFORM can be highly useful for use cases such as the following:
Data cleanup: for example, standardizing messy date strings into a standardized format that can be casted to a HeavyDB DATE type
Classification: for example, classifying users from Slack logs as humans or bots, or credit card purchases as food, travel, entertainment, etc
Sentiment analysis: for example, determining whether the sentiment of tweets is postive, neutral, or negative
Entity extraction: return all persons/companies/emails etc from a text string
Limited fact lookup: LLM_TRANSFORM can be used to add factual information, for example to return the county seat of a county, or the headquarters city of a company. Note that care should be taken to check for hallucinations, particularly for more esoteric facts.
Syntax
LLM_TRANSFORM(
string_variable_or_expression,
'transform_instruction',
['output_choices_or_regex']
)
Parameters
string_variable_or_expression (mandatory): The input string variable or expression that the LLM will process. This can be any valid string data type or expression.
transform_instruction (mandatory): A string literal that specifies the transformation instruction for the LLM. This should be a clear and concise command or query directed to the LLM.
output_choices_or_regex (optional): A string literal that can either specify a set of pipe (|) delimited output choices or a regular expression (regex) bounded by forward slashes (/). This parameter constrains the possible outputs of the transformation. Note that for regexes,
^
and$
special characters to denote the start and end of the string are not supported.
Example with Output
heavysql> SELECT messy_date, TRY_CAST(LLM_TRANSFORM(messy_date, 'Extract the date of the event in YYYY-MM-DD format', '/\d{4}-\d{2}-\d{2}/') AS DATE) AS nice_date FROM date_strs;
messy_date|nice_date
November 3, 2008|2008-11-03
2008-11-04|2008-11-04
Nov 5, 2008|2008-11-05
06.11.2008|2008-11-06
7. November 2008|2008-11-07
Additional Example Usage
Basic usage without output constraints
SELECT
state_name,
LLM_TRANSFORM(
state_name,
'Return the capital of this US state'
) AS state_capital
FROM
usa_states;
Constrain output to a set of discrete choices (must be pipe-delimited)
SELECT
LLM_TRANSFORM(
user_name,
'Return whether this user is a human or a bot',
'human|bot'
) AS user_type,
COUNT(*) AS num_users
FROM
slack_logs
GROUP BY
user_type
ORDER BY
num_users DESC;
SELECT
DATE_TRUNC(DAY, tweet_created) AS "day",
LLM_TRANSFORM(
text_,
'Return the sentiment of this tweet as positive, neutral, or negative',
'positive|neutral|negative'
) AS sentiment,
COUNT(*) AS num_tweets
FROM
tweets_airlines
WHERE
text_ ILIKE '%delta%'
GROUP BY
"day",
sentiment
ORDER BY
"day",
sentiment;
Constrain output to a regex and cast to decimal (regex must be bounded by '/')
UPDATE
weather
SET
cleaned_temperature = TRY_CAST(
LLM_TRANSFORM(
temperature_str,
'Return a formatted decimal string with one decimal point for this temperature, i.e. 81.2',
'/-?\d+.\d/'
) AS DECIMAL(4, 1)
);
Run LLM_TRANSFORM twice to generate a longitude and latitude pair, each constrained by a regex.
SELECT
port_of_unlading,
TRY_CAST(
LLM_TRANSFORM(
port_of_unlading,
'Return the longitude of this port',
'/-?\d{1,3}\.\d{4}/'
) AS DOUBLE
) AS longitude,
TRY_CAST(
LLM_TRANSFORM(
port_of_unlading,
'Return the latitude of this port',
'/-?\d{1,2}\.\d{4}/'
) AS DOUBLE
) AS latitude,
COUNT(*) AS num_containers
FROM
bill_of_lading_dec_2014_apr_2018
GROUP BY
port_of_unlading
ORDER BY
num_containers DESC;
Restrictions and Performance Notes
By default,
LLM_TRANSFORM
is restricted to run on up to 1,000 unique string inputs per operator call. This limit can be changed at HeavyDB startup via the configuration flag--llm-transform-max-num-unique-value
. Note that if a table had a column containing US state names with only 50 unique values,LLM_TRANSFORM
will be allowed to run since 50 is lower than the default 1,000 unique string limit.The database will only run
LLM_TRANSFORM
once per unique input to allow for performant use of the operator on large tables with low cardinality columns.The database will try to push down any filters below an
LLM_TRANSFORM
call, so thatLLM_TRANFORM
needs to be called on the minimal set of inputs possible. Hence, for a table with a billion rows that has other filters (WHERE clauses) that filter the table to 20 rows,LLM_TRANSFORM
would only be called on the unique strings from those 20 rows. Hence filtering large tables can be a technique to allowLLM_TRANSFORM
to not hit the limit specified by--llm-transform-max-num-unique-value
(by default 1,000). Note that we do not yet push down filters ifLLM_TRANSFORM
itself is used as a filter.Currently embedded pipe characters (
|
) for output choice constraints, or leading or trailing forward slashes (/
) for output regex contraints are not allowed as they cannot be escaped, but this is planned to be addressed in a future release.
Last updated