HeavyIQ LLM_TRANSFORM
The LLM_TRANSFORM operator allows you to leverage the power of the HeavyIQ Large Language Model directly from SQL to declaratively specify data transforms using natural language.
Note: The LLM_TRANSFORM operator requires HeavyIQ to be configured, otherwise the operator will error.
LLM_TRANSFORM can be highly useful for use cases such as the following:
Data cleanup: for example, standardizing messy date strings into a standardized format that can be casted to a HeavyDB DATE type
Classification: for example, classifying users from Slack logs as humans or bots, or credit card purchases as food, travel, entertainment, etc
Sentiment analysis: for example, determining whether the sentiment of tweets is postive, neutral, or negative
Entity extraction: return all persons/companies/emails etc from a text string
Limited fact lookup: LLM_TRANSFORM can be used to add factual information, for example to return the county seat of a county, or the headquarters city of a company. Note that care should be taken to check for hallucinations, particularly for more esoteric facts.
Syntax
Parameters
string_variable_or_expression (mandatory): The input string variable or expression that the LLM will process. This can be any valid string data type or expression.
transform_instruction (mandatory): A string literal that specifies the transformation instruction for the LLM. This should be a clear and concise command or query directed to the LLM.
output_choices_or_regex (optional): A string literal that can either specify a set of pipe (|) delimited output choices or a regular expression (regex) bounded by forward slashes (/). This parameter constrains the possible outputs of the transformation. Note that for regexes,
^
and$
special characters to denote the start and end of the string are not supported.
Example with Output
Additional Example Usage
Basic usage without output constraints
Constrain output to a set of discrete choices (must be pipe-delimited)
Constrain output to a regex and cast to decimal (regex must be bounded by '/')
Run LLM_TRANSFORM twice to generate a longitude and latitude pair, each constrained by a regex.
Restrictions and Performance Notes
By default,
LLM_TRANSFORM
is restricted to run on up to 1,000 unique string inputs per operator call. This limit can be changed at HeavyDB startup via the configuration flag--llm-transform-max-num-unique-value
. Note that if a table had a column containing US state names with only 50 unique values,LLM_TRANSFORM
will be allowed to run since 50 is lower than the default 1,000 unique string limit.The database will only run
LLM_TRANSFORM
once per unique input to allow for performant use of the operator on large tables with low cardinality columns.The database will try to push down any filters below an
LLM_TRANSFORM
call, so thatLLM_TRANFORM
needs to be called on the minimal set of inputs possible. Hence, for a table with a billion rows that has other filters (WHERE clauses) that filter the table to 20 rows,LLM_TRANSFORM
would only be called on the unique strings from those 20 rows. Hence filtering large tables can be a technique to allowLLM_TRANSFORM
to not hit the limit specified by--llm-transform-max-num-unique-value
(by default 1,000). Note that we do not yet push down filters ifLLM_TRANSFORM
itself is used as a filter.Currently embedded pipe characters (
|
) for output choice constraints, or leading or trailing forward slashes (/
) for output regex contraints are not allowed as they cannot be escaped, but this is planned to be addressed in a future release.
Last updated