# HeavyIQ LLM\_TRANSFORM

**Note:** The LLM\_TRANSFORM operator requires HeavyIQ to be configured, otherwise the operator will error.

LLM\_TRANSFORM can be highly useful for use cases such as the following:

* Data cleanup: for example, standardizing messy date strings into a standardized format that can be casted to a HeavyDB DATE type
* Classification: for example, classifying users from Slack logs as humans or bots, or credit card purchases as food, travel, entertainment, etc
* Sentiment analysis: for example, determining whether the sentiment of tweets is postive, neutral, or negative
* Entity extraction: return all persons/companies/emails etc from a text string
* Limited fact lookup: LLM\_TRANSFORM can be used to add factual information, for example to return the county seat of a county, or the headquarters city of a company. Note that care should be taken to check for hallucinations, particularly for more esoteric facts.

**Syntax**

```sql
LLM_TRANSFORM(
    string_variable_or_expression, 
    'transform_instruction', 
    ['output_choices_or_regex']
)
```

**Parameters**

* **string\_variable\_or\_expression** (mandatory): The input string variable or expression that the LLM will process. This can be any valid string data type or expression.
* **transform\_instruction** (mandatory): A string literal that specifies the transformation instruction for the LLM. This should be a clear and concise command or query directed to the LLM.
* **output\_choices\_or\_regex** (optional): A string literal that can either specify a set of pipe (|) delimited output choices or a regular expression (regex) bounded by forward slashes (/). This parameter constrains the possible outputs of the transformation. Note that for regexes, `^` and `$` special characters to denote the start and end of the string are not supported.

**Example with Output**

{% code overflow="wrap" %}

```sql
heavysql> SELECT messy_date, TRY_CAST(LLM_TRANSFORM(messy_date, 'Extract the date of the event in YYYY-MM-DD format', '/\d{4}-\d{2}-\d{2}/') AS DATE) AS nice_date FROM date_strs;
messy_date|nice_date
November 3, 2008|2008-11-03
2008-11-04|2008-11-04
Nov 5, 2008|2008-11-05
06.11.2008|2008-11-06
7. November 2008|2008-11-07
```

{% endcode %}

**Additional Example Usage**

Basic usage without output constraints

```sql
SELECT
  state_name,
  LLM_TRANSFORM(
    state_name,
    'Return the capital of this US state'
  ) AS state_capital
FROM
  usa_states;
```

Constrain output to a set of discrete choices (must be pipe-delimited)

```sql
SELECT
  LLM_TRANSFORM(
    user_name,
    'Return whether this user is a human or a bot',
    'human|bot'
  ) AS user_type,
  COUNT(*) AS num_users
FROM
  slack_logs
GROUP BY
  user_type
ORDER BY
  num_users DESC;
```

```sql
SELECT
  DATE_TRUNC(DAY, tweet_created) AS "day",
  LLM_TRANSFORM(
    text_,
    'Return the sentiment of this tweet as positive, neutral, or negative',
    'positive|neutral|negative'
  ) AS sentiment,
  COUNT(*) AS num_tweets
FROM
  tweets_airlines
WHERE
  text_ ILIKE '%delta%'
GROUP BY
  "day",
  sentiment
ORDER BY
  "day",
  sentiment;
```

Constrain output to a regex and cast to decimal (regex must be bounded by '/')

```sql
UPDATE
  weather
SET
  cleaned_temperature = TRY_CAST(
    LLM_TRANSFORM(
      temperature_str,
      'Return a formatted decimal string with one decimal point for this temperature, i.e. 81.2',
      '/-?\d+.\d/'
    ) AS DECIMAL(4, 1)
  );
```

Run LLM\_TRANSFORM twice to generate a longitude and latitude pair, each constrained by a regex.

```sql
SELECT
  port_of_unlading,
  TRY_CAST(
    LLM_TRANSFORM(
      port_of_unlading,
      'Return the longitude of this port',
      '/-?\d{1,3}\.\d{4}/'
    ) AS DOUBLE
  ) AS longitude,
  TRY_CAST(
    LLM_TRANSFORM(
      port_of_unlading,
      'Return the latitude of this port',
      '/-?\d{1,2}\.\d{4}/'
    ) AS DOUBLE
  ) AS latitude,
  COUNT(*) AS num_containers
FROM
  bill_of_lading_dec_2014_apr_2018
GROUP BY
  port_of_unlading
ORDER BY
  num_containers DESC;
```

**Restrictions and Performance Notes**

* By default, `LLM_TRANSFORM` is restricted to run on up to 1,000 unique string inputs per operator call. This limit can be changed at HeavyDB startup via the configuration flag `--llm-transform-max-num-unique-value`. Note that if a table had a column containing US state names with only 50 unique values, `LLM_TRANSFORM` will be allowed to run since 50 is lower than the default 1,000 unique string limit.
* The database will only run `LLM_TRANSFORM` once per unique input to allow for performant use of the operator on large tables with low cardinality columns.
* The database will try to push down any filters below an `LLM_TRANSFORM` call, so that `LLM_TRANFORM` needs to be called on the minimal set of inputs possible. Hence, for a table with a billion rows that has other filters (WHERE clauses) that filter the table to 20 rows, `LLM_TRANSFORM` would only be called on the unique strings from those 20 rows. Hence filtering large tables can be a technique to allow `LLM_TRANSFORM` to not hit the limit specified by `--llm-transform-max-num-unique-value`(by default 1,000). Note that we do not yet push down filters if `LLM_TRANSFORM` itself is used as a filter.
* Currently embedded pipe characters (`|`) for output choice constraints, or leading or trailing forward slashes (`/`) for output regex contraints are not allowed as they cannot be escaped, but this is planned to be addressed in a future release.

\\


---

# Agent Instructions: Querying This Documentation

If you need additional information that is not directly available in this page, you can query the documentation dynamically by asking a question.

Perform an HTTP GET request on the current page URL with the `ask` query parameter:

```
GET https://docs.heavy.ai/sql/data-manipulation-dml/heavyiq-llm_transform.md?ask=<question>
```

The question should be specific, self-contained, and written in natural language.
The response will contain a direct answer to the question and relevant excerpts and sources from the documentation.

Use this mechanism when the answer is not explicitly present in the current page, you need clarification or additional context, or you want to retrieve related documentation sections.
