Window Functions
Last updated
Last updated
Window functions allow you to work with a subset of rows related to the currently selected row. For a given dimension, you can find the most associated dimension by some other measure (for example, number of records or sum of revenue).
Window functions must always contain an OVER clause. The OVER clause splits up the rows of the query for processing by the window function.
The PARTITION BY list divides the rows into groups that share the same values of the PARTITION BY expression(s). For each row, the window function is computed using all rows in the same partition as the current row.
Rows that have the same value in the ORDER BY clause are considered peers. The ranking functions give the same answer for any two peer rows.
HeavyDB supports the aggregate functions AVG
, MIN
, MAX
, SUM
, and COUNT
in window functions.
Updates on window functions are supported, assuming the target table is single-fragment. Updates on multi-fragment target tables are not currently supported.
This query shows the top airline carrier for each state, based on the number of departures.
A window function can include a frame clause that specifies a set of neighboring rows of the current row belonging to the same partition. This allows us to compute a window aggregate function over the window frame, instead of computing it against the entire partition. Note that a window frame for the current row is computed based on either 1) the number of rows before or after the current row (called rows mode) or 2) the specified ordering column value in the frame clause (called range mode).
For example:
From the starting row of the partition to the current row: Using the sum
aggregate function, you can compute the running sum of the partition.
You can construct a frame based on the position of the rows (called rows mode): For example, a row before 3 rows and after 2 rows:
You can compute the aggregate function of the frame having up to six rows (including the current row).
You can organize a frame based on the value of the ordering column (called range mode): Assuming C as the current ordering column value, we can compute aggregate value of the window frame which contains rows having ordering column values between (C - 3) and (C + 2).
Window functions that ignore the frame are evaluated on the entire partition.
Note that we can define the window frame clause using rows mode with an ordering column.
You can use the following aggregate functions with the window frame clause.
<frame_mode>
| <frame_bound>
<frame_mode>
can be one of the following:
rows
range
1 | 2 | 3 | 4 | 5.5 | 7.5 | 8 | 9 | 10 → value of a each tuple’s order by expression.
When the current row has a value 5.5:
ROWS BETWEEN 3 PRECEDING and 3 FOLLOWING : 3 rows before and 3 rows after → {2, 3, 4, 5.5, 7.5, 8, 9 }
RANGE BETWEEN 3 PRECEDING and 3 FOLLOWING: 5.5 - 3 <= x <= 5.5 + 3 → { 3, 4, 5.5, 8 }
<frame_bound
>:
frame_start or
frame_between: between frame_start and frame_end
frame_start and frame_end can be one of the following:
UNBOUNDED PRECEDING: The start row of the partition that the current row belongs to.
UNBOUNDED FOLLOWING: The end row of the partition that the current row belongs to.
CURRENT ROW
For rows mode: the current row.
For range mode: the peers of the current row. A peer is a row having the same value as the ordering column expression of the current row. Note that all null values are peers of each other.
expr PRECEDING
For rows mode: expr row before the current row.
For range mode: rows with the current row’s ordering expression value minus expr.
For DATE, TIME, and TIMESTAMP: Use the INTERVAL keyword with a specific time unit, depending on a data type:
TIMESTAMP type: NANOSECOND, MICROSECOND, MILLISECOND, SECOND, MINUTE, HOUR, DAY, MONTH, and YEAR
TIME type: SECOND, MINUTE, and HOUR
DATE type: DAY, MONTH, and YEAR
For example:
RANGE BETWEEN INTERVAL 1 DAY PRECEDING and INTERVAL 3 DAY FOLLOWING
Currently, only literal expressions as expr such as 1 PRECEDING and 100 PRECEDING are supported.
expr FOLLOWING
For rows mode: expr row after the current row.
For range mode: rows with the current row’s ordering expression value plus expr.
For DATE, TIME, and TIMESTAMP: Use the INTERVAL keyword with a specific time unit, depending on a data type:
TIMESTAMP type: NANOSECOND, MICROSECOND, MILLISECOND, SECOND, MINUTE, HOUR, DAY, MONTH, and YEAR
TIME type: SECOND, MINUTE, and HOUR
DATE type: DAY, MONTH, and YEAR
For example:
RANGE BETWEEN INTERVAL 1 DAY PRECEDING and INTERVAL 3 DAY FOLLOWING
Currently, only support literal expression as expr such as 1 FOLLOWING and 100 FOLLOWING are supported.
UNBOUNDED PRECEDING and UNBOUNDED FOLLOWING have the same meaning in both rows and range mode.
When the query has no window frame bound, the window aggregate function is computed differently depending on the existence of the ORDER BY clause:
Has ORDER BY clause: The window function is computed with the default frame bound, which is RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW.
No Order BY clause: The window function is computed over the entire partition.
You can refer to the same window clause in multiple window aggregate functions by defining it with a unique name in the query definition.
For example, you can define the named window clauses W1 and W2 as follows:
Named window function clause w1
refers to a window function clause without a window frame clause, and w2
refers to a named window frame clause.
To use window framing, you may need an ORDER BY clause in the window definition. Depending on the framing mode used, the constraint varies:
Row mode: no restriction of the existence of the ordering column. It also can include multiple ordering columns.
Range mode: only a single ordering column is required (not multi-column ordering).
Currently, all window functions including aggregation over window frame are computed via CPU-mode.
For window frame bound expressions, only non-negative integer literals are supported.
GROUPING mode and EXCLUDING are not currently supported.
Category | Supported Functions |
---|---|
Function
Description
BACKWARD_FILL(value)
Replace the null value by using the nearest non-null value of the value column, using backward search.
For example, for column x,
with the current row r
at the index K
having a NULL value, and assuming column x
has N
rows (where K < N
):
BACKWARD_FILL(x)
searches for the non-NULL value by searching rows with the index starting from K+1
to N
. The NULL value is replaced with the first non-NULL value found.
At least one ordering column must be defined in the window clause.
NULLS FIRST ordering of the input value is added automatically for any user-defined ordering of the input value. For example:
BACKWARD_FILL(x) OVER (PARTITION BY c ORDER BY x)
- No ordering is added; ordering already exists on x
.
BACKWARD_FILL(x) OVER (PARTITION BY c ORDER BY o)
- Ordering is added internally for a consistent query result.
CONDITIONAL_CHANGE_EVENT(expr)
For each partition, a zero-initialized counter is incremented every time the result of expr changes as the expression is evaluated over the partition. Requires an ORDER BY clause for the window.
COUNT_IF(condition_expr)
Aggregate function that can be used as a window function for both a nonframed window partition and a window frame. Returns the number of rows satisfying the given condition_expr
, which must evaluate to a Boolean value (TRUE/FALSE) like x IS NULL
or x > 1
.
CUME_DIST()
Cumulative distribution value of the current row: (number of rows preceding or peers of the current row)/(total rows). Window framing is ignored.
DENSE_RANK()
Rank of the current row without gaps. This function counts peer groups. Window framing is ignored.
FIRST_VALUE(value)
Returns the value from the first row of the window frame (the rows from the start of the partition to the last peer of the current row).
FORWARD_FILL(value)
Replace the null value by using the nearest non-null value of the value column, using forward search.
For example, for column x,
with the current row r
at the index K
having a NULL value, and assuming column x
has N
rows (where K < N
):
FORWARD_FILL(x)
searches for the non-NULL value by searching rows with the index starting from K-1
to 1
. The NULL value is replaced with the first non-NULL value found.
At least one ordering column must be defined in the window clause.
NULLS FIRST ordering of the input value is added automatically for any user-defined ordering of the input value. For example:
FORWARD_FILL(x) OVER (PARTITION BY c ORDER BY x)
- No ordering is added; ordering already exists on x
.
FORWARD_FILL(x) OVER (PARTITION BY c ORDER BY o)
- Ordering is added internally for a consistent query result.
LAG(value, offset)
Returns the value at the row that is offset rows before the current row within the partition. LAG_IN_FRAME
is the window-frame-aware version.
LAST_VALUE(value)
Returns the value from the last row of the window frame.
LEAD(value, offset)
Returns the value at the row that is offset rows after the current row within the partition. LEAD_IN_FRAME
is the window-frame-aware version.
NTH_VALUE(expr,N)
Returns a value of expr
at row N
of the window partition.
NTILE(num_buckets)
Subdivide the partition into buckets. If the total number of rows is divisible by num_buckets, each bucket has a equal number of rows. If the total is not divisible by num_buckets, the function returns groups of two sizes with a difference of 1. Window framing is ignored.
PERCENT_RANK()
Relative rank of the current row: (rank-1)/(total rows-1). Window framing is ignored.
RANK()
Rank of the current row with gaps. Equal to the row_number
of its first peer.
ROW_NUMBER()
Number of the current row within the partition, counting from 1. Window framing is ignored.
SUM_IF(condition_expr)
Aggregate function that can be used as a window function for both a nonframed window partition and a window frame. Returns the sum of all expression values satisfying the given condition_expr
. Applies to numeric data types.
Frame aggregation
MIN(val)
, MAX(val)
, COUNT(val)
, AVG(val)
, SUM(val)
Frame navigation
LEAD_IN_FRAME(value, offset)
LAG_IN_FRAME(value, offset)
FIRST_VALUE_IN_FRAME
LAST_VALUE_IN_FRAME
NTH_VALUE_IN_FRAME
These are window-frame-aware versions of the LEAD
, LAG , FIRST_VALUE, LAST_VALUE, and NTH_VALUE
functions.