Given a query input of entity keys, feature columns, and a metric column, and a second query input specifying a search vector of feature columns and metric, computes the similarity of each entity in the first input to the search vector based on their similarity. The score is computed as the cosine similarity of the feature column(s) for each entity with the feature column(s) for the search vector, which can optionally be TF/IDF weighted.
SELECT
*
FROM
TABLE(
tf_feature_similarity(
primary_features => CURSOR(
SELECT
primary_key,
pivot_features,
metric
from
table
where
...
group by
primary_key,
pivot_features
),
comparison_features => CURSOR(
SELECT
comparison_metric
from
table
where
...
group by <column>
),
use_tf_idf => <boolean>
)
)
Input Arguments
Output Columns
Example
/* Compute the similarity of US airline flight nums to a particular
Delta flight (DAL795) based on the cosine similarity of the overlap of
flight paths binned to a H3 Hex at zoom level 7 (roughly 5 sq km),
and return the top 10 most similar flight nums */
SELECT
*
FROM
TABLE(
tf_feature_similarity(
primary_features => CURSOR(
SELECT
callsign,
geotoh3(st_x(location), st_y(location), 7) as h3,
count(*) as n
from
adsb_2021_03_01
where
operator in (
'Delta Air Lines',
'Alaska Airlines',
'Southwest Airlines',
'American Airlines',
'United Airlines'
)
and altitude >= 1000
group by
callsign,
h3
),
comparison_features => CURSOR(
SELECT
geotoh3(st_x(location), st_y(location), 7) as h3,
COUNT(*) as n
from
adsb_2021_03_01
where
callsign = 'DAL795'
and altitude >= 1000
group by
h3
),
use_tf_idf => false
)
)
ORDER BY
similarity_score desc
limit
10;
class|similarity_score
DAL795|1
DAL538|0.610889
DAL1192|0.3419932
DAL1185|0.3391671
SWA4346|0.3206964
DAL365|0.3037131
SWA953|0.2912168
UAL1559|0.2747431
SWA2098|0.2511763
DAL526|0.2473387
Column containing keys/entity IDs that can be used to uniquely identify the entities for which the function will compute the similarity to the search vector specified by the comparison_features cursor. Examples include countries, census block groups, user IDs of website visitors, and aircraft call signs.
Column<TEXT ENCODING DICT | INT | BIGINT>
pivot_features
One or more columns constituting a compound feature. For example, two columns of visit hour and census block group would compare entities specified by primary_key based on whether they visited the same census block group in the same hour. If a single census block group feature column is used, the primary_key entities are compared only by the census block groups visited, regardless of time overlap.
Column<TEXT ENCODING DICT | INT | BIGINT>
metric
Column denoting the values used as input for the cosine similarity metric computation. In many cases, this is simply COUNT(*) such that feature overlaps are weighted by the number of co-occurrences.
Column<INT | BIGINT | FLOAT | DOUBLE>
comparison_pivot_features
One or more columns constituting a compound feature for the search vector. This should match in number of sub-features, types, and semantics pivot features.
Column<TEXT ENCODING DICT | INT | BIGINT>
comparison_metric
Column denoting the values used as input for the cosine similarity metric computation from the search vector. In many cases, this is simply COUNT(*) such that feature overlaps are weighted by the number of co-occurrences.
Column<TEXT ENCODING DICT | INT | BIGINT>
use_tf_idf
Boolean constant denoting whether TF-IDF weighting should be used in the cosine similarity score computation.
BOOLEAN
class
ID of the primary key being compared against the search vector.
Column<TEXT ENCODING DICT | INT | BIGINT> (type will be the same of primary_key input column)
similarity_score
Computed cosine similarity score between each primary_key pair, with values falling between 0 (completely dissimilar) and 1 (completely similar).