tf_feature_self_similarity

Given a query input of entity keys/IDs (for example, airplane tail numbers), a set of feature columns (for example, airports visited), and a metric column (for example, the number of times each airport was visited), this function scores each pair of entities by their similarity. The score is the cosine similarity of the feature column(s) between each entity pair, optionally weighted by TF-IDF.

select * from table(
  tf_feature_self_similarity(
    primary_features => cursor(
      select
        primary_key,
        pivot_features,
        metric
      from
        table
      group by
        primary_key,
        pivot_features
    ),
    use_tf_idf => <boolean>))
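
The scoring described above can be sketched outside the database. The following Python is a minimal illustration of pairwise cosine similarity over (primary_key, feature, metric) rows, not the function's actual implementation. The helper name feature_self_similarity and the airline rows are hypothetical, and the sketch emits every pair, including pairs with no shared features, which the function itself may or may not do.

```python
from collections import defaultdict
from itertools import combinations_with_replacement
from math import sqrt


def feature_self_similarity(rows):
    """Score every pair of keys from (primary_key, feature, metric) rows."""
    # Build one sparse feature vector per primary key.
    vectors = defaultdict(dict)
    for key, feature, metric in rows:
        vectors[key][feature] = vectors[key].get(feature, 0.0) + float(metric)

    # Euclidean norm of each vector, needed for the cosine denominator.
    norms = {key: sqrt(sum(v * v for v in vec.values()))
             for key, vec in vectors.items()}

    scores = {}
    # Each unordered pair is emitted once, with class1 <= class2.
    for class1, class2 in combinations_with_replacement(sorted(vectors), 2):
        v1, v2 = vectors[class1], vectors[class2]
        dot = sum(weight * v2[f] for f, weight in v1.items() if f in v2)
        denom = norms[class1] * norms[class2]
        scores[(class1, class2)] = dot / denom if denom else 0.0
    return scores


# Hypothetical rows: (airline, departure airport, number of flights).
rows = [
    ("AA", "DFW", 3), ("AA", "ORD", 4),
    ("UA", "DFW", 4), ("UA", "ORD", 3),
    ("WN", "DAL", 5),
]
scores = feature_self_similarity(rows)
```

In this sample, AA and UA depart from the same two airports in different proportions and score 0.96, while AA and WN share no airports and score 0.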

Input Arguments

| Parameter | Description | Data Type |
| --- | --- | --- |
| primary_key | Column containing keys/entity IDs that can be used to uniquely identify the entities for which the function computes co-similarity. Examples include countries, census block groups, user IDs of website visitors, and aircraft callsigns. | Column<TEXT ENCODING DICT \| INT \| BIGINT> |
| pivot_features | One or more columns constituting a compound feature. For example, two columns of visit hour and census block group would compare entities specified by primary_key based on whether they visited the same census block group in the same hour. If a single census block group feature column is used, the primary_key entities would be compared only by the census block groups visited, regardless of time overlap. | Column<TEXT ENCODING DICT \| INT \| BIGINT> |
| metric | Column denoting the values used as input for the cosine similarity metric computation. In many cases, this is COUNT(*) such that feature overlaps are weighted by the number of co-occurrences. | Column<INT \| BIGINT \| FLOAT \| DOUBLE> |
| use_tf_idf | Boolean constant denoting whether TF-IDF weighting should be used in the cosine similarity score computation. | BOOLEAN |
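
To make the use_tf_idf option more concrete, here is a hedged Python sketch of one common TF-IDF weighting applied to the metric values before cosine similarity is computed. The formula used here (metric multiplied by log(N / df), where N is the number of entities and df is the number of entities that have the feature) is an assumption; the exact weighting the function applies is not specified in this section. The helper name tf_idf_weight and the sample rows are hypothetical. A compound feature is modeled as a tuple of column values, so two entities overlap only when every component matches.

```python
from collections import defaultdict
from math import log


def tf_idf_weight(rows):
    """Reweight (primary_key, feature, metric) rows by an assumed TF-IDF formula."""
    rows = list(rows)
    keys = {key for key, _, _ in rows}

    # Document frequency: how many distinct keys have each feature.
    keys_per_feature = defaultdict(set)
    for key, feature, _ in rows:
        keys_per_feature[feature].add(key)

    n = len(keys)
    # Assumed weighting: term frequency (the metric) times log(N / df).
    return [(key, feature, metric * log(n / len(keys_per_feature[feature])))
            for key, feature, metric in rows]


# Hypothetical rows with a compound feature (census block group, visit hour).
rows = [
    ("u1", ("bg1", 9), 2),
    ("u2", ("bg1", 9), 1),
    ("u2", ("bg2", 9), 4),
    ("u3", ("bg2", 10), 1),
]
weighted = tf_idf_weight(rows)
```

Under this assumed formula, a feature shared by every entity receives a weight of 0, so it no longer contributes to the similarity score, while rarer features count for more.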

Output Columns

| Name | Description | Data Type |
| --- | --- | --- |
| class1 | ID of the first primary key in the pair-wise comparison. | Column<TEXT ENCODING DICT \| INT \| BIGINT> (same type as the primary_key input column) |
| class2 | ID of the second primary key in the pair-wise comparison. Because the computed similarity score for a pair of primary keys is order-invariant, results are output only for the ordering in which class1 <= class2. For primary keys of type TEXT ENCODING DICT, the order is based on the internal integer IDs for each string value and not lexicographic ordering. | Column<TEXT ENCODING DICT \| INT \| BIGINT> (same type as the primary_key input column) |
| similarity_score | Computed cosine similarity score between each primary_key pair, with values falling between 0 (completely dissimilar) and 1 (completely similar). | Column<FLOAT> |
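
Because each pair appears only once, and dictionary-encoded keys are ordered by internal ID rather than alphabetically, it is not always obvious which key lands in class1. The Python sketch below illustrates this with made-up dictionary IDs (real IDs are assigned internally by the database) and a hypothetical lookup helper that checks both orderings when fetching the score for a specific pair.

```python
from itertools import combinations_with_replacement

# Hypothetical dictionary IDs; real IDs are assigned internally by the database.
dict_ids = {"United": 0, "American": 1, "Delta": 2}


def ordered_pairs(ids):
    """Return each pair once, ordered by integer ID rather than by string value."""
    keys_by_id = sorted(ids, key=ids.get)
    return list(combinations_with_replacement(keys_by_id, 2))


def lookup(scores, a, b):
    """Fetch a pair's score regardless of which key ended up in class1."""
    return scores.get((a, b), scores.get((b, a)))


pairs = ordered_pairs(dict_ids)
```

With these IDs, the pair appears as ("United", "American") even though "American" sorts first alphabetically, so a filter on a specific pair should test both orderings, as lookup does.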

Example

Similarity scores computed for US airlines for 2008, where similarity is the cosine similarity of the airports each airline departs from, weighted by the number of flights from each airport (using the first example query above, sans LIMIT). Dataset courtesy of the FAA.
