HEAVY.AI Docs
v8.1.0

tf_feature_self_similarity

Given a query input of entity keys/IDs (for example, airplane tail numbers), a set of feature columns (for example, airports visited), and a metric column (for example, the number of times each airport was visited), scores each pair of entities based on their similarity. The score is the cosine similarity between the feature vectors of each entity pair, and can optionally be TF-IDF weighted.

select * from table(
  tf_feature_self_similarity(
    primary_features => cursor(
      select
        primary_key,
        pivot_features,
        metric
      from
        table
      group by
        primary_key,
        pivot_features
    ),
    use_tf_idf => <boolean>))

Input Arguments

| Parameter | Description | Data Type |
| --- | --- | --- |
| primary_key | Column containing keys/entity IDs that uniquely identify the entities for which the function computes co-similarity. Examples include countries, census block groups, user IDs of website visitors, and aircraft callsigns. | Column<TEXT ENCODING DICT \| INT \| BIGINT> |
| pivot_features | One or more columns constituting a compound feature. For example, two columns of visit hour and census block group would compare entities specified by primary_key based on whether they visited the same census block group in the same hour. If a single census block group feature column is used, the primary_key entities would be compared only by the census block groups visited, regardless of time overlap. | Column<TEXT ENCODING DICT \| INT \| BIGINT> |
| metric | Column denoting the values used as input for the cosine similarity metric computation. In many cases, this is COUNT(*), so that feature overlaps are weighted by the number of co-occurrences. | Column<INT \| BIGINT \| FLOAT \| DOUBLE> |
| use_tf_idf | Boolean constant denoting whether TF-IDF weighting should be used in the cosine similarity score computation. | BOOLEAN |
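To make the use_tf_idf option more concrete, the following Python sketch applies one textbook TF-IDF formulation to (primary_key, feature, metric) rows. The exact weighting HeavyDB applies internally is not described on this page, so the formula (metric × ln(N / number of keys having the feature)), the function name tfidf_weights, and the sample rows are all assumptions for illustration only.

```python
import math
from collections import defaultdict

def tfidf_weights(rows):
    """Re-weight (key, feature, metric) rows with a textbook TF-IDF.

    Assumption: weight = metric * ln(N / df), where N is the number of
    distinct keys and df is the number of keys that have the feature.
    HeavyDB's internal formula may differ.
    """
    keys = {key for key, _, _ in rows}
    keys_with_feature = defaultdict(set)
    for key, feature, _ in rows:
        keys_with_feature[feature].add(key)
    n = len(keys)
    return [
        (key, feature, metric * math.log(n / len(keys_with_feature[feature])))
        for key, feature, metric in rows
    ]

# Hypothetical sample: flight counts per (carrier, origin airport).
rows = [
    ("A", "ATL", 10), ("A", "JFK", 5),
    ("B", "ATL", 8),
    ("C", "ATL", 2), ("C", "SFO", 4),
]
weighted = tfidf_weights(rows)
```

In this sketch a feature shared by every entity (ATL) receives a weight of 0 and stops contributing to the similarity score, while features seen for only one entity keep most of their weight. That is the general motivation for TF-IDF weighting: ubiquitous features say little about how alike two entities are.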

Output Columns

| Name | Description | Data Type |
| --- | --- | --- |
| class1 | ID of the first primary key in the pair-wise comparison. | Column<TEXT ENCODING DICT \| INT \| BIGINT> (same type as the primary_key input column) |
| class2 | ID of the second primary key in the pair-wise comparison. Because the computed similarity score for a pair of primary keys is order-invariant, results are output only for orderings such that class1 <= class2. For primary keys of type TextEncodingDict, the order is based on the internal integer IDs for each string value, not on lexicographic ordering. | Column<TEXT ENCODING DICT \| INT \| BIGINT> (same type as the primary_key input column) |
| similarity_score | Computed cosine similarity score between each primary_key pair, with values falling between 0 (completely dissimilar) and 1 (completely similar). | Column<FLOAT> |
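The output described above can be approximated in a few lines of Python. This sketch builds a sparse feature vector per primary key from (primary_key, feature, metric) rows, computes the cosine similarity of each pair, and emits every unordered pair once, self-pairs included. The function name, the sample rows, and the use of Python's native sort order to enforce class1 <= class2 are illustrative assumptions, not HeavyDB's implementation (which, for dictionary-encoded text keys, orders by internal integer IDs).

```python
import math
from collections import defaultdict
from itertools import combinations_with_replacement

def feature_self_similarity(rows):
    """Return (class1, class2, score) for every pair with class1 <= class2."""
    vectors = defaultdict(dict)
    for key, feature, metric in rows:
        # A feature may be a tuple when several pivot columns are combined.
        vectors[key][feature] = vectors[key].get(feature, 0.0) + metric
    norms = {
        key: math.sqrt(sum(w * w for w in vec.values()))
        for key, vec in vectors.items()
    }
    results = []
    for k1, k2 in combinations_with_replacement(sorted(vectors), 2):
        v1, v2 = vectors[k1], vectors[k2]
        # Only features present in both vectors contribute to the dot product.
        dot = sum(w * v2.get(f, 0.0) for f, w in v1.items())
        denom = norms[k1] * norms[k2]
        results.append((k1, k2, dot / denom if denom else 0.0))
    return results

# Hypothetical sample: flight counts per (carrier, origin airport).
rows = [
    ("A", "ATL", 3), ("A", "JFK", 4),
    ("B", "ATL", 3), ("B", "JFK", 4),
    ("C", "SFO", 5),
]
scores = {(a, b): s for a, b, s in feature_self_similarity(rows)}
```

Here carriers A and B use the same airports in the same proportions and score 1.0, while A and C share no airports and score 0.0. Because each entity is also paired with itself, a query that wants only distinct pairs needs a filter such as class1 <> class2.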

Example

/* Compute similarity of airlines by the airports they fly from */

select
  *
from
  table(
    tf_feature_self_similarity(
      primary_features => cursor(
        select
          carrier_name,
          origin,
          count(*) as num_flights
        from
          flights_2008
        group by
          carrier_name,
          origin
      ),
      use_tf_idf => false
    )
  )
where
  similarity_score <= 0.99
order by
  similarity_score desc
limit
  20;
  
class1|class2|similarity_score
Expressjet Airlines|Continental Air Lines|0.9564615
Delta Air Lines|Atlantic Southeast Airlines|0.9436753
Delta Air Lines|AirTran Airways Corporation|0.9379856
Atlantic Southeast Airlines|AirTran Airways Corporation|0.9326661
American Eagle Airlines|American Airlines|0.8906327
Northwest Airlines|Pinnacle Airlines|0.8222722
Skywest Airlines|United Air Lines|0.6857293
Mesa Airlines|US Airways|0.6116939
United Air Lines|Frontier Airlines|0.5921053
Mesa Airlines|United Air Lines|0.5686765
United Air Lines|American Eagle Airlines|0.5272493
Skywest Airlines|Frontier Airlines|0.4684323
Southwest Airlines|US Airways|0.4166781
United Air Lines|American Airlines|0.397027
Comair|JetBlue Airways|0.3631534
Mesa Airlines|American Eagle Airlines|0.3379275
Skywest Airlines|American Eagle Airlines|0.3331468
Mesa Airlines|Skywest Airlines|0.3235496
Comair|Delta Air Lines|0.3075919
Southwest Airlines|Mesa Airlines|0.2901711

/* Compute the similarity of US states by the TF-IDF-weighted
   cosine similarity of the words tweeted in each state */

select
  *
from
  table(
    tf_feature_self_similarity(
      primary_features => cursor(
        select
          state_abbr,
          unnest(tweet_tokens),
          count(*)
        from
          tweets_2022_06
        where country = 'US'
        group by
          state_abbr,
          unnest(tweet_tokens)
      ),
      use_tf_idf => TRUE
    )
  )
where
  class1 <> class2
order by
  similarity_score desc;
  
TX|GA|0.9928479
IL|TN|0.9920474
IL|NC|0.9920027
TX|IL|0.9917723
IN|OH|0.9916649
TN|NC|0.9915619
CA|TX|0.9910875
IN|VA|0.9909871
CA|IL|0.9909689
IL|OH|0.9909481
TX|NC|0.9908867
IL|MO|0.9907863
IN|MI|0.990751
TN|OH|0.9907123
IL|MD|0.9907106
OH|NC|0.9905779
VA|OH|0.990536
IN|IL|0.9904549
IN|MO|0.9903805
TX|TN|0.9903381

use_tf_idf: Boolean constant denoting whether TF-IDF weighting should be used in the cosine similarity score computation.
Figure: Computed similarity scores for US airlines in 2008, where similarity is the cosine similarity of the airports each airline departs from, weighted by the number of flights from each airport (using the first example query above, without the LIMIT clause). Dataset courtesy of the FAA.