Principal Components Analysis

Principal Component Analysis (PCA) is a statistical technique commonly used for dimensionality reduction and data compression. It is an unsupervised technique, and thus requires no labeled training data. It is particularly useful when dealing with high-dimensional data, such as multispectral and hyperspectral data. These types of data are often encountered in fields like remote sensing, image processing, and spectroscopy, where each pixel or data point contains information from multiple bands or wavelengths.

1. Dimensionality Reduction: Multispectral and hyperspectral data typically contain a large number of bands or wavelengths, leading to high-dimensional data. PCA works by transforming the original data into a new set of variables called principal components (PCs). These PCs are linear combinations of the original bands and are sorted in order of their variance. The first PC captures the most significant variance in the data, the second PC captures the second most significant variance, and so on. By selecting a subset of these PCs, you can reduce the dimensionality of the data while retaining most of its important information.

2. Data Compression: PCA can also be used for data compression. By selecting a smaller number of principal components to represent the data, you effectively reduce the amount of storage or memory required to store the data. This is especially useful when dealing with large datasets where storage and processing resources are limited.

3. Data Visualization: PCA can help visualize high-dimensional data in lower dimensions. While original data with numerous bands can be difficult to visualize, projecting the data onto the first few principal components allows you to create scatter plots or graphs that provide insights into data clusters, patterns, and anomalies.

4. Data Classification: When dealing with multispectral and hyperspectral data, PCA can aid in data classification tasks. High-dimensional data can lead to overfitting and computational challenges in classification algorithms. By applying PCA, you can reduce the dimensionality of the data while preserving the most relevant information. This often results in improved classification performance, reduced computational requirements, and better generalization.

Here's how PCA can be used for classification with multispectral and hyperspectral data:

  1. Preprocessing: The first step involves preprocessing the data to remove noise, correct artifacts, and normalize the values across different bands. In particular, the PCA algorithm used in HeavyML does not support NULL data values, so if your dataset contains NULLs you will need to either delete those records or impute values to fill them (see the sketch after this list).

  2. PCA Transformation: The preprocessed data is then subjected to PCA transformation to obtain the principal components. The number of components chosen depends on how much variance you want to retain and the trade-off between dimensionality reduction and information preservation. In most cases we recommend starting with 3 components, since these can be visualized using a scatter plot, with two components as the axes and the third as a color measure. However, if your downstream purpose is analytical, adjust the number of components to retain the required variance.

  3. Training and Classification: The reduced-dimensional data, represented by the selected principal components, is used as input for classification algorithms. These algorithms (e.g., Support Vector Machines, Random Forests, Neural Networks) are trained on labeled data to learn the relationships between the principal components and the classes. For example, with hyperspectral data containing hundreds of bands, it is common practice to apply PCA first and then to apply techniques such as random forest regression to the resulting PCA bands. This can increase the likelihood of a model converging, and is certain to reduce run times and memory use substantially.

  4. Testing and Prediction: Once the classifier is trained, it can be used to classify new, unseen data points based on their reduced-dimensional representations obtained from PCA. This is conceptually similar to categorical classification, but transforms new data into a continuous value along each PCA axis.
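
As a concrete illustration of the preprocessing in step 1, here is a minimal sketch of the two ways to handle NULLs, assuming the example enmap_hyperspectral table and band columns used in the Method section below (the imputed constant is a hypothetical placeholder for a precomputed band mean):

-- Option A: delete any record with a NULL in a band used by the model
DELETE FROM enmap_hyperspectral
WHERE band_1_1 IS NULL OR band_1_2 IS NULL
   OR band_1_3 IS NULL OR band_1_4 IS NULL;

-- Option B: impute a value in place of NULLs (here, a hypothetical precomputed band mean)
UPDATE enmap_hyperspectral
SET band_1_1 = 0.42
WHERE band_1_1 IS NULL;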

By using PCA for dimensionality reduction and subsequent classification, you can effectively handle the challenges posed by high-dimensional multispectral and hyperspectral data, leading to improved classification accuracy and more efficient computation.

Method

There are two steps to using PCA within HeavyML. First you create a model, and then you run predictions using that model.

For example, imagine that we want to visualize hyperspectral data from the EnMAP satellite, which provides up to 224 bands. For brevity, let's build a PCA model on the first four bands:

CREATE MODEL enmap_hyperspectral_pca OF TYPE PCA AS
SELECT band_1_1, band_1_2, band_1_3, band_1_4
FROM enmap_hyperspectral;

The syntax is identical to other HeavyML model creation steps, except that the type is given as PCA. Bands can be specified in any order, as long as the order is consistent between the model creation and prediction steps. If the command above is successful, you can use the following command to verify model creation:

SHOW MODEL DETAILS 'enmap_hyperspectral_pca';

Running predictions requires three types of parameters: the model name, the bands, and the desired PCA component (1..num_bands). For example, for the first four bands:

SELECT raster_lon, raster_lat,
PCA_PROJECT('enmap_hyperspectral_pca', band_1_1, band_1_2, band_1_3, band_1_4, 1),
PCA_PROJECT('enmap_hyperspectral_pca', band_1_1, band_1_2, band_1_3, band_1_4, 2),
PCA_PROJECT('enmap_hyperspectral_pca', band_1_1, band_1_2, band_1_3, band_1_4, 3)
FROM enmap_hyperspectral;

The above command generates the PCA values on the fly. A fragment like the one above could also be used to provide the input data for training any other HeavyML model. You can also persist PCA values by projecting them within a CREATE TABLE AS SELECT statement or by using the UPDATE command, as in the sketch below.
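
For instance, here is a minimal sketch that persists the first two components to a new table, assuming the same model and band columns as above (the table name enmap_pca_scores is a hypothetical example):

-- Persist the first two principal components alongside the raster coordinates
CREATE TABLE enmap_pca_scores AS
SELECT raster_lon, raster_lat,
PCA_PROJECT('enmap_hyperspectral_pca', band_1_1, band_1_2, band_1_3, band_1_4, 1) AS pc_1,
PCA_PROJECT('enmap_hyperspectral_pca', band_1_1, band_1_2, band_1_3, band_1_4, 2) AS pc_2
FROM enmap_hyperspectral;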
