v7.2.3 (latest)
Search
K

Table Expression and Join Support

<table> , <table> WHERE <column> = <column>
<table> [ LEFT ] JOIN <table> ON <column> = <column>
If a join column name or alias is not unique, it must be prefixed by its table name.
You can use BIGINT, INTEGER, SMALLINT, TINYINT, DATE, TIME, TIMESTAMP, or TEXT ENCODING DICT data types. TEXT ENCODING DICT is the most efficient because corresponding dictionary IDs are sequential and span a smaller range than, for example, the 65,535 values supported in a SMALLINT field. Depending on the number of values in your field, you can use TEXT ENCODING DICT(32) (up to approximately 2,150,000,000 distinct values), TEXT ENCODING DICT(16) (up to 64,000 distinct values), or TEXT ENCODING DICT(8) (up to 255 distinct values). For more information, see Data Types and Fixed Encoding.

Geospatial Joins

When possible, joins involving a geospatial operator (such as ST_Contains) build a binned spatial hash table (overlaps hash join), falling back to a Cartesian loop join if a spatial hash join cannot be constructed.
The enable-overlaps-hashjoin flag controls whether the system attempts to use the overlaps spatial join strategy (true by default). If enable-overlaps-hashjoin is set to false, or if the system cannot build an overlaps hash join table for a geospatial join operator, the system attempts to fall back to a loop join. Loop joins can be performant in situations where one or both join tables have a small number of rows. When both tables grow large, loop join performance decreases.
Two flags control whether or not the system allows loop joins for a query (geospatial for not): allow-loop-joins and trivial-loop-join-threshold. By default, allow-loop-joins is set to false and trivial-loop-join-threshold to 1,000 (rows). If allow allow-loop-joins is set to true, the system allows any query with a loop join, regardless of table cardinalities (measured in number of rows). If left to the implicit default of false or set explicitly to false, the system allows loop join queries as long as the inner table (right-side table) has fewer rows than the threshold specified by trivial-loop-join-threshold.
For optimal performance, the system should utilize overlaps hash joins whenever possible. Use the following guidelines to maximize the use of the overlaps hash join framework and minimize fallback to loop joins when conducting geospatial joins:
  • The inner (right-side) table should always be the more complicated primitive. For example, for ST_Contains(polygon, point), the point table should be the outer (left) table and the polygon table should be the inner (right) table.
  • Currently, ST_CONTAINS and ST_INTERSECTS joins between point and polygons/multi-polygon tables, and ST_DISTANCE < {distance} between two point tables are supported for accelerated overlaps hash join queries.
  • For pointwise-distance joins, only the pattern WHERE ST_DISTANCE(table_a.point_col, table_b.point_col) < distance_in_degrees supports overlaps hash joins. Patterns like the following fall back to loop joins:
    • WHERE ST_DWITHIN(table_a.point_col, table_b.point_col, distance_in_degrees)
    • WHERE ST_DISTANCE(ST_TRANSFORM(table_a.point_col, 900913), ST_TRANSFORM(table_b.point_col, 900913)) < 100

Using Joins in a Distributed Environment

You can create joins in a distributed environment in two ways:
  • Replicate small dimension tables that are used in the join.
  • Create a shard key on the column used in the join (note that there is a limit of one shard key per table). If the column involved in the join is a TEXT ENCODED field, you must create a SHARED DICTIONARY that references the FACT table key you are using to make the join.
1 small table, 1 large
2 large tables joined with shard key
2 large tables joined on TEXT ENCODED column
# Table customers is very small
CREATE TABLE sales (
id INTEGER,
customerid TEXT ENCODING DICT(32),
saledate DATE ENCODING DAYS(32),
saleamt DOUBLE);
CREATE TABLE customers (
id TEXT ENCODING DICT(32),
someid INTEGER,
name TEXT ENCODING DICT(32))
WITH (partitions = 'replicated') #this causes the entire contents of this table to be replicated to each leaf node. Only recommened for small dimension tables.
SELECT c.id, c.name from sales s inner join customers c on c.id = s.customerid limit 10;
CREATE TABLE sales (
id INTEGER,
customerid BIGINT, #note the numeric datatype, so we don't need to specify a shared dictionary on the customer table
saledate DATE ENCODING DAYS(32),
saleamt DOUBLE,
SHARD KEY (customerid))
WITH (SHARD_COUNT = <num gpus in cluster>)
CREATE TABLE customers (
id TEXT BIGINT,
someid INTEGER,
name TEXT ENCODING DICT(32)
SHARD KEY (id))
WITH (SHARD_COUNT=<num gpus in cluster>);
SELECT c.id, c.name FROM sales s INNER JOIN customers c ON c.id = s.customerid LIMIT 10;
CREATE TABLE sales (
id INTEGER,
customerid TEXT ENCODING DICT(32),
saledate DATE ENCODING DAYS(32),
saleamt DOUBLE,
SHARD KEY (customerid))
WITH (SHARD_COUNT = <num gpus in cluster>)
#note the difference when customerid is a text encoded field:
CREATE TABLE customers (
id TEXT,
someid INTEGER,
name TEXT ENCODING DICT(32),
SHARD KEY (id),
SHARED DICTIONARY (id) REFERENCES sales(customerid))
WITH (SHARD_COUNT = <num gpus in cluster>)
SELECT c.id, c.name FROM sales s INNER JOIN customers c ON c.id = s.customerid LIMIT 10;
The join order for one small table and one large table matters. If you swap the sales and customer tables on the join, it throws an exception stating that table "sales" must be replicated.
Last modified 1yr ago