If a join column name or alias is not unique, it must be prefixed by its table name.
You can use BIGINT, INTEGER, SMALLINT, TINYINT, DATE, TIME, TIMESTAMP, or TEXT ENCODING DICT data types. TEXT ENCODING DICT is the most efficient because corresponding dictionary IDs are sequential and span a smaller range than, for example, the 65,535 values supported in a SMALLINT field. Depending on the number of values in your field, you can use TEXT ENCODING DICT(32) (up to approximately 2,150,000,000 distinct values), TEXT ENCODING DICT(16) (up to 64,000 distinct values), or TEXT ENCODING DICT(8) (up to 255 distinct values). For more information, see Data Types and Fixed Encoding.
When possible, joins involving a geospatial operator (such as ST_Contains
) build a binned spatial hash table (overlaps hash join), falling back to a Cartesian loop join if a spatial hash join cannot be constructed.
The enable-overlaps-hashjoin
flag controls whether the system attempts to use the overlaps spatial join strategy (true
by default). If enable-overlaps-hashjoin
is set to false, or if the system cannot build an overlaps hash join table for a geospatial join operator, the system attempts to fall back to a loop join. Loop joins can be performant in situations where one or both join tables have a small number of rows. When both tables grow large, loop join performance decreases.
Two flags control whether or not the system allows loop joins for a query (geospatial for not): allow-loop-joins
and trivial-loop-join-threshold
. By default, allow-loop-joins
is set to false
and trivial-loop-join-threshold
to 1,000 (rows). If allow allow-loop-joins
is set to true
, the system allows any query with a loop join, regardless of table cardinalities (measured in number of rows). If left to the implicit default of false
or set explicitly to false
, the system allows loop join queries as long as the inner table (right-side table) has fewer rows than the threshold specified by trivial-loop-join-threshold
.
For optimal performance, the system should utilize overlaps hash joins whenever possible. Use the following guidelines to maximize the use of the overlaps hash join framework and minimize fallback to loop joins when conducting geospatial joins:
The inner (right-side) table should always be the more complicated primitive. For example, for ST_Contains(polygon, point)
, the point table should be the outer (left) table and the polygon table should be the inner (right) table.
Currently, ST_CONTAINS
and ST_INTERSECTS
joins between point and polygons/multi-polygon tables, and ST_DISTANCE < {distance}
between two point tables are supported for accelerated overlaps hash join queries.
For pointwise-distance joins, only the pattern WHERE ST_DISTANCE(table_a.point_col, table_b.point_col) < distance_in_degrees
supports overlaps hash joins. Patterns like the following fall back to loop joins:
WHERE ST_DWITHIN(table_a.point_col, table_b.point_col, distance_in_degrees)
WHERE ST_DISTANCE(ST_TRANSFORM(table_a.point_col, 900913), ST_TRANSFORM(table_b.point_col, 900913)) < 100
You can create joins in a distributed environment in two ways:
Replicate small dimension tables that are used in the join.
Create a shard key on the column used in the join (note that there is a limit of one shard key per table). If the column involved in the join is a TEXT ENCODED field, you must create a SHARED DICTIONARY that references the FACT table key you are using to make the join.
The join order for one small table and one large table matters. If you swap the sales and customer tables on the join, it throws an exception stating that table "sales" must be replicated.