Parquet Data Wrapper Reference
This reference lists the data types and configuration supported when using HeavyConnect with the Parquet data format. It covers prerequisites, restrictions, and the supported mappings of HeavyDB column types to Parquet column types.
Parquet Data Wrapper Prerequisites & Restrictions
Type | Applicable to | Requirement or restriction |
Prerequisite | Every column in every row group in every Parquet file | The metadata must exist in a usable form, with statistics that are populated. In particular, the minimum, maximum and null count must all be correctly set. |
Restriction | HeavyDB foreign table max fragment size and every row group in every Parquet file | HeavyDB foreign table’s max fragment size must be greater than or equal in row count to the largest row group in any Parquet file. |
Restriction | Every Parquet file schema | Every Parquet file must have a flat schema (no nested types in the schema definition). The only exception is the mapping of Array types to Parquet lists (which are nested), described below. |
Restriction | Every Parquet file schema | When more than one Parquet file is used, all files must share an identical schema, and that schema must be compatible with the HeavyDB foreign table schema. |
Prerequisite | Every date/time type column in every Parquet file | All date/time type columns are expected to be in UTC or adjusted for UTC. |
Restriction | Every decimal column mapping | Decimal column mappings are required to have the same precision and scale in both the Parquet schema and HeavyDB foreign table schema. |
Restriction | Every numeric/boolean column mapping | Each Parquet numeric/boolean type maps directly to only one HeavyDB type, the one that best represents it. See the mappings detailed in the tables below. Coercion allows additional mappings, which are labelled as such below. Coercion can result in a loss of information; FSI attempts to detect this possibility and reports an error when it does. |
Restriction | Every numeric/boolean column mapping | Coercion that widens a type is disabled because there is no apparent use case for it. |
Restriction | Parquet list columns | Parquet list columns must specify a schema with a max definition level of 3, which means that no node in the list schema is required. Note: the Parquet data wrapper still detects a list schema that contains required nodes, but throws an error indicating that the max definition level is not as expected. |
Restriction | Parquet scalar columns | Parquet (flat) scalar columns must have the OPTIONAL repetition type. REQUIRED is currently not supported. |
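The row-group requirements in the table above can be checked before a file is attached to a foreign table. The following is a minimal sketch under stated assumptions: `ColumnStats`, `RowGroup`, and `preflight_problems` are hypothetical names, and the metadata is modeled with plain Python objects rather than read from a real file. In practice, these values would come from whichever Parquet reader you use.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class ColumnStats:
    """Hypothetical stand-in for the statistics of one column in one row group."""
    name: str
    minimum: Optional[object]
    maximum: Optional[object]
    null_count: Optional[int]


@dataclass
class RowGroup:
    """Hypothetical stand-in for the metadata of one row group."""
    num_rows: int
    columns: List[ColumnStats]


def preflight_problems(row_groups: List[RowGroup], max_fragment_size: int) -> List[str]:
    """Return human-readable problems; an empty list means the checks passed."""
    problems = []
    for index, group in enumerate(row_groups):
        # The fragment size must be at least as large as every row group.
        if group.num_rows > max_fragment_size:
            problems.append(
                f"row group {index}: {group.num_rows} rows exceeds "
                f"max fragment size {max_fragment_size}"
            )
        # Minimum, maximum, and null count must all be populated.
        for col in group.columns:
            missing = [
                label
                for label, value in (
                    ("minimum", col.minimum),
                    ("maximum", col.maximum),
                    ("null count", col.null_count),
                )
                if value is None
            ]
            if missing:
                problems.append(
                    f"row group {index}, column {col.name}: missing {', '.join(missing)}"
                )
    return problems
```

The function collects every problem instead of stopping at the first one, so a single pass over the metadata reports everything that needs fixing in the file.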
HeavyDB to Parquet Data Type Mapping
Numeric and Boolean Types: Table 1
HeavyDB (Down) \ Parquet (Right) | INT(64) | INT(32) | INT(16) | INT(8) |
BIGINT | Yes | No | No | No |
BIGINT ENCODING FIXED (32) / INTEGER | Coercible | Yes | No | No |
BIGINT ENCODING FIXED (16) / INTEGER ENCODING FIXED (16) / SMALLINT | Coercible | Coercible | Yes | No |
BIGINT ENCODING FIXED (8) / INTEGER ENCODING FIXED (8) / SMALLINT ENCODING FIXED (8) / TINYINT | Coercible | Coercible | Coercible | Yes |
DECIMAL (Precision, Scale) | No | No | No | No |
DOUBLE | No | No | No | No |
FLOAT | No | No | No | No |
BOOLEAN | No | No | No | No |
Numeric and Boolean Types: Table 2
HeavyDB (Down) \ Parquet (Right) | Unsigned INT(64) | Unsigned INT(32) | Unsigned INT(16) | Unsigned INT(8) |
BIGINT | Coercible | Yes [1] | No | No |
BIGINT ENCODING FIXED (32) / INTEGER | Coercible | Coercible | Yes [1] | No |
BIGINT ENCODING FIXED (16) / INTEGER ENCODING FIXED (16) / SMALLINT | Coercible | Coercible | Coercible | Yes [1] |
BIGINT ENCODING FIXED (8) / INTEGER ENCODING FIXED (8) / SMALLINT ENCODING FIXED (8) / TINYINT | Coercible | Coercible | Coercible | Coercible |
DECIMAL (Precision, Scale) | No | No | No | No |
DOUBLE | No | No | No | No |
FLOAT | No | No | No | No |
BOOLEAN | No | No | No | No |
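The integer rows of Tables 1 and 2 follow a regular pattern that can be written as a small rule. The sketch below is an illustration derived from the table cells, not HeavyDB code: `integer_mapping` is a hypothetical helper, and `heavydb_bits` is assumed to be the storage width of the HeavyDB column (for example, 32 for BIGINT ENCODING FIXED (32) or INTEGER).

```python
def integer_mapping(heavydb_bits: int, parquet_bits: int, parquet_unsigned: bool) -> str:
    """Classify a Parquet integer to HeavyDB integer mapping.

    Returns "Yes", "Coercible", or "No", reproducing the integer cells
    of the signed and unsigned tables.
    """
    # An unsigned Parquet value needs a signed type one size larger
    # to be stored without loss.
    lossless_bits = parquet_bits * 2 if parquet_unsigned else parquet_bits
    if heavydb_bits == lossless_bits:
        return "Yes"
    if heavydb_bits < lossless_bits:
        # Narrowing: offered only as a checked coercion.
        return "Coercible"
    # Widening beyond the direct mapping is not offered.
    return "No"


print(integer_mapping(64, 32, parquet_unsigned=True))   # Yes
print(integer_mapping(32, 64, parquet_unsigned=False))  # Coercible
print(integer_mapping(64, 16, parquet_unsigned=False))  # No
```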
Numeric and Boolean Types: Table 3
HeavyDB (Down) \ Parquet (Right) | DECIMAL (Precision, Scale) | DOUBLE | FLOAT | BOOLEAN |
BIGINT | No | No | No | No |
BIGINT ENCODING FIXED (32) / INTEGER | No | No | No | No |
BIGINT ENCODING FIXED (16) / INTEGER ENCODING FIXED (16) / SMALLINT | No | No | No | No |
BIGINT ENCODING FIXED (8) / INTEGER ENCODING FIXED (8) / SMALLINT ENCODING FIXED (8) / TINYINT | No | No | No | No |
DECIMAL (Precision, Scale) | Yes if Precision and Scale match; otherwise No | No | No | No |
DOUBLE | No | Yes | No | No |
FLOAT | No | Coercible [2] | Yes | No |
BOOLEAN | No | No | No | Yes |
[1] Unsigned Parquet types must map to HeavyDB signed types one size larger to ensure that no information is lost. For example, Parquet’s Unsigned INT(32) must map to HeavyDB's BIGINT.
[2] Float types use 32 bits, while double types use 64 bits in their representation according to the IEEE standard. There is no check for precision loss when coercion from a double to float is requested. However, there is a check to see if the double fits in the range that float types are capable of representing.
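The two footnotes above come down to range arithmetic, which the following sketch works through. It is illustrative only: `fits_signed`, `double_fits_float_range`, and `to_float32` are hypothetical helpers, the integer check assumes plain two's-complement ranges and ignores any values the database may reserve internally, and nothing here is taken from the HeavyDB implementation.

```python
import struct

# Largest finite IEEE 754 single-precision value (bit pattern 0x7F7FFFFF).
FLOAT32_MAX = struct.unpack("<f", struct.pack("<I", 0x7F7FFFFF))[0]


def fits_signed(value: int, bits: int) -> bool:
    """True if value is representable as a two's-complement signed integer."""
    return -(1 << (bits - 1)) <= value <= (1 << (bits - 1)) - 1


def double_fits_float_range(value: float) -> bool:
    """Range check only; infinities and NaN count as out of range here."""
    return abs(value) <= FLOAT32_MAX


def to_float32(value: float) -> float:
    """Round-trip a double through 32 bits to expose precision loss."""
    return struct.unpack("<f", struct.pack("<f", value))[0]


UINT32_MAX = 2**32 - 1
print(fits_signed(UINT32_MAX, 32))    # False: too large for a 32-bit signed type
print(fits_signed(UINT32_MAX, 64))    # True: fits in a 64-bit signed type
print(double_fits_float_range(1e39))  # False: outside the float range
print(to_float32(0.1) == 0.1)         # False: in range, but precision is lost
```

The last line shows the case footnote [2] warns about: a value can pass the range check and still change when stored in 32 bits.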
Date and Time Types: Table 1
HeavyDB (Down) \ Parquet (Right) | DATE | TIME MILLIS (UTC) | TIME MICROS (UTC) | TIME NANOS (UTC) |
DATE / DATE ENCODING DAYS (32) | Yes | No | No | Coercible |
DATE ENCODING DAYS (16) [1] | Coercible | No | No | Coercible |
TIME [2] | No | Yes | Yes | Yes |
TIME ENCODING FIXED (32) | No | Yes | Coercible | Coercible |
TIMESTAMP (0) [3] | No | No | No | No |
TIMESTAMP (3) | No | No | No | No |
TIMESTAMP (6) | No | No | No | No |
TIMESTAMP (9) | No | No | No | No |
TIMESTAMP ENCODING FIXED (32) [4][5] | No | No | No | No |
Date and Time Types: Table 2
HeavyDB (Down) \ Parquet (Right) | TIMESTAMP MILLIS (UTC) | TIMESTAMP MICROS (UTC) | TIMESTAMP NANOS (UTC) | INT64 | INT32 |
DATE / DATE ENCODING DAYS (32) | Coercible | Coercible | No | No | No |
DATE ENCODING DAYS (16) [1] | Coercible | Coercible | No | No | No |
TIME [2] | No | No | No | No | No |
TIME ENCODING FIXED (32) | No | No | No | No | No |
TIMESTAMP (0) [3] | Yes | Yes | Yes | Coercible | No |
TIMESTAMP (3) | Yes | No | No | Coercible | No |
TIMESTAMP (6) | No | Yes | No | Coercible | No |
TIMESTAMP (9) | No | No | Yes | Coercible | No |
TIMESTAMP ENCODING FIXED (32) [4][5] | Coercible | Coercible | Coercible | Coercible | Coercible |
[1] DATE ENCODING DAYS (16) has no mapping from any Parquet type where the mapping would not result in a potential loss of information. Some mappings are allowed through coercion.
[2] The HeavyDB TIME type represents the number of seconds elapsed in a 24-hour period, while the Parquet TIME type represents a similar quantity but in milli/micro/nanoseconds. To make use of such Parquet columns, this mapping is allowed even though it breaks the convention that only direct mappings are supported. An intermediate transform takes place, calculating the number of seconds elapsed given the number of milli, micro, or nano seconds.
[3] Similar to the HeavyDB TIME type, TIMESTAMP (0) represents the time/date using seconds, which is not compatible with any Parquet TIMESTAMP types. As such, an exception is made for this case to support mapping from all of milli, micro, or nano second Parquet TIMESTAMPs.
[4] Because Parquet’s TIMESTAMP stores the data in a 64-bit representation, and the ENCODING FIXED (32) uses a 32-bit representation, information loss is possible. In these cases, no mapping is supported; however, a coercion is supported.
[5] Timestamps that use a 32-bit representation with any precision other than seconds have a very limited range and are not supported. Second-precision timestamps that use a 32-bit representation have a range of 8:45:53 pm UTC, Friday, December 13, 1901 to 3:14:07 am UTC, Tuesday, January 19, 2038.
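The unit conversion in footnote [2] and the 32-bit range in footnote [5] can be reproduced with a few lines of arithmetic. This is a sketch under stated assumptions: `parquet_time_to_seconds` and `seconds_to_utc` are hypothetical helpers, and discarding the sub-second part by integer division is an assumption, since the text above does not say whether the transform truncates or rounds.

```python
from datetime import datetime, timedelta, timezone

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)

UNITS_PER_SECOND = {
    "millis": 1_000,
    "micros": 1_000_000,
    "nanos": 1_000_000_000,
}


def parquet_time_to_seconds(value: int, unit: str) -> int:
    """Seconds elapsed in the day, with the sub-second part discarded (assumed)."""
    return value // UNITS_PER_SECOND[unit]


def seconds_to_utc(seconds: int) -> datetime:
    """Interpret a count of seconds since the Unix epoch as a UTC datetime."""
    return EPOCH + timedelta(seconds=seconds)


INT32_MAX = 2**31 - 1

print(parquet_time_to_seconds(3_600_000, "millis"))  # 3600
print(seconds_to_utc(-INT32_MAX))  # 1901-12-13 20:45:53+00:00
print(seconds_to_utc(INT32_MAX))   # 2038-01-19 03:14:07+00:00

# The same 32 bits at millisecond precision cover under 25 days on either
# side of the epoch, which illustrates the "very limited range" above.
print(timedelta(milliseconds=INT32_MAX).days)  # 24
```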
String Types
HeavyDB (Down) \ Parquet (Right) | STRING | ENUM (Not Implemented) | UUID (Not Implemented) | BYTE_ARRAY |
TEXT ENCODING DICT | Yes | Yes | Yes | Yes |
TEXT ENCODING (16) | Yes | Yes | Yes | Yes |
TEXT ENCODING (8) | Yes | Yes | Yes | Yes |
TEXT ENCODING NONE | Yes | Yes | Yes | Yes |
GeoTypes
Only geometry types in WKT format are currently supported.
HeavyDB (Down) \ Parquet (Right) | STRING | BYTE_ARRAY |
POINT | Yes | Yes |
MULTIPOINT | Yes | Yes |
LINESTRING | Yes | Yes |
MULTILINESTRING | Yes | Yes |
POLYGON | Yes | Yes |
MULTIPOLYGON | Yes | Yes |
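For orientation, the sketch below shows what a WKT string looks like for each geometry type in the table, together with a light check that a value's leading keyword matches the intended column type. The sample values and the helpers `wkt_geometry_type` and `matches_column_type` are hypothetical, and the check looks only at the keyword; it does not validate the coordinates.

```python
import re

# Example WKT text for each supported geometry type (illustrative values).
SAMPLE_WKT = {
    "POINT": "POINT (30 10)",
    "MULTIPOINT": "MULTIPOINT ((10 40), (40 30))",
    "LINESTRING": "LINESTRING (30 10, 10 30, 40 40)",
    "MULTILINESTRING": "MULTILINESTRING ((10 10, 20 20), (40 40, 30 30))",
    "POLYGON": "POLYGON ((30 10, 40 40, 20 40, 30 10))",
    "MULTIPOLYGON": "MULTIPOLYGON (((30 20, 45 40, 10 40, 30 20)))",
}


def wkt_geometry_type(text: str) -> str:
    """Return the leading WKT keyword in upper case, or '' if there is none."""
    match = re.match(r"\s*([A-Za-z]+)", text)
    return match.group(1).upper() if match else ""


def matches_column_type(text: str, column_type: str) -> bool:
    """True if the WKT keyword equals the HeavyDB geometry column type."""
    return wkt_geometry_type(text) == column_type.upper()


for column_type, text in SAMPLE_WKT.items():
    print(column_type, matches_column_type(text, column_type))
```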
Array Types
The HeavyDB array data type maps to the Parquet list data type.
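The list restriction earlier in this reference requires a max definition level of 3. In Parquet, each node on a column's path that is not REQUIRED adds one to the max definition level, so a three-level list reaches 3 only when none of its nodes is required. The sketch below illustrates that arithmetic; the column name `tags` and the helper `max_definition_level` are hypothetical.

```python
from typing import Sequence

# Hypothetical three-level list layout, written in Parquet schema notation:
#
#   optional group tags (LIST) {
#     repeated group list {
#       optional binary element (STRING);
#     }
#   }


def max_definition_level(path_repetitions: Sequence[str]) -> int:
    """Count the nodes on the column path that are not REQUIRED."""
    return sum(1 for rep in path_repetitions if rep.upper() != "REQUIRED")


# Repetition types from the outer group down to the element.
all_optional = ["OPTIONAL", "REPEATED", "OPTIONAL"]
required_element = ["OPTIONAL", "REPEATED", "REQUIRED"]

print(max_definition_level(all_optional))      # 3
print(max_definition_level(required_element))  # 2
```

The second layout, with a required element, yields 2 instead of 3, which is the kind of schema the list restriction rules out.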