Parquet Data Wrapper Reference
The following are the supported data types and configuration for use with HeavyConnect and the Parquet data format. This reference outlines prerequisites, restrictions, and supported mappings of HeavyDB column types to Parquet column types.
Parquet Data Wrapper Prerequisites & Restrictions
| Type | Applicable to | Requirement or restriction |
| --- | --- | --- |
| Prerequisite | Every column in every row group in every Parquet file | The metadata must exist in a usable form, with populated statistics. In particular, the minimum, maximum, and null count must all be set correctly. |
| Restriction | HeavyDB foreign table max fragment size and every row group in every Parquet file | The HeavyDB foreign table’s max fragment size must be greater than or equal to the row count of the largest row group in any Parquet file. |
| Restriction | Every Parquet file schema | Every Parquet file must have a flat schema (no nested types in the schema definition). The only exception is the mapping of Array types to Parquet lists (which are nested), described below. |
| Restriction | Every Parquet file schema | When more than one Parquet file is used, all file schemas must be identical to one another and compatible with the HeavyDB foreign table schema. |
| Prerequisite | Every date/time type column in every Parquet file | All date/time type columns are expected to be in UTC or adjusted to UTC. |
| Restriction | Every decimal column mapping | Decimal column mappings must have the same precision and scale in both the Parquet schema and the HeavyDB foreign table schema. |
| Restriction | Every numeric/boolean column mapping | Each Parquet numeric/boolean type maps directly to only one HeavyDB type, the one that best represents it. See the mapping tables below. Coercion allows additional mappings, which are labelled as such in the tables. Coercion can result in a loss of information; FSI attempts to detect this possibility and reports an error upon detection. |
| Restriction | Every numeric/boolean column mapping | Coercion to wider types is disabled because there is no apparent use case for it. |
| Restriction | Parquet list columns | Parquet list columns must specify a schema with a max definition level of 3, which means that no node in the list schema may be REQUIRED. Note: if a node is REQUIRED, the Parquet data wrapper still detects the list schema but throws an error indicating that the max definition level is not as expected. |
| Restriction | Parquet scalar columns | Parquet (flat) scalar columns must have the OPTIONAL repetition type. REQUIRED is currently not supported. |
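The statistics prerequisite and the fragment size restriction can be checked before a foreign table is created. The following is a minimal sketch only: `ColumnChunkStats`, `RowGroup`, and `find_violations` are hypothetical names introduced for illustration, and in practice the values would be read from the Parquet file metadata with a Parquet reader library rather than built by hand.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class ColumnChunkStats:
    """Hypothetical stand-in for the statistics of one column chunk."""
    column: str
    has_min_max: bool
    null_count: Optional[int]  # None when the writer did not record it


@dataclass
class RowGroup:
    """Hypothetical stand-in for the metadata of one row group."""
    num_rows: int
    columns: List[ColumnChunkStats]


def find_violations(row_groups: List[RowGroup], max_fragment_size: int) -> List[str]:
    """List row groups that break the fragment size or statistics rules."""
    problems = []
    for index, group in enumerate(row_groups):
        # The fragment size must be >= the row count of every row group.
        if group.num_rows > max_fragment_size:
            problems.append(
                f"row group {index}: {group.num_rows} rows exceed "
                f"fragment size {max_fragment_size}"
            )
        # Minimum, maximum, and null count must be set for every column.
        for chunk in group.columns:
            if not chunk.has_min_max or chunk.null_count is None:
                problems.append(
                    f"row group {index}, column {chunk.column}: "
                    "statistics are incomplete"
                )
    return problems


groups = [
    RowGroup(1000, [ColumnChunkStats("id", True, 0)]),
    RowGroup(5000, [ColumnChunkStats("id", False, None)]),
]
for problem in find_violations(groups, max_fragment_size=2000):
    print(problem)
```

With these sample values, only the second row group is reported: once for its row count and once for its missing statistics.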
HeavyDB to Parquet Data Type Mapping
Numeric and Boolean Types: Table 1
| HeavyDB (Down) \ Parquet (Right) | INT(64) | INT(32) | INT(16) | INT(8) |
| --- | --- | --- | --- | --- |
| BIGINT | Yes | No | No | No |
| BIGINT ENCODING FIXED (32) / INTEGER | Coercible | Yes | No | No |
| BIGINT ENCODING FIXED (16) / INTEGER ENCODING FIXED (16) / SMALLINT | Coercible | Coercible | Yes | No |
| BIGINT ENCODING FIXED (8) / INTEGER ENCODING FIXED (8) / SMALLINT ENCODING FIXED (8) / TINYINT | Coercible | Coercible | Coercible | Yes |
| DECIMAL (Precision, Scale) | No | No | No | No |
| DOUBLE | No | No | No | No |
| FLOAT | No | No | No | No |
| BOOLEAN | No | No | No | No |
Numeric and Boolean Types: Table 2
| HeavyDB (Down) \ Parquet (Right) | Unsigned INT(64) | Unsigned INT(32) | Unsigned INT(16) | Unsigned INT(8) |
| --- | --- | --- | --- | --- |
| BIGINT | Coercible | Yes [1] | No | No |
| BIGINT ENCODING FIXED (32) / INTEGER | Coercible | Coercible | Yes [1] | No |
| BIGINT ENCODING FIXED (16) / INTEGER ENCODING FIXED (16) / SMALLINT | Coercible | Coercible | Coercible | Yes [1] |
| BIGINT ENCODING FIXED (8) / INTEGER ENCODING FIXED (8) / SMALLINT ENCODING FIXED (8) / TINYINT | Coercible | Coercible | Coercible | Coercible |
| DECIMAL (Precision, Scale) | No | No | No | No |
| DOUBLE | No | No | No | No |
| FLOAT | No | No | No | No |
| BOOLEAN | No | No | No | No |
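The integer rows of Tables 1 and 2 follow a regular pattern that can be written as a small rule. The sketch below is an illustration derived from the table entries, not the data wrapper's actual implementation. `classify_integer_mapping` is a hypothetical helper, and `heavydb_bits` stands for the storage width of the HeavyDB column (64 for BIGINT, 32 for INTEGER or BIGINT ENCODING FIXED (32), and so on).

```python
def classify_integer_mapping(parquet_bits: int, parquet_signed: bool,
                             heavydb_bits: int) -> str:
    """Return "Yes", "Coercible", or "No" for a Parquet-to-HeavyDB integer mapping."""
    valid_widths = (8, 16, 32, 64)
    if parquet_bits not in valid_widths or heavydb_bits not in valid_widths:
        raise ValueError("bit widths must be one of 8, 16, 32, 64")

    # Width of the signed HeavyDB type that holds every Parquet value without
    # loss: the same width for signed input, one size larger for unsigned.
    lossless_bits = parquet_bits if parquet_signed else parquet_bits * 2

    if heavydb_bits == lossless_bits:
        return "Yes"        # the single direct mapping
    if heavydb_bits < lossless_bits:
        return "Coercible"  # narrowing, allowed only as a coercion
    return "No"             # widening is disabled


# Parquet INT(32) into HeavyDB BIGINT (Table 1)
print(classify_integer_mapping(32, True, 64))    # No
# Parquet Unsigned INT(32) into HeavyDB BIGINT (Table 2)
print(classify_integer_mapping(32, False, 64))   # Yes
# Parquet Unsigned INT(64) into HeavyDB BIGINT (Table 2)
print(classify_integer_mapping(64, False, 64))   # Coercible
```

Unsigned INT(64) has no direct mapping under this rule because no 128-bit signed type is available, which is why every entry in that column is "Coercible".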
Numeric and Boolean Types: Table 3
| HeavyDB (Down) \ Parquet (Right) | DECIMAL (Precision, Scale) | DOUBLE | FLOAT | BOOLEAN |
| --- | --- | --- | --- | --- |
| BIGINT | No | No | No | No |
| BIGINT ENCODING FIXED (32) / INTEGER | No | No | No | No |
| BIGINT ENCODING FIXED (16) / INTEGER ENCODING FIXED (16) / SMALLINT | No | No | No | No |
| BIGINT ENCODING FIXED (8) / INTEGER ENCODING FIXED (8) / SMALLINT ENCODING FIXED (8) / TINYINT | No | No | No | No |
| DECIMAL (Precision, Scale) | Yes if Precision and Scale match; No otherwise | No | No | No |
| DOUBLE | No | Yes | No | No |
| FLOAT | No | Coercible [2] | Yes | No |
| BOOLEAN | No | No | No | Yes |
[1] Unsigned Parquet types must map to HeavyDB signed types one size larger to ensure that no information loss occurs. For example, Parquet’s Unsigned INT(32) must map to HeavyDB's BIGINT.
[2] Float types use 32 bits, while double types use 64 bits in their representation according to the IEEE standard. There is no check for precision loss when coercion from a double to float is requested. However, there is a check to see if the double fits in the range that float types are capable of representing.
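Note [2] separates two effects: a range check that is performed and a precision check that is not. The sketch below illustrates the difference using IEEE 754 single precision from the Python standard library. The helper names are hypothetical, and the data wrapper's own check may differ in detail, for example in how it treats non-finite values.

```python
import math
import struct

# Largest finite IEEE 754 single-precision value: (2 - 2**-23) * 2**127.
FLOAT32_MAX = (2.0 - 2.0 ** -23) * 2.0 ** 127


def fits_float32_range(value: float) -> bool:
    """Range check only: can this double be stored as a finite 32-bit float?"""
    if math.isnan(value) or math.isinf(value):
        # Assumption for this sketch: non-finite values pass through unchanged.
        return True
    return abs(value) <= FLOAT32_MAX


def round_trip_float32(value: float) -> float:
    """Store a double as a 32-bit float and read it back."""
    return struct.unpack("<f", struct.pack("<f", value))[0]


print(fits_float32_range(1e39))          # False: outside the float range
print(fits_float32_range(0.1))           # True: passes the range check
print(round_trip_float32(0.1) == 0.1)    # False: precision was lost silently
```

The value 0.1 passes the range check but does not survive the round trip exactly, which is the kind of loss that the coercion does not report.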
Date and Time Types: Table 1
| HeavyDB (Down) \ Parquet (Right) | DATE | TIME MILLIS (UTC) | TIME MICROS (UTC) | TIME NANOS (UTC) |
| --- | --- | --- | --- | --- |
| DATE / DATE ENCODING DAYS (32) | Yes | No | No | Coercible |
| DATE ENCODING DAYS (16) [1] | Coercible | No | No | Coercible |
| TIME [2] | No | Yes | Yes | Yes |
| TIME ENCODING FIXED (32) | No | Yes | Coercible | Coercible |
| TIMESTAMP (0) [3] | No | No | No | No |
| TIMESTAMP (3) | No | No | No | No |
| TIMESTAMP (6) | No | No | No | No |
| TIMESTAMP (9) | No | No | No | No |
| TIMESTAMP ENCODING FIXED (32) [4][5] | No | No | No | No |
Date and Time Types: Table 2
| HeavyDB (Down) \ Parquet (Right) | TIMESTAMP MILLIS (UTC) | TIMESTAMP MICROS (UTC) | TIMESTAMP NANOS (UTC) | INT64 | INT32 |
| --- | --- | --- | --- | --- | --- |
| DATE / DATE ENCODING DAYS (32) | Coercible | Coercible | No | No | No |
| DATE ENCODING DAYS (16) [1] | Coercible | Coercible | No | No | No |
| TIME [2] | No | No | No | No | No |
| TIME ENCODING FIXED (32) | No | No | No | No | No |
| TIMESTAMP (0) [3] | Yes | Yes | Yes | Coercible | No |
| TIMESTAMP (3) | Yes | No | No | Coercible | No |
| TIMESTAMP (6) | No | Yes | No | Coercible | No |
| TIMESTAMP (9) | No | No | Yes | Coercible | No |
| TIMESTAMP ENCODING FIXED (32) [4][5] | Coercible | Coercible | Coercible | Coercible | Coercible |
[1] DATE ENCODING DAYS (16) has no mapping from any Parquet type where the mapping would not result in a potential loss of information. Some mappings are allowed through coercion.
[2] The HeavyDB TIME type represents the number of seconds elapsed in a 24-hour period, while the Parquet TIME type represents a similar quantity in milliseconds, microseconds, or nanoseconds. To make use of such Parquet columns, this mapping is allowed even though it breaks the convention that only direct mappings are supported. An intermediate transform takes place, calculating the number of seconds elapsed from the number of milliseconds, microseconds, or nanoseconds.
[3] Similar to the HeavyDB TIME type, TIMESTAMP (0) represents the date and time in seconds, which is not compatible with any Parquet TIMESTAMP type. As such, an exception is made for this case to support mapping from millisecond, microsecond, and nanosecond Parquet TIMESTAMPs.
[4] Because Parquet’s TIMESTAMP stores the data in a 64-bit representation, and the ENCODING FIXED (32) uses a 32-bit representation, information loss is possible. In these cases, no mapping is supported; however, a coercion is supported.
[5] Timestamps that use a 32-bit representation with any precision other than seconds have a very limited range and are not supported. Second-precision timestamps that use a 32-bit representation have a maximum range of 8:45:53 pm UTC, Friday, December 13, 1901 to 3:14:07 am UTC, Tuesday, January 19, 2038.
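Two of the notes above involve simple arithmetic: the transform from a Parquet TIME value to seconds in note [2], and the range of a 32-bit second-precision timestamp in note [5]. The sketch below works through both. The function names are hypothetical, truncation of the sub-second part is an assumption (the source does not say whether the transform truncates or rounds), and excluding the lowest 32-bit value is an assumption chosen because it reproduces the lower bound quoted in note [5].

```python
from datetime import datetime, timedelta, timezone

# Parquet TIME units expressed as ticks per second.
UNITS_PER_SECOND = {"MILLIS": 10**3, "MICROS": 10**6, "NANOS": 10**9}

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def parquet_time_to_seconds(value: int, unit: str) -> int:
    """Seconds since midnight for a Parquet TIME value.

    Assumption: the sub-second part is truncated rather than rounded.
    """
    return value // UNITS_PER_SECOND[unit]


def int32_seconds_range():
    """Earliest and latest instants for a 32-bit count of seconds since 1970.

    Assumption: the lowest 32-bit value is unavailable (for example, reserved
    as a null marker), which matches the lower bound given in note [5].
    """
    lowest = -(2**31) + 1
    highest = 2**31 - 1
    return EPOCH + timedelta(seconds=lowest), EPOCH + timedelta(seconds=highest)


# 12:34:56.789 expressed in milliseconds since midnight
print(parquet_time_to_seconds(45_296_789, "MILLIS"))   # 45296

earliest, latest = int32_seconds_range()
print(earliest.isoformat())   # 1901-12-13T20:45:53+00:00
print(latest.isoformat())     # 2038-01-19T03:14:07+00:00
```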
String Types
| HeavyDB (Down) \ Parquet (Right) | STRING | ENUM (Not implemented) | UUID (Not implemented) | BYTE_ARRAY |
| --- | --- | --- | --- | --- |
| TEXT ENCODING DICT | Yes | Yes | Yes | Yes |
| TEXT ENCODING (16) | Yes | Yes | Yes | Yes |
| TEXT ENCODING (8) | Yes | Yes | Yes | Yes |
| TEXT ENCODING NONE | Yes | Yes | Yes | Yes |
GeoTypes
Only geometry types in WKT format are currently supported.
| HeavyDB (Down) \ Parquet (Right) | STRING | BYTE_ARRAY |
| --- | --- | --- |
| POINT | Yes | Yes |
| MULTIPOINT | Yes | Yes |
| LINESTRING | Yes | Yes |
| MULTILINESTRING | Yes | Yes |
| POLYGON | Yes | Yes |
| MULTIPOLYGON | Yes | Yes |
Array Types
HeavyDB array data type maps to the Parquet list data type.
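The restriction that Parquet list columns have a max definition level of 3 follows from how Parquet computes definition levels: each OPTIONAL or REPEATED node on the path from the column root to the leaf adds one level, and a REQUIRED node adds none. The sketch below illustrates that count for the standard three-node list layout. `max_definition_level` is a hypothetical helper, not part of any Parquet library.

```python
VALID_REPETITIONS = {"REQUIRED", "OPTIONAL", "REPEATED"}


def max_definition_level(path_repetitions):
    """Count the nodes on a column path that are not REQUIRED."""
    for repetition in path_repetitions:
        if repetition not in VALID_REPETITIONS:
            raise ValueError(f"unknown repetition type: {repetition}")
    return sum(1 for repetition in path_repetitions if repetition != "REQUIRED")


# Standard list layout: optional list group, repeated middle group,
# optional element. No node is REQUIRED, so the level is 3.
print(max_definition_level(["OPTIONAL", "REPEATED", "OPTIONAL"]))   # 3

# The same list with a REQUIRED element has a level of 2, which is the
# case the restriction on Parquet list columns rules out.
print(max_definition_level(["OPTIONAL", "REPEATED", "REQUIRED"]))   # 2
```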