Parquet Data Wrapper Reference

This reference lists the data types and configuration supported when using HeavyConnect with the Parquet data format. It outlines prerequisites, restrictions, and supported mappings of HeavyDB column types to Parquet column types.

Parquet Data Wrapper Prerequisites & Restrictions

| Type | Applicable to | Requirement or restriction |
| --- | --- | --- |
| Prerequisite | Every column in every row group in every Parquet file | The metadata must exist in a usable form, with populated statistics. In particular, the minimum, maximum, and null count must all be correctly set. |
| Restriction | HeavyDB foreign table max fragment size and every row group in every Parquet file | The HeavyDB foreign table's max fragment size must be greater than or equal to the row count of the largest row group in any Parquet file. |
| Restriction | Every Parquet file schema | Every Parquet file must have a flat schema (no nested types in the schema definition). The only exception is the mapping of Array types to Parquet lists (which are nested), described below. |
| Restriction | Every Parquet file schema | When more than one Parquet file is used, all file schemas must be identical to one another and compatible with the HeavyDB foreign table schema. |
| Prerequisite | Every date/time type column in every Parquet file | All date/time type columns are expected to be in UTC or adjusted to UTC. |
| Restriction | Every decimal column mapping | Decimal column mappings must have the same precision and scale in both the Parquet schema and the HeavyDB foreign table schema. |
| Restriction | Every numeric/boolean column mapping | Each Parquet numeric/boolean type maps directly to only the one HeavyDB type that best represents it; see the mapping tables below. Coercion allows additional mappings, which are labelled as such below. Because coercion can result in a loss of information, FSI attempts to detect this possibility and reports an error upon detection. |
| Restriction | Every numeric/boolean column mapping | Coercion to widen types is disabled because there is no apparent use case for it. |
| Restriction | Parquet list columns | Parquet list columns must specify a schema with a max definition level of 3, which means that no node in the list schema is required. Note: the Parquet data wrapper still detects a list schema that does not meet this requirement, but throws an error indicating that the max definition level is not as expected. |
| Restriction | Parquet scalar columns | Parquet (flat) scalar columns must have the OPTIONAL repetition type. REQUIRED is currently not supported. |
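The fragment-size restriction and the statistics prerequisite can be checked before a foreign table is created. The following is a minimal sketch, not part of HeavyDB: the function names and the dictionary shape used for statistics are hypothetical, and the row counts and statistics are assumed to have been read from the Parquet file footers by whatever tooling is available.

```python
def check_fragment_size(max_fragment_size, row_group_row_counts):
    """Return the largest row group size, or raise if it exceeds the fragment size.

    row_group_row_counts: row counts of every row group in every Parquet file
    (hypothetical input, gathered beforehand from the file metadata).
    """
    largest = max(row_group_row_counts, default=0)
    if largest > max_fragment_size:
        raise ValueError(
            f"largest row group has {largest} rows, which exceeds the "
            f"max fragment size of {max_fragment_size}"
        )
    return largest


def statistics_are_populated(stats):
    """Check that min, max, and null count are all present.

    stats: hypothetical dict with the keys "min", "max", and "null_count".
    Presence is the only thing checked here; correctness of the values
    cannot be verified without reading the data itself.
    """
    return all(stats.get(key) is not None for key in ("min", "max", "null_count"))


# Illustrative values only.
print(check_fragment_size(32_000_000, [1_000_000, 5_000_000]))
print(statistics_are_populated({"min": 0, "max": 9, "null_count": 0}))
```

If the first check fails, either the foreign table's max fragment size has to be raised or the Parquet files have to be rewritten with smaller row groups.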

HeavyDB to Parquet Data Type Mapping

Numeric and Boolean Types: Table 1

| HeavyDB (Down) \ Parquet (Right) | INT(64) | INT(32) | INT(16) | INT(8) |
| --- | --- | --- | --- | --- |
| BIGINT | Yes | No | No | No |
| BIGINT ENCODING FIXED (32) / INTEGER | Coercible | Yes | No | No |
| BIGINT ENCODING FIXED (16) / INTEGER ENCODING FIXED (16) / SMALLINT | Coercible | Coercible | Yes | No |
| BIGINT ENCODING FIXED (8) / INTEGER ENCODING FIXED (8) / SMALLINT ENCODING FIXED (8) / TINYINT | Coercible | Coercible | Coercible | Yes |
| DECIMAL (Precision, Scale) | No | No | No | No |
| DOUBLE | No | No | No | No |
| FLOAT | No | No | No | No |
| BOOLEAN | No | No | No | No |

Numeric and Boolean Types: Table 2

| HeavyDB (Down) \ Parquet (Right) | Unsigned INT(64) | Unsigned INT(32) | Unsigned INT(16) | Unsigned INT(8) |
| --- | --- | --- | --- | --- |
| BIGINT | Coercible | Yes [1] | No | No |
| BIGINT ENCODING FIXED (32) / INTEGER | Coercible | Coercible | Yes [1] | No |
| BIGINT ENCODING FIXED (16) / INTEGER ENCODING FIXED (16) / SMALLINT | Coercible | Coercible | Coercible | Yes [1] |
| BIGINT ENCODING FIXED (8) / INTEGER ENCODING FIXED (8) / SMALLINT ENCODING FIXED (8) / TINYINT | Coercible | Coercible | Coercible | Coercible |
| DECIMAL (Precision, Scale) | No | No | No | No |
| DOUBLE | No | No | No | No |
| FLOAT | No | No | No | No |
| BOOLEAN | No | No | No | No |

Numeric and Boolean Types: Table 3

| HeavyDB (Down) \ Parquet (Right) | DECIMAL (Precision, Scale) | DOUBLE | FLOAT | BOOLEAN |
| --- | --- | --- | --- | --- |
| BIGINT | No | No | No | No |
| BIGINT ENCODING FIXED (32) / INTEGER | No | No | No | No |
| BIGINT ENCODING FIXED (16) / INTEGER ENCODING FIXED (16) / SMALLINT | No | No | No | No |
| BIGINT ENCODING FIXED (8) / INTEGER ENCODING FIXED (8) / SMALLINT ENCODING FIXED (8) / TINYINT | No | No | No | No |
| DECIMAL (Precision, Scale) | Yes if Precision and Scale match; No otherwise | No | No | No |
| DOUBLE | No | Yes | No | No |
| FLOAT | No | Coercible [2] | Yes | No |
| BOOLEAN | No | No | No | Yes |

[1] Unsigned Parquet types must map to the HeavyDB signed type that is one size larger (for example, Parquet's Unsigned INT(32) must map to HeavyDB's BIGINT) to ensure that no information is lost.

[2] According to the IEEE standard, float types use a 32-bit representation and double types use a 64-bit representation. When coercion from a double to a float is requested, there is no check for precision loss. However, there is a check that the double fits in the range that float types can represent.
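The range reasoning behind the coercion rules and footnotes [1] and [2] can be illustrated with a few lines of arithmetic. This is only a sketch of the idea, not the checks FSI actually performs: the function names are hypothetical, and any values that HeavyDB may reserve internally (for example, to represent nulls) are ignored.

```python
def fits_signed(value, bits):
    """True if an integer fits in a signed two's-complement type of the given width."""
    low, high = -(1 << (bits - 1)), (1 << (bits - 1)) - 1
    return low <= value <= high


def lossless_signed_bits(unsigned_bits):
    """Width of the next larger signed type that holds every unsigned value.

    Returns None for unsigned 64-bit, since no larger signed width is assumed
    to be available; that case can only be handled through coercion.
    """
    return unsigned_bits * 2 if unsigned_bits < 64 else None


# Largest finite IEEE 754 single-precision value, computed exactly.
FLOAT32_MAX = (2 - 2 ** -23) * 2.0 ** 127


def double_in_float_range(value):
    """Range check only: says nothing about precision loss.

    NaN and infinities return False here; how they should be treated is an
    assumption of this sketch.
    """
    return abs(value) <= FLOAT32_MAX


print(fits_signed(128, 8))             # 128 is outside the 8-bit signed range
print(lossless_signed_bits(32))        # unsigned 32-bit needs a signed 64-bit type
print(double_in_float_range(1e39))     # larger than any finite float
```

A narrowing coercion is safe for a given file only if every value passes the corresponding range check, which is why populated minimum and maximum statistics matter.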

Date and Time Types: Table 1

| HeavyDB (Down) \ Parquet (Right) | DATE | TIME MILLIS (UTC) | TIME MICROS (UTC) | TIME NANOS (UTC) |
| --- | --- | --- | --- | --- |
| DATE / DATE ENCODING DAYS (32) | Yes | No | No | Coercible |
| DATE ENCODING DAYS (16) [1] | Coercible | No | No | Coercible |
| TIME [2] | No | Yes | Yes | Yes |
| TIME ENCODING FIXED (32) | No | Yes | Coercible | Coercible |
| TIMESTAMP (0) [3] | No | No | No | No |
| TIMESTAMP (3) | No | No | No | No |
| TIMESTAMP (6) | No | No | No | No |
| TIMESTAMP (9) | No | No | No | No |
| TIMESTAMP ENCODING FIXED (32) [4][5] | No | No | No | No |

Date and Time Types: Table 2

| HeavyDB (Down) \ Parquet (Right) | TIMESTAMP MILLIS (UTC) | TIMESTAMP MICROS (UTC) | TIMESTAMP NANOS (UTC) | INT64 | INT32 |
| --- | --- | --- | --- | --- | --- |
| DATE / DATE ENCODING DAYS (32) | Coercible | Coercible | No | No | No |
| DATE ENCODING DAYS (16) [1] | Coercible | Coercible | No | No | No |
| TIME [2] | No | No | No | No | No |
| TIME ENCODING FIXED (32) | No | No | No | No | No |
| TIMESTAMP (0) [3] | Yes | Yes | Yes | Coercible | No |
| TIMESTAMP (3) | Yes | No | No | Coercible | No |
| TIMESTAMP (6) | No | Yes | No | Coercible | No |
| TIMESTAMP (9) | No | No | Yes | Coercible | No |
| TIMESTAMP ENCODING FIXED (32) [4][5] | Coercible | Coercible | Coercible | Coercible | Coercible |

[1] DATE ENCODING DAYS (16) has no mapping from any Parquet type where the mapping would not result in a potential loss of information. Some mappings are allowed through coercion.

[2] The HeavyDB TIME type represents the number of seconds elapsed in a 24-hour period, while the Parquet TIME type represents a similar quantity in milliseconds, microseconds, or nanoseconds. To make use of such Parquet columns, this mapping is allowed even though it breaks the convention that only direct mappings are supported. An intermediate transform takes place, calculating the number of seconds elapsed from the number of milliseconds, microseconds, or nanoseconds.

[3] Like the HeavyDB TIME type, TIMESTAMP (0) represents the date and time in seconds, which is not compatible with any Parquet TIMESTAMP type. An exception is therefore made for this case to support mapping from Parquet TIMESTAMPs of millisecond, microsecond, or nanosecond precision.

[4] Because Parquet’s TIMESTAMP stores the data in a 64-bit representation, and the ENCODING FIXED (32) uses a 32-bit representation, information loss is possible. In these cases, no mapping is supported; however, a coercion is supported.

[5] Timestamps that use a 32-bit representation with any precision other than seconds have a very limited range and are not supported. Second-precision timestamps that use a 32-bit representation have a maximum range of 8:45:53 pm UTC, Friday, December 13, 1901 to 3:14:07 am UTC, Tuesday, January 19, 2038.
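The arithmetic behind footnotes [2] and [5] can be reproduced with the standard library. This is a sketch under stated assumptions: the function names are hypothetical, and truncation toward whole seconds is assumed for the intermediate transform, since the rounding behavior is not specified here.

```python
from datetime import datetime, timedelta, timezone

UNITS_PER_SECOND = {"millis": 1_000, "micros": 1_000_000, "nanos": 1_000_000_000}


def parquet_time_to_seconds(value, unit):
    """Convert a Parquet TIME value to whole seconds since midnight.

    Truncation is an assumption of this sketch, not documented behavior.
    """
    return value // UNITS_PER_SECOND[unit]


EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)
INT32_MAX = 2 ** 31 - 1


def epoch_seconds_to_utc(seconds):
    """Convert seconds since the Unix epoch to a timezone-aware UTC datetime."""
    return EPOCH + timedelta(seconds=seconds)


print(parquet_time_to_seconds(3_600_000, "millis"))  # one hour, in seconds
print(epoch_seconds_to_utc(-INT32_MAX))  # lower bound quoted in footnote [5]
print(epoch_seconds_to_utc(INT32_MAX))   # upper bound quoted in footnote [5]
```

The lower bound quoted in footnote [5] corresponds to -(2^31 - 1) seconds, one second above the lowest value a signed 32-bit integer can hold.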

String Types

| HeavyDB (Down) \ Parquet (Right) | STRING | ENUM (Not implemented) | UUID (Not implemented) | BYTE_ARRAY |
| --- | --- | --- | --- | --- |
| TEXT ENCODING DICT | Yes | Yes | Yes | Yes |
| TEXT ENCODING (16) | Yes | Yes | Yes | Yes |
| TEXT ENCODING (8) | Yes | Yes | Yes | Yes |
| TEXT ENCODING NONE | Yes | Yes | Yes | Yes |

GeoTypes

Only geometry types in WKT format are currently supported.

| HeavyDB (Down) \ Parquet (Right) | STRING | BYTE_ARRAY |
| --- | --- | --- |
| POINT | Yes | Yes |
| MULTIPOINT | Yes | Yes |
| LINESTRING | Yes | Yes |
| MULTILINESTRING | Yes | Yes |
| POLYGON | Yes | Yes |
| MULTIPOLYGON | Yes | Yes |

Array Types

HeavyDB array data type maps to the Parquet list data type.
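The max definition level of 3 required for list columns follows from how Parquet computes definition levels: the level of a leaf column equals the number of nodes on its path that are not REQUIRED. The sketch below illustrates this with a hypothetical helper that takes the repetition types along the path from the root to the leaf; the three-node path shown is the standard Parquet list layout (outer list group, repeated middle group, element).

```python
def max_definition_level(path_repetitions):
    """Count the OPTIONAL and REPEATED nodes on the path from root to leaf."""
    return sum(1 for repetition in path_repetitions if repetition != "REQUIRED")


# Outer list group, repeated middle group, element: every node non-required.
supported_list = ["OPTIONAL", "REPEATED", "OPTIONAL"]
# Same layout with a REQUIRED element, which lowers the max definition level.
required_element = ["OPTIONAL", "REPEATED", "REQUIRED"]

print(max_definition_level(supported_list))
print(max_definition_level(required_element))
```

A list schema with any REQUIRED node therefore has a max definition level below 3, which is the condition the Parquet data wrapper reports as an error.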
