Best Practices
When starting a new foreign table with a large dataset, use directories to help organize the data.
When starting a new foreign table with a large dataset, using symlinks, directories, manual refresh, and REFRESH_UPDATE_TYPE='APPEND' lets you start visualizing data quickly while additional data is still being processed and brought into the system. Manual refresh gives you control over when updates occur as you build out the system.
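As a concrete illustration, the sketch below assumes the heavyai Python client and the CREATE FOREIGN TABLE / REFRESH FOREIGN TABLES commands described elsewhere in this documentation; the connection parameters, table and column names, server name, and directory path are placeholders, not prescribed values.

```python
# Hypothetical example: a foreign table over a directory of CSV files,
# refreshed manually with REFRESH_UPDATE_TYPE='APPEND' so only newly added
# files are scanned on each refresh. All names and paths are placeholders.
from heavyai import connect

con = connect(user="admin", password="HyperInteractive",
              host="localhost", dbname="heavyai")

# Point the foreign table at a directory; new files (or symlinks to files
# that are still being staged) can be dropped into it over time.
con.execute("""
    CREATE FOREIGN TABLE trips_staging (
        pickup_ts   TIMESTAMP,
        fare_amount DOUBLE
    )
    SERVER default_local_delimited
    WITH (
        file_path = 'trips/',
        refresh_timing_type = 'MANUAL',
        refresh_update_type = 'APPEND'
    );
""")

# After more files or symlinks are added to trips/, pull them in on demand.
con.execute("REFRESH FOREIGN TABLES trips_staging;")
```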
When scheduling refreshes with REFRESH_INTERVAL, choose a time of low HeavyDB usage, especially when using REFRESH_UPDATE_TYPE='ALL', to minimize the performance impact of metadata updates.
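A scheduled-refresh variant of the same sketch might look like the following, with the start time set to an off-peak hour. Again, the names, path, and connection details are placeholders, and the exact REFRESH_START_DATE_TIME and REFRESH_INTERVAL value formats should be taken from the HeavyConnect reference.

```python
# Hypothetical example: a daily scheduled refresh starting at 03:00 (an
# off-peak hour) with REFRESH_UPDATE_TYPE='ALL'. Names, path, start time,
# and interval are placeholders; consult the HeavyConnect reference for the
# exact option value formats.
from heavyai import connect

con = connect(user="admin", password="HyperInteractive",
              host="localhost", dbname="heavyai")

con.execute("""
    CREATE FOREIGN TABLE trips_full (
        pickup_ts   TIMESTAMP,
        fare_amount DOUBLE
    )
    SERVER default_local_delimited
    WITH (
        file_path = 'trips/',
        refresh_timing_type = 'SCHEDULED',
        refresh_start_date_time = '2024-01-01T03:00:00',
        refresh_interval = '1D',
        refresh_update_type = 'ALL'
    );
""")
```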
When using compressed CSV files, keep the row count per compressed file small (fewer than 1 million rows). This allows for more efficient parallel processing.
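For example, a large CSV can be split into gzip-compressed pieces of fewer than 1 million rows each before being placed in the foreign table's directory. The sketch below uses pandas; the file paths and chunk size are placeholders.

```python
# Hypothetical example: split one large CSV into gzip-compressed files of
# fewer than 1 million rows each so they can be decompressed and parsed in
# parallel. Paths and the chunk size are placeholders.
import pandas as pd

CHUNK_ROWS = 900_000  # keep each compressed file under 1 million rows

for i, chunk in enumerate(pd.read_csv("trips_big.csv", chunksize=CHUNK_ROWS)):
    chunk.to_csv(f"trips/trips_part_{i:04d}.csv.gz",
                 index=False,
                 compression="gzip")
```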
When using Parquet format, ensure that FileMetaData is generated as part of the data. This is required for HeavyConnect Parquet support.
When using Parquet format, use reasonably large row groups; small row groups degrade performance (see the sketch after the quoted guidance). The following guidance is from Apache Parquet (https://parquet.apache.org/documentation/latest/):
Row group size: Larger row groups allow for larger column chunks, which makes it possible to do larger sequential IO. Larger groups also require more buffering in the write path (or a two-pass write). We recommend large row groups (512MB - 1GB). Because an entire row group might need to be read, it should completely fit on one HDFS block. Therefore, HDFS block sizes should also be set to be larger. For example, an optimized read setup would be 1GB row groups, 1GB HDFS block size, and 1 HDFS block per HDFS file.
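As one way to produce such files, the sketch below uses pyarrow, which writes the Parquet FileMetaData footer as part of every file it produces. The paths and the row-group size (specified in rows) are placeholders and should be tuned so that the resulting row groups land in the recommended size range for your data.

```python
# Hypothetical example: convert a CSV to Parquet with large row groups using
# pyarrow. pyarrow writes the Parquet FileMetaData footer as part of the file.
# Paths and the row-group size (in rows) are placeholders; choose a value that
# yields row groups in the recommended size range for your column widths.
import pyarrow.csv as pv_csv
import pyarrow.parquet as pq

table = pv_csv.read_csv("trips_big.csv")

pq.write_table(table, "trips/trips.parquet", row_group_size=5_000_000)
```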