Loading Data with SQL
This topic describes several ways to load data to OmniSci using SQL commands.
- If there is a potential for duplicate entries, and you want to avoid loading duplicate rows, see How can I avoid creating duplicate rows? on the Troubleshooting page.
- If a source file uses a reserved word, OmniSci automatically adds an underscore at the end of the reserved word. For example, year is converted to year_.
Use the following syntax for CSV and TSV files:
COPY <table> FROM '<file pattern>' [WITH (<property> = value, ...)];
<file pattern> must be local on the server. The file pattern can contain wildcards if you want to load multiple files. In addition to CSV, TSV, and TXT files, you can import compressed files in TAR, ZIP, 7-ZIP, RAR, GZIP, BZIP2, or TGZ format.
COPY FROM appends data from the source into the target table. It does not truncate the table or overwrite existing data.
You can import client-side files (\copy command in omnisql), but it is significantly slower. For large files, OmniSci recommends that you first scp the file to the server, and then issue the COPY command.
OmniSci supports Latin-1 ASCII format and UTF-8. If you want to load data with another encoding (for example, UTF-16), convert the data to UTF-8 before loading it to OmniSci.
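As an example of the client-side \copy command mentioned above, a load from omnisql might look like the following sketch (the file path and table name are hypothetical, and the argument order of file path followed by table name is assumed):
omnisql> \copy /home/analyst/tweets.csv tweets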
Available properties in the optional WITH clause are described in the following table.
Parameter | Description | Default Value |
array_delimiter | A single-character string for the delimiter between input values contained within an array. | , (comma) |
array_marker | A two-character string consisting of the start and end characters surrounding an array. | { } (curly brackets). For example, data to be inserted into a table with a string array in the second column (for example, BOOLEAN, STRING[], INTEGER ) can be written as true,{value1,value2,value3},3 |
buffer_size | Size of the input file buffer, in bytes. | 8388608 |
delimiter | A single-character string for the delimiter between input fields; most commonly ',' for CSV files and '\t' for TSV files. Other delimiters include `|`, ~, ^, and ;. Note: OmniSci does not use file extensions to determine the delimiter. | ',' (CSV file) |
escape | A single-character string for escaping quotes. | '"' (double quote) |
header | Either 'true' or 'false' , indicating whether the input file has a header line in Line 1 that should be skipped. | 'true' |
line_delimiter | A single-character string for terminating each line. | '\n' |
lonlat | In OmniSci, POINT fields require longitude before latitude. Use this parameter based on the order of longitude and latitude in your source data. | 'true' |
max_reject | Number of records that the COPY statement allows to be rejected before terminating the COPY command. Records can be rejected for a number of reasons, including invalid content in a field, or an incorrect number of columns. The details of the rejected records are reported in the ERROR log. COPY returns a message identifying how many records are rejected. The records that are not rejected are inserted into the table, even if the COPY stops because the max_reject count is reached.Note: If you run the COPY command from OmniSci Immerse, the COPY command does not return messages to Immerse once the SQL is verified. Immerse does not show messages about data loading, or about data-quality issues that result in max_reject triggers. | 100,000 |
nulls | A string pattern indicating that a field is NULL. | An empty string, 'NA' , or \N |
parquet | Import data in Parquet format. Parquet files can be compressed using Snappy. Other archives such as .gz or .zip must be unarchived before you import the data. | 'false' |
plain_text | Indicates that the input file is plain text so that it bypasses the libarchive decompression utility. | CSV, TSV, and TXT are handled as plain text. |
quote | A single-character string for quoting a field. | " (double quote). All characters inside quotes are imported “as is,” except for line delimiters. |
quoted | Either 'true' or 'false' , indicating whether the input file contains quoted fields. | 'true' |
source_srid | When importing into GEOMETRY(*, 4326) columns, specifies the SRID of the incoming geometries, all of which are transformed on the fly. For example, to import from a file that contains EPSG:2263 (NAD83 / New York Long Island) geometries, run the COPY command and include WITH (source_srid=2263). Data targeted at non-4326 geometry columns is not affected. | 0 |
threads | Number of threads for performing the data import. | Number of CPU cores on the system |
By default, the CSV parser assumes one row per line. To import a file with multiple lines in a single field, specify threads = 1 in the WITH clause.
COPY tweets FROM '/tmp/tweets.csv' WITH (nulls = 'NA');
COPY tweets FROM '/tmp/tweets.tsv' WITH (delimiter = '\t', quoted = 'false');
COPY tweets FROM '/tmp/*' WITH (header='false');
COPY trips FROM '/mnt/trip/trip.parquet/part-00000-0284f745-1595-4743-b5c4-3aa0262e4de3-c000.snappy.parquet' with (parquet='true');
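The following sketches show additional WITH options from the table above in context; the table and file names are hypothetical:
COPY support_cases FROM '/tmp/cases_multiline.csv' WITH (threads = 1);
COPY building_footprints FROM '/tmp/nyc_footprints.csv' WITH (source_srid = 2263, max_reject = 1000);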
You can use COPY FROM to import geo files. You can create the table based on the source file and then load the data:
COPY tableName FROM 'source' WITH (geo='true', ...);
You can also append data to an existing, predefined table:
COPY tableName FROM 'source' WITH (geo='true', ...);
Use the following syntax, depending on the file source.
Source | Syntax |
Local server | COPY [tableName] FROM '/filepath' WITH (geo='true', ...); |
Web site | COPY [tableName] FROM 'http(s)://website/filepath' WITH (geo='true', ...); |
Amazon S3 | COPY [tableName] FROM 's3://bucket/filepath' WITH (geo='true', s3_region='region', s3_access_key='accesskey', s3_secret_key='secretkey', ...); |
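For example, a hypothetical import of a zipped shapefile from a web server (the table name and URL are placeholders) might look like this:
COPY county_boundaries FROM 'https://www.example.com/data/counties.zip' WITH (geo='true');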
- If you are using COPY FROM to load to an existing table, the field type must match the metadata of the source file. If it does not, COPY FROM throws an error and does not load the data.
- COPY FROM appends data from the source into the target table. It does not truncate the table or overwrite existing data.
- Supported DATE formats when using COPY FROM include mm/dd/yyyy, dd-mmm-yy, yyyy-mm-dd, and dd/mmm/yyyy.
- COPY FROM fails for records with latitude or longitude values that have more than 4 decimal places.
The following WITH options are available for geo file imports from all sources.
Parameter | Description | Default Value |
geo_assign_render_groups | Enable or disable automatic render group assignment for polygon imports; can be true or false . If polygons are not needed for rendering, set this to false to speed up import. | true |
geo_coords_type | Coordinate type used; must be geography . | N/A |
geo_coords_encoding | Coordinates encoding; can be geoint(32) or none . | geoint(32) |
geo_coords_srid | Coordinates spatial reference; must be 4326 (WGS84 longitude/latitude). | N/A |
geo_explode_collections | Explodes MULTIPOLYGON, MULTILINESTRING, or MULTIPOINT geo data into multiple rows in a POLYGON, LINESTRING, or POINT column, with all other columns duplicated. When importing from a WKT CSV with a MULTIPOLYGON column, the table must have been manually created with a POLYGON column. Similarly, a MULTILINESTRING column must be created with LINESTRING, and a MULTIPOINT column must be created with POINT. Storing MULTILINESTRING or MULTIPOINT directly is not supported, but can be exploded into a LINESTRING or POINT column. When importing from a geo file, the table is automatically created with the correct type of column. When the input column contains a mixture of MULTI and single geo, the MULTI geo are exploded, but the singles are imported normally. For example, a column containing five two-polygon MULTIPOLYGON rows and five POLYGON rows imports as a POLYGON column of fifteen rows. | false |
Currently, a manually created geo table can have only one geo column. If it has more than one, import is not performed.
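For illustration, the following sketch (hypothetical table and file names) imports a GeoJSON file, skips render group assignment, and explodes any MULTIPOLYGON rows into POLYGON rows:
COPY country_borders FROM '/data/countries.geojson' WITH (geo='true', geo_assign_render_groups='false', geo_explode_collections='true');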
The following file types are supported:
- ESRI Shapefile (.shp)
- ESRI file geodatabase (.gdb)
- GeoJSON (.geojson, .json, .geojson.gz, .json.gz)
- KML (.kml, .kmz)
- File bundles:
- .zip
- .tar
- .tar.gz
- .tgz
An ESRI file geodatabase can have multiple layers, and importing it results in the creation of one table for each layer in the file. This behavior differs from that of importing shapefiles, GeoJSON, or KML files, which results in a single table. For more information, see Importing an ESRI File Geodatabase.
The first compatible file (.shp, .gdb, .geojson, .kml) in the bundle is loaded; subfolders are traversed until a compatible file is found. The rest of the contents in the bundle are ignored. If the bundle contains multiple filesets, unpack the file manually and specify it for import.
CSV files containing WKT strings are not considered geo files and should not be parsed with the geo='true' option. When importing WKT strings from CSV files, you must create the table first. The geo column type and encoding are specified as part of the DDL. For example, a polygon column with 32-bit compressed encoding can be declared as follows:
ggpoly GEOMETRY(POLYGON, 4326) ENCODING COMPRESSED(32)
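A minimal end-to-end sketch for a WKT CSV import, using hypothetical table, column, and file names:
CREATE TABLE zones (
  zone_name TEXT ENCODING DICT(32),
  zone_poly GEOMETRY(POLYGON, 4326) ENCODING COMPRESSED(32));
COPY zones FROM '/data/zones_wkt.csv' WITH (header='true');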
SQLImporter is a Java utility run at the command line. It runs a SELECT statement on another database through JDBC and loads the result set into OmniSciDB.
java -cp [OmniSci utility jar file]:[3rd party JDBC driver]
SQLImporter
-u <userid> -p <password> [(--binary|--http|--https [--insecure])]
-s <omnisci server host> -db <omnisci db> --port <omnisci server port>
[-d <other database JDBC driver class>] -c <other database JDBC connection string>
-su <other database user> -sp <other database user password> -ss <other database sql statement>
-t <OmniSci target table> -b <transfer buffer size> -f <table fragment size>
[-tr] [-nprg] [-adtf] -i <init commands file>
-r <arg> Row load limit
-h,--help Help message
-u,--user <arg> OmniSci user
-p,--passwd <arg> OmniSci password
--binary Use binary transport to connect to OmniSci
--http Use http transport to connect to OmniSci
--https Use https transport to connect to OmniSci
-s,--server <arg> OmniSci Server
-db,--database <arg> OmniSci Database
--port <arg> OmniSci Port
--ca-trust-store <arg> CA certificate trust store
--ca-trust-store-passwd <arg> CA certificate trust store password
--insecure <arg> Insecure TLS - Do not validate OmniSci server credentials
-d,--driver <arg> JDBC driver class
-c,--jdbcConnect <arg> JDBC connection string
-su,--sourceUser <arg> Source user
-sp,--sourcePasswd <arg> Source password
-ss,--sqlStmt <arg> SQL Select statement
-t,--targetTable <arg> OmniSci Target Table
-b,--bufferSize <arg> Transfer buffer size
-f,--fragmentSize <arg> Table fragment size
-tr,--truncate Truncate table if it exists
-nprg,--noPolyRenderGroups Disable render group assignment
-adtf,--allowDoubleToFloat Allow narrow casting
-i,--initializeFile <arg> File containing init command for DB
OmniSci recommends that you use a service account with read-only permissions when accessing data from a remote database.
In release 4.6 and higher, the user ID (-u) and password (-p) flags are required. If your password includes a special character, you must escape the character using a backslash (\).
If the table does not exist in OmniSciDB, SQLImporter creates it. If the target table in OmniSciDB does not match the SELECT statement metadata, SQLImporter fails.
If the truncate flag is used, SQLImporter truncates the table in OmniSciDB before transferring the data. If the truncate flag is not used, SQLImporter appends the results of the SQL statement to the target table in OmniSciDB.
The -i argument provides a path to an initialization file. Each line of the file is sent as a SQL statement to the remote database. You can use -i to set additional custom parameters before the data is loaded.
The SQLImporter string is case-sensitive. Incorrect case returns the following error:
Error: Could not find or load main class com.mapd.utility.SQLimporter
You can migrate geo data types from a PostgreSQL database. The following table shows the correlation between PostgreSQL/PostGIS geo types and OmniSci geo types.
PostgreSQL/PostGIS Type | OmniSci Type |
point | point |
lseg | linestring |
linestring | linestring |
polygon | polygon |
multipolygon | multipolygon |
Other PostgreSQL types, including circle, box, and path, are not supported.
The following example uses another OmniSciDB instance as the source:
java -cp /opt/omnisci/bin/omnisci-utility-5.6.0.jar
com.mapd.utility.SQLImporter -u admin -p HyperInteractive -db omnisci --port 6274
-t mytable -su admin -sp HyperInteractive -c "jdbc:omnisci:myhost:6274:omnisci"
-ss "select * from mytable limit 1000000000"
By default, 100,000 records are selected from the source database. To select a larger number of records, use the LIMIT statement.
The following example imports data from Apache Hive:
java -cp /opt/omnisci/bin/omnisci-utility-5.6.0.jar:/hive-jdbc-1.2.1000.2.6.1.0-129-standalone.jar
com.mapd.utility.SQLImporter
-u user -p password
-db OmniSci_database_name --port 6274 -t OmniSci_table_name
-su source_user -sp source_password
-c "jdbc:hive2://server_address:port_number/database_name"
-ss "select * from source_table_name"
The following example imports data from Google BigQuery:
java -cp /opt/omnisci/bin/omnisci-utility-5.6.0.jar:./GoogleBigQueryJDBC42.jar:
./google-oauth-client-1.22.0.jar:./google-http-client-jackson2-1.22.0.jar:./google-http-client-1.22.0.jar:./google-api-client-1.22.0.jar:
./google-api-services-bigquery-v2-rev355-1.22.0.jar
com.mapd.utility.SQLImporter
-d com.simba.googlebigquery.jdbc42.Driver
-u user -p password
-db OmniSci_database_name --port 6274 -t OmniSci_table_name
-su source_user -sp source_password
-c "jdbc:bigquery://https://www.googleapis.com/bigquery/v2:443;ProjectId=project-id;OAuthType=0;
[email protected];OAuthPvtKeyPath=/home/simba/myproject.json;"
-ss "select * from schema.source_table_name"
The following example imports data from PostgreSQL:
java -cp /opt/omnisci/bin/omnisci-utility-5.6.0.jar:/tmp/postgresql-42.2.5.jar
com.mapd.utility.SQLImporter
-u user -p password
-db OmniSci_database_name --port 6274 -t OmniSci_table_name
-su source_user -sp source_password
-c "jdbc:postgresql://127.0.0.1/postgres"
-ss "select * from schema_name.source_table_name"
The following example imports data from Microsoft SQL Server:
java -cp /opt/omnisci/bin/omnisci-utility-5.6.0.jar:/path/sqljdbc4.jar
com.mapd.utility.SQLImporter
-d com.microsoft.sqlserver.jdbc.SQLServerDriver
-u user -p password
-db OmniSci_database_name --port 6274 -t OmniSci_table_name
-su source_user -sp source_password
-c "jdbc:sqlserver://server:port;DatabaseName=database_name"
-ss "select top 10 * from dbo.source_table_name"
The following example imports data from MySQL:
java -cp /opt/omnisci/bin/omnisci-utility-5.6.0.jar:mysql/mysql-connector-java-5.1.38-bin.jar
com.mapd.utility.SQLImporter
-u user -p password
-db OmniSci_database_name --port 6274 -t OmniSci_table_name
-su source_user -sp source_password
-c "jdbc:mysql://server:port/database_name"
-ss "select * from schema_name.source_table_name"
Stream data into OmniSciDB by attaching the StreamInsert program to the end of a data stream. The data stream can be another program printing to standard out, a Kafka endpoint, or any other real-time stream output. You can specify the appropriate batch size, according to the expected stream rates and your insert frequency. The target table must exist before you attempt to stream data into the table.
<data stream> | StreamInsert <table name> <database name> \
{-u|--user} <user> {-p|--passwd} <password> [{--host} <hostname>] \
[--port <port number>][--delim <delimiter>][--null <null string>] \
[--line <line delimiter>][--batch <batch size>][{-t|--transform} \
transformation ...][--retry_count <num_of_retries>] \
[--retry_wait <wait in secs>][--print_error][--print_transform]
Setting | Default | Description | |
<table_name> | n/a | Name of the target table in OmniSci | |
<database_name> | n/a | Name of the target database in OmniSci | |
-u | n/a | User name | |
-p | n/a | User password | |
--host | n/a | Name of OmniSci host | |
--delim | comma (,) | Field delimiter, in single quotes | |
--line | newline (\n) | Line delimiter, in single quotes | |
--batch | 10000 | Number of records in a batch | |
--retry_count | 10 | Number of attempts before job fails | |
--retry_wait | 5 | Wait time in seconds after server connection failure | |
--null | n/a | String that represents null values | |
--port | 6274 | Port number for OmniSciDB on localhost | |
`-t | --transform` | n/a | Regex transformation |
--print_error | False | Print error messages | |
--print_transform | False | Print description of transform. | |
--help | n/a | List options | |
cat file.tsv | /path/to/omnisci/SampleCode/StreamInsert stream_example \
omnisci --host localhost --port 6274 -u imauser -p imapassword \
--delim '\t' --batch 1000
You can use the SQL COPY FROM statement to import files stored on Amazon Web Services Simple Storage Service (AWS S3) into an OmniSci table, in much the same way you would with local files. In the WITH clause, specify the S3 credentials and region information of the bucket accessed.
COPY <table> FROM '<S3_file_URL>' WITH ([[s3_access_key = '<key_name>', s3_secret_key = '<key_secret>',] | [s3_session_token = '<AWS_session_token>',]] s3_region = '<region>');
Access key and secret key, or session token if using temporary credentials, and region are required. For information about AWS S3 credentials, see https://docs.aws.amazon.com/general/latest/gr/aws-sec-cred-types.html#access-keys-and-secret-access-keys.
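For example, following the syntax above, an import using temporary credentials might look like this sketch (the bucket, path, and region are hypothetical, and the session token placeholder is left elided):
COPY trips FROM 's3://my-bucket/trip_data/trip_data_1.csv' WITH (s3_session_token = '<token>', s3_region = 'us-west-2');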
OmniSci does not support the use of asterisks (*) in URL strings to import items. To import multiple files, pass in an S3 path instead of a file name, and COPY FROM imports all items in that path and any subpath.
OmniSci supports custom S3 endpoints, which allows you to import data from S3-compatible services, such as Google Cloud Storage.
To use custom S3 endpoints, add s3_endpoint to the WITH clause of a COPY FROM statement; for example:
COPY trips FROM 's3://omnisci-importtest-data/trip-data/trip_data_9.gz' WITH (header='true', s3_endpoint='storage.googleapis.com');
For information about interoperability and setup for Google Cloud Services, see Cloud Storage Interoperability.
You can also configure custom S3 endpoints by passing the s3_endpoint field to Thrift import_table.
The following examples show a failed import attempt without credentials, a failed attempt that specifies the wrong region, and a successful import:
omnisql> COPY trips FROM 's3://omnisci-s3-no-access/trip_data_9.gz';
Exception: failed to list objects of s3 url 's3://omnisci-s3-no-access/trip_data_9.gz': AccessDenied: Access Denied
omnisql> COPY trips FROM 's3://omnisci-s3-no-access/trip_data_9.gz' with (s3_access_key='xxxxxxxxxx',s3_secret_key='yyyyyyyyy');
Exception: failed to list objects of s3 url 's3://omnisci-s3-no-access/trip_data_9.gz': AuthorizationHeaderMalformed: Unable to parse ExceptionName: AuthorizationHeaderMalformed Message: The authorization header is malformed; the region 'us-east-1' is wrong; expecting 'us-west-1'
omnisql> COPY trips FROM 's3://omnisci-testdata/trip.compressed/trip_data_9.csv' with (s3_access_key='xxxxxxxx',s3_secret_key='yyyyyyyy',s3_region='us-west-1');
Result
Loaded: 100 recs, Rejected: 0 recs in 0.361000 secs
The following example imports all the files in the trip.compressed directory.
omnisql> copy trips from 's3://omnisci-testdata/trip.compressed/' with (s3_access_key='xxxxxxxx',s3_secret_key='yyyyyyyy',s3_region='us-west-1');
Result
Loaded: 105200 recs, Rejected: 0 recs in 1.890000 secs
The table trips is created with the following statement:
omnisql> \d trips
CREATE TABLE trips (
medallion TEXT ENCODING DICT(32),
hack_license TEXT ENCODING DICT(32),
vendor_id TEXT ENCODING DICT(32),
rate_code_id SMALLINT,
store_and_fwd_flag TEXT ENCODING DICT(32),
pickup_datetime TIMESTAMP,
dropoff_datetime TIMESTAMP,
passenger_count SMALLINT,
trip_time_in_secs INTEGER,
trip_distance DECIMAL(14,2),
pickup_longitude DECIMAL(14,2),
pickup_latitude DECIMAL(14,2),
dropoff_longitude DECIMAL(14,2),
dropoff_latitude DECIMAL(14,2))
WITH (FRAGMENT_SIZE = 75000000);
You can configure the OmniSci server to provide AWS credentials, which allows S3 queries to be run without specifying AWS credentials.
Run the following to enable OmniSci server privileges:
./startomnisci --allow-s3-server-privileges
OR
./omnisci_server --allow-s3-server-privileges
S3 regions are not configured by the server and may need to be specified on the client side.
Examples:
\detect:
$ export AWS_REGION=us-west-1
omnisql> \detect <s3-bucket-uri>
import_table:
$ ./OmniSci-remote -h localhost:6274 import_table "'<session-id>'" "<table-name>" '<s3-bucket-uri>' 'TCopyParams(s3_region="'us-west-1'")'
COPY FROM:
omnisql> COPY <table-name> FROM <s3-bucket-uri> WITH (s3_region='us-west-1');
Configure one of the following credential sources before running omnisci_server:
- Environment variables: Set AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY.
- Credential profile: Specify a shared AWS credentials file and profile with the environment variables AWS_SHARED_CREDENTIALS_FILE (default: ~/.aws/credentials) and AWS_PROFILE (default: default).
- IAM roles (EC2 instances only): Assign an IAM role with S3 access to the instance, as described below.
Prerequisites
1. An IAM Policy that has sufficient access to the S3 bucket.
2. An IAM AWS Service Role of type Amazon EC2, which is assigned the IAM Policy from (1).
Setting Up an EC2 Instance with Roles
For a new EC2 Instance:
1. AWS Management Console > Services > Compute > EC2 > Launch Instance.
2. Select desired Amazon Machine Image (AMI) > Select.
3. Select desired Instance Type > Next: Configure Instance Details.
4. IAM Role > Select desired IAM Role > Review and Launch.
5. Review other options > Launch.
For an existing EC2 Instance:
1. AWS Management Console > Services > Compute > EC2 > Instances.
2. Mark desired instance(s) > Actions > Security > Modify IAM Role.
3. Select desired IAM Role > Save.
You can ingest data from an existing Kafka producer to an existing table in OmniSci using KafkaImporter on the command line:
KafkaImporter <table_name> <database_name> {-u|--user <user_name>} \
{-p|--passwd <user_password>} [{--host} <hostname>] \
[--port <OmniSciDB_port>] [--http] [--https] [--skip-verify] \
[--ca-cert <path>] [--delim <delimiter>] [--batch <batch_size>] \
[{-t|--transform} transformation ...] [--retry_count <retry_number>] \
[--retry_wait <delay_in_seconds>] --null <null_value_string> [--quoted true|false] \
[--line <line_delimiter>] --brokers=<broker_name:broker_port> \
--group-id=<kafka_group_id> --topic=<topic_type> [--print_error] [--print_transform]
KafkaImporter requires a functioning Kafka cluster. See the Kafka website and the Confluent schema registry documentation.
Setting | Default | Description | |
<table_name> | n/a | Name of the target table in OmniSci | |
<database_name> | n/a | Name of the target database in OmniSci | |
-u <username> | n/a | User name | |
-p <password> | n/a | User password | |
--host <hostname> | localhost | Name of OmniSci host | |
--port <port_number> | 6274 | Port number for OmniSciDB on localhost | |
--http | n/a | Use HTTP transport | |
--https | n/a | Use HTTPS transport | |
--skip-verify | n/a | Do not verify validity of SSL certificate | |
--ca-cert <path> | n/a | Path to the trusted server certificate; initiates an encrypted connection | |
--delim <delimiter> | comma (,) | Field delimiter, in single quotes | |
--line <delimiter> | newline (\n) | Line delimiter, in single quotes | |
--batch <batch_size> | 10000 | Number of records in a batch | |
--retry_count <retry_number> | 10 | Number of attempts before job fails | |
--retry_wait <seconds> | 5 | Wait time in seconds after server connection failure | |
--null <string> | n/a | String that represents null values | |
--quoted <boolean> | false | Whether the source contains quoted fields | |
`-t | --transform` | n/a | Regex transformation |
--print_error | false | Print error messages | |
--print_transform | false | Print description of transform | |
--help | n/a | List options | |
--group-id <id> | n/a | Kafka group ID | |
--topic <topic> | n/a | The Kafka topic to be ingested | |
--brokers <broker_name:broker_port> | localhost:9092 | One or more brokers | |
KafkaImporter Logging Options
Setting | Default | Description |
--log-directory <directory> | mapd_log | Logging directory; can be relative to data directory or absolute |
--log-file-name <filename> | n/a | Log filename relative to logging directory; has format KafkaImporter.{SEVERITY}.%Y%m%d-%H%M%S.log |
--log-symlink <symlink> | n/a | Symlink to active log; has format KafkaImporter.{SEVERITY} |
--log-severity <level> | INFO | Log-to-file severity level: INFO, WARNING, ERROR, or FATAL |
--log-severity-clog <level> | ERROR | Log-to-console severity level: INFO, WARNING, ERROR, or FATAL |
--log-channels | n/a | Log channel debug info |
--log-auto-flush | n/a | Flush logging buffer to file after each message |
--log-max-files <files_number> | 100 | Maximum number of log files to keep |
--log-min-free-space <bytes> | 20,971,520 | Minimum number of bytes available on the device before oldest log files are deleted |
--log-rotate-daily | 1 | Start new log files at midnight |
--log-rotation-size <bytes> | 10485760 | Maximum file size, in bytes, before new log files are created |
Configure KafkaImporter to use your target table. KafkaImporter listens to a pre-defined Kafka topic associated with your table. You must create the table before using the KafkaImporter utility. For example, you might have a table named customer_site_visit_events that listens to a topic named customer_site_visit_events_topic.
The data format must be a record-level format supported by OmniSci.
KafkaImporter listens to the topic, validates records against the target schema, and ingests topic batches of your designated size to the target table. Rejected records use the existing reject reporting mechanism. You can start, shut down, and configure KafkaImporter independent of the OmniSciDB engine. If KafkaImporter is running and the database shuts down, KafkaImporter shuts down as well. Reads from the topic are nondestructive.
KafkaImporter is not responsible for event ordering; a streaming platform outside OmniSci (for example, Spark Streaming or Flink) should handle the stream processing. OmniSci ingests the end-state stream of post-processed events.
KafkaImporter does not handle dynamic schema creation on first ingest; it must be configured with a specific target table (and its schema) as the basis. There is a 1:1 correspondence between target table and topic. For example:
cat tweets.tsv | ./KafkaImporter tweets_small omnisci \
-u imauser \
-p imapassword \
--delim '\t' \
--batch 100000 \
--retry_count 360 \
--retry_wait 10 \
--null null \
--port 9999 \
--brokers=localhost:9092 \
--group-id=testImport1 \
--topic=tweet
StreamImporter is an updated version of the StreamInsert utility used for streaming reads from delimited files into OmniSciDB. StreamImporter uses a binary columnar load path, providing improved performance compared to StreamInsert.
You can ingest data from a data stream to an existing table in OmniSci using StreamImporter on the command line:
StreamImporter <table_name> <database_name> {-u|--user <user_name>} \
{-p|--passwd <user_password>} [{--host} <hostname>] [--port <OmniSciDB_port>] \
[--http] [--https] [--skip-verify] [--ca-cert <path>] [--delim <delimiter>] \
[--null <null string>] [--line <line delimiter>] [--quoted <boolean>] \
[--batch <batch_size>] [{-t|--transform} transformation ...] \
[--retry_count <number_of_retries>] [--retry_wait <delay_in_seconds>] \
[--print_error] [--print_transform]
Setting | Default | Description | |
<table_name> | n/a | Name of the target table in OmniSci | |
<database_name> | n/a | Name of the target database in OmniSci | |
-u <username> | n/a | User name | |
-p <password> | n/a | User password | |
--host <hostname> | n/a | Name of OmniSci host | |
--port <port> | 6274 | Port number for OmniSciDB on localhost | |
--http | n/a | Use HTTP transport | |
--https | n/a | Use HTTPS transport | |
--skip-verify | n/a | Do not verify validity of SSL certificate | |
--ca-cert <path> | n/a | Path to the trusted server certificate; initiates an encrypted connection | |
--delim <delimiter> | comma (,) | Field delimiter, in single quotes | |
--null <string> | n/a | String that represents null values | |
--line <delimiter> | newline (\n) | Line delimiter, in single quotes | |
--quoted <boolean> | true | Either true or false , indicating whether the input file contains quoted fields. | |
--batch <number> | 10000 | Number of records in a batch | |
--retry_count <retry_number> | 10 | Number of attempts before job fails | |
--retry_wait <seconds> | 5 | Wait time in seconds after server connection failure | |
`-t | --transform` | n/a | Regex transformation |
--print_error | false | Print error messages | |
--print_transform | false | Print description of transform |