An Overview of the HEAVY.AI Integrated Data Science Foundation
HEAVY.AI provides an integrated data science foundation built on several open-source components of the PyData stack. This set of tools is integrated with Heavy Immerse and lets you switch from dashboards to a notebook environment connected to HeavyDB in the background. You can move from visual data exploration in Immerse to a deeper dive on a specific dataset, build predictive models using standard Python-based data science libraries and tools, and push results back into HeavyDB for use with Immerse.
Several components make up the HEAVY.AI data science foundation.
HEAVY.AI provides deep integration with JupyterLab, the next-generation version of the most popular notebook environment and workflow used by data scientists for interactive computing. You can access JupyterLab by clicking an icon in Immerse.
JupyterLab access in Immerse
JupyterLab access from SQL Editor
In addition to the seamless integration with Immerse, you can also use JupyterLab with HEAVY.AI by creating an explicit connection object via the heavyai API.
>>> from heavyai import connect
>>> con = connect(user="admin", password="HyperInteractive", host="localhost",
The heavyai client interface provides a Python DB API 2.0-compliant HEAVY.AI interface. In addition, it provides methods to get results in the Apache Arrow-based GDF format for efficient data interchange.
Step 6: Iterate over the cursor, returning a list of tuples of values
>>> result = list(c)
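Because the heavyai client follows Python DB API 2.0, the cursor pattern is the same one used by other compliant drivers. The sketch below uses the standard-library sqlite3 module as a stand-in so that it runs without a HeavyDB server. The table name and rows are made up for illustration; with heavyai, the connection object would come from connect(...) instead.

```python
import sqlite3

# Stand-in for a HeavyDB connection: sqlite3 also implements DB API 2.0.
con = sqlite3.connect(":memory:")
c = con.cursor()

# Hypothetical sample table, loosely modeled on a flights dataset.
c.execute("CREATE TABLE flights (depdelay INTEGER, arrdelay INTEGER)")
c.executemany(
    "INSERT INTO flights VALUES (?, ?)",
    [(5, 7), (-2, 0), (30, 41)],
)

# Execute a query on the cursor, then iterate over the cursor itself.
c.execute(
    "SELECT depdelay, arrdelay FROM flights "
    "WHERE depdelay > 0 ORDER BY depdelay"
)
result = list(c)  # a list of tuples, one tuple per row

print(result)
con.close()
```

Each row comes back as a tuple whose fields follow the order of the SELECT list.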
Select Data into a GpuDataFrame Provided by pygdf
Step 1: Create a connection to a local HEAVY.AI instance
>>> from heavyai import connect
>>> con = connect(user="heavyai", password="HyperInteractive", host="localhost",
Step 2: Query the database table of flight departure and arrival delay times into a GpuDataFrame
>>> query = "SELECT depdelay, arrdelay FROM flights_2008_10k limit 100"
>>> df = con.select_ipc_gpu(query)
Step 3: Display results
Remote Backend Compiler (RBC)
Using Python, you can interact with databases in multiple ways. Libraries like SQLAlchemy provide a translation mechanism that converts Python to SQL; this is an example of an ORM (Object-Relational Mapping). With SQLAlchemy and similar approaches, user interactions with the database are simplified—and optimized—as a set of high-level functions provided by the ORM. Unfortunately, to run tasks not supported by the ORM, you need to write SQL code.
You can define your own SQL functions in HeavyDB, but adding functions to the engine itself requires recompiling it. To let you write functions that can also execute on GPUs, HeavyDB supports User-Defined Functions (UDFs) and User-Defined Table Functions (UDTFs). A UDF operates on individual elements of a table; a UDTF operates on an entire table.
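The difference between the two kinds of function can be sketched in plain Python. This is only a conceptual illustration: the function names and sample values are hypothetical, and nothing here uses a HeavyDB API or runs on a GPU.

```python
def to_hours(minutes):
    """UDF-style: receives one element and returns one element."""
    return minutes / 60.0


def apply_udf(func, column):
    """Conceptually what an engine does with a UDF: call it once per row."""
    return [func(value) for value in column]


def deviation_from_mean(column):
    """UDTF-style: receives the whole column, so it can use table-wide statistics."""
    mean = sum(column) / len(column)
    return [value - mean for value in column]


# Made-up delay values in minutes.
delays_minutes = [30.0, 90.0, 60.0]

delays_hours = apply_udf(to_hours, delays_minutes)
centered = deviation_from_mean(delays_hours)

print(delays_hours)
print(centered)
```

The element-wise function never sees more than one value, while the table-style function needs every row before it can produce any output.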
The Remote Backend Compiler (RBC) package provides a Python interface to define UDFs and UDTFs easily. Any UDF or UDTF written in Python can be registered at run time on the HeavyDB server and subsequently used in any SQL query by any client.
User-defined function schematic. Decorate a Python function to be able to call it from SQL.
Functions are not persisted in the database and must be registered again if the server is restarted.
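The decorate-and-register idea can be tried without a HeavyDB server by using the standard-library sqlite3 module as a stand-in. This is not the RBC API: sql_function is a hypothetical helper written for this sketch, and sqlite3 runs the Python function in-process rather than compiling it and sending it to a server.

```python
import sqlite3


def sql_function(con, name, num_args):
    """Hypothetical decorator: register the decorated function on `con` as `name`."""
    def register(func):
        con.create_function(name, num_args, func)
        return func
    return register


con = sqlite3.connect(":memory:")


@sql_function(con, "fahrenheit_to_celsius", 1)
def fahrenheit_to_celsius(f):
    return (f - 32.0) * 5.0 / 9.0


# The Python function is now callable from SQL on this connection.
row = con.execute("SELECT fahrenheit_to_celsius(212.0)").fetchone()

# A fresh connection has no such function until it is registered again.
other = sqlite3.connect(":memory:")
try:
    other.execute("SELECT fahrenheit_to_celsius(212.0)")
    registered_elsewhere = True
except sqlite3.OperationalError:
    registered_elsewhere = False

print(row, registered_elsewhere)
```

In this stand-in the registration belongs to a single connection, which is loosely analogous to RBC functions having to be registered again after a server restart.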
Internally, the RBC converts the Python function to an intermediate representation (IR), which is then sent to the server. The IR is compiled on a CPU or a GPU, depending on specified hardware resources.
Ibis is an ORM that supports defining UDFs in C++ for some types of databases. However, it does not provide a Python interface for defining them.
Ibis is a productivity API for working in Python and analyzing data in remote SQL-based data stores such as HeavyDB. Inspired by the pandas toolkit for data analysis, Ibis provides a Pythonic API that compiles to SQL. Combined with HeavyDB scale and speed, Ibis offers a familiar but more powerful method for analyzing very large datasets "in-place."
Ibis supports multiple SQL database backends, and also supports pandas as a native backend. Combined with Altair, this integration allows you to explore multiple datasets across different data sources.
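The notion of a Pythonic expression that compiles to SQL can be illustrated with a toy query builder. The Table and Column classes below are hypothetical and written only for this sketch; they are not the Ibis API and cover just enough to show expressions being turned into SQL text rather than being evaluated in Python.

```python
class Column:
    """A named column whose comparisons build SQL text instead of evaluating."""

    def __init__(self, name):
        self.name = name

    def __gt__(self, value):
        return f"{self.name} > {value!r}"


class Table:
    """Toy immutable query builder: every method returns a new Table."""

    def __init__(self, name, columns=("*",), predicates=(), row_limit=None):
        self.name = name
        self.columns = tuple(columns)
        self.predicates = tuple(predicates)
        self.row_limit = row_limit

    def __getitem__(self, column):
        return Column(column)

    def select(self, *columns):
        return Table(self.name, columns, self.predicates, self.row_limit)

    def filter(self, predicate):
        return Table(self.name, self.columns,
                     self.predicates + (predicate,), self.row_limit)

    def limit(self, n):
        return Table(self.name, self.columns, self.predicates, n)

    def compile(self):
        """Render the accumulated expression as a SQL string."""
        sql = f"SELECT {', '.join(self.columns)} FROM {self.name}"
        if self.predicates:
            sql += " WHERE " + " AND ".join(self.predicates)
        if self.row_limit is not None:
            sql += f" LIMIT {self.row_limit}"
        return sql


flights = Table("flights_2008_10k")
expr = (
    flights.filter(flights["depdelay"] > 15)
    .select("depdelay", "arrdelay")
    .limit(100)
)
sql = expr.compile()
print(sql)
```

No data is touched until the generated SQL is sent to a database, which is what allows this style of API to work "in-place" on large remote tables.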
Altair is another key component of the HEAVY.AI data science foundation. Building on the same Vega data visualization engine used by Immerse for geospatial charts, Altair provides a Pythonic API over Vega-Lite, a subset of the full Vega specification for declarative charting based on the "Grammar of Graphics" paradigm. The HEAVY.AI data science foundation goes further and includes interface code that enables Altair to transparently use Ibis expressions instead of pandas data frames. This allows data visualization over much larger datasets in HEAVY.AI without writing SQL code.
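To make "declarative charting" concrete, the sketch below writes a small Vega-Lite specification by hand as a plain Python dictionary. It does not use Altair itself, which generates specifications of this kind from its Python API. The sample rows are made up for illustration.

```python
import json

# Inline sample rows; the values are made up for illustration.
rows = [
    {"depdelay": 5, "arrdelay": 7},
    {"depdelay": -2, "arrdelay": 0},
    {"depdelay": 30, "arrdelay": 41},
]

# A Vega-Lite spec describes what to draw (data, mark, encodings),
# not how to draw it.
spec = {
    "$schema": "https://vega.github.io/schema/vega-lite/v5.json",
    "data": {"values": rows},
    "mark": "point",
    "encoding": {
        "x": {"field": "depdelay", "type": "quantitative"},
        "y": {"field": "arrdelay", "type": "quantitative"},
    },
}

spec_json = json.dumps(spec, indent=2)
print(spec_json)
```

The resulting JSON can be pasted into any Vega-Lite renderer to produce a scatter plot of the two delay columns.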
The NVIDIA RAPIDS toolkit is a collection of foundational libraries for GPU-accelerated data science and machine learning. It includes popular algorithms for clustering, classification, and linear models, as well as a GPU-based dataframe (cuDF). HEAVY.AI allows configurable output to cuDF from any query (including via Ibis or heavyai), so you can quickly run machine-learning algorithms on top of query results from HEAVY.AI.
Other Tools and Utilities
In addition, the data science foundation Docker container includes Facebook's Prophet library for forecasting, and Prefect, a lightweight but powerful workflow engine that enables you to build and manage workflows in Python.