# Data Science Foundation

HEAVY.AI provides an integrated data science foundation built on several open-source components of the PyData stack. This set of tools is integrated with Heavy Immerse and lets you switch from dashboards to a notebook environment connected to HeavyDB in the background. You can move from visual data exploration in Immerse to a deeper dive on a specific dataset, build predictive models using standard Python-based data science libraries and tools, and push results back into HeavyDB for use with Immerse.

Several components make up the HEAVY.AI data science foundation.

## JupyterLab

HEAVY.AI provides deep integration with JupyterLab, the next-generation version of the most popular notebook environment and workflow used by data scientists for interactive computing. You can access JupyterLab by clicking an icon in Immerse.

![JupyterLab access in Immerse](/files/kMXKtqgU8XaAmGp4nmmX)

![JupyterLab access from SQLEditor](/files/2Pqf4pGI7x4z3A87FsHh)

In addition to the seamless integration with Immerse, you can also use JupyterLab with HEAVY.AI by creating an explicit connection object, either via the [heavyai](https://heavyai.readthedocs.io/en/latest/) API:

```python
>>> from heavyai import connect
>>> con = connect(user="admin", password="HyperInteractive", host="localhost",
...               dbname="heavyai")
>>> con
Connection(mapd://admin:***@localhost:6274/heavyai?protocol=binary)
```

or via the [Ibis-heavyai](https://github.com/heavyai/ibis-heavyai) API, which builds on [heavyai](https://heavyai.readthedocs.io/en/latest/).

```python
import ibis  # the heavyai backend requires the ibis-heavyai package

con = ibis.heavyai.connect(
    host='localhost',
    database='ibis_testing',
    user='admin',
    password='HyperInteractive',
)
```
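The connection examples above hardcode credentials, which is convenient in documentation but awkward in a shared notebook. The following is a minimal sketch, not part of the `heavyai` API, of collecting the same keyword arguments from environment variables. The function name `connection_kwargs` and the `HEAVYAI_*` variable names are hypothetical and chosen only for this illustration.

```python
import os


def connection_kwargs(env=os.environ):
    """Build keyword arguments for heavyai.connect from environment variables.

    The HEAVYAI_* variable names are hypothetical, chosen for this example.
    """
    return {
        "user": env.get("HEAVYAI_USER", "admin"),
        "password": env.get("HEAVYAI_PASSWORD", ""),
        "host": env.get("HEAVYAI_HOST", "localhost"),
        "dbname": env.get("HEAVYAI_DBNAME", "heavyai"),
    }


# Usage (requires the heavyai package and a running HeavyDB server):
#   from heavyai import connect
#   con = connect(**connection_kwargs())
```

The keys match the `user`, `password`, `host`, and `dbname` arguments used in the `heavyai` example above. The Ibis connection takes `database` instead of `dbname`, so the dictionary would need that key renamed before being passed to `ibis.heavyai.connect`.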

For more information, see the [JupyterLab documentation](https://jupyterlab.readthedocs.io/en/stable/).

## heavyai

The `heavyai` client interface provides a Python DB API 2.0-compliant HEAVY.AI interface. In addition, it provides methods to get results in the Apache Arrow-based GDF format for efficient data interchange.

### Documentation

See the GitHub [heavyai repository](https://github.com/heavyai/heavyai) and the following documentation:

* [Installation](https://heavyai.readthedocs.io/en/latest/usage.html#installing-heavyai)
* [Getting started](https://heavyai.readthedocs.io/en/latest/usage.html)
* [API reference](https://heavyai.readthedocs.io/en/latest/api.html)

### Examples

#### Create a Cursor and Execute a Query

**Step 1: Create a connection**

```python
>>> from heavyai import connect
>>> con = connect(user="heavyai", password="HyperInteractive", host="my.host.com", dbname="heavyai")
```

**Step 2: Create a cursor**

```python
>>> c = con.cursor()
>>> c
```

**Step 3: Query database table of flight departure and arrival delay times**

```python
>>> c.execute("SELECT depdelay, arrdelay FROM flights LIMIT 100")
```

**Step 4: Display number of rows returned**

```python
>>> c.rowcount
100
```

**Step 5: Display the Description objects list**

Each entry in the list is a named tuple with the attributes required by the DB API 2.0 specification. There is one entry per returned column, and `heavyai` fills in the `name`, `type_code`, and `null_ok` attributes.

```python
>>> c.description
[Description(name=u'depdelay', type_code=0, display_size=None, internal_size=None, precision=None, scale=None, null_ok=True), Description(name=u'arrdelay', type_code=0, display_size=None, internal_size=None, precision=None, scale=None, null_ok=True)]
```

**Step 6: Iterate over the cursor, returning a list of tuples of values**

```python
>>> result = list(c)
>>> result[:5]
[(1, 14), (2, 4), (5, 22), (-1, 8), (-1, -2)]
```
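Because the cursor follows DB API 2.0, the column names in `description` can be paired with each row to build records keyed by column name. The sketch below illustrates the pattern using the standard library's `sqlite3` module as a stand-in, so that it runs without a HeavyDB server. The in-memory table and its three rows are made up for illustration. With `heavyai`, the same two lines that build `columns` and `records` should apply to the cursor from Step 2.

```python
import sqlite3

# sqlite3 stands in for a HeavyDB connection here; both follow DB API 2.0.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE flights (depdelay INTEGER, arrdelay INTEGER)")
con.executemany("INSERT INTO flights VALUES (?, ?)", [(1, 14), (2, 4), (5, 22)])

c = con.cursor()
c.execute("SELECT depdelay, arrdelay FROM flights LIMIT 100")

# The first field of each description entry is the column name.
columns = [d[0] for d in c.description]

# Iterating over the cursor yields one tuple per row.
records = [dict(zip(columns, row)) for row in c]

print(records[0])  # {'depdelay': 1, 'arrdelay': 14}
```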

#### Select Data into a GpuDataFrame Provided by pygdf

**Step 1: Create a connection to local HEAVY.AI instance**

```python
>>> from heavyai import connect
>>> con = connect(user="heavyai", password="HyperInteractive", host="localhost",
...               dbname="heavyai")
```

**Step 2: Query the database table of flight departure and arrival delay times into a GpuDataFrame**

```python
>>> query = "SELECT depdelay, arrdelay FROM flights_2008_10k limit 100"
>>> df = con.select_ipc_gpu(query)
```

**Step 3: Display results**

```python
>>> df.head()
  depdelay arrdelay
0       -2      -13
1       -1      -13
2       -3        1
3        4       -3
4       12        7
```

## Remote Backend Compiler (RBC)

Using Python, you can interact with databases in multiple ways. Libraries like [SQLAlchemy](https://www.sqlalchemy.org/) provide a translation mechanism that converts Python to SQL; this is an example of an [ORM (Object-Relational Mapping)](https://en.wikipedia.org/wiki/Object%E2%80%93relational_mapping). With SQLAlchemy and similar approaches, user interactions with the database are simplified—and optimized—as a set of high-level functions provided by the ORM. Unfortunately, to run tasks not supported by the ORM, you need to write SQL code.

You can define your own SQL functions in HeavyDB, but to realize the full power of HeavyDB, you have to recompile the engine to add your functions. For functions that execute on GPUs, HeavyDB supports *User Defined Functions* (UDFs) and *User Defined Table Functions* (UDTFs). A UDF operates on individual elements of a table; a UDTF operates on an entire table.

The [Remote Backend Compiler (RBC)](https://rbc.readthedocs.io/en/latest) package provides a Python interface to define UDFs and UDTFs easily. Any UDF or UDTF written in Python can be registered at run time on the HeavyDB server and subsequently used in any SQL query by any client.

![User-defined function schematic. Decorate a Python function to be able to call it with SQL.](/files/zZ4oQlaB3z9okMqkDvaI)

{% hint style="info" %}
Functions are not persisted in the database and must be registered again if the server is restarted.
{% endhint %}

Internally, RBC converts the Python function to an *intermediate representation* (IR), which is then sent to the server. The IR is compiled for a CPU or a GPU, depending on the specified hardware resources.
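The decorate-and-register workflow described above might look like the following sketch. It assumes the `rbc-project` package is installed and exposes `rbc.heavydb.RemoteHeavyDB` with a `register()` method; check the RBC documentation for the API of your installed version. The function name `fahrenheit2celsius` and the helper `register_on_server` are illustrative only. The RBC import is deferred into the helper so that the plain Python logic can be tested locally without a server.

```python
def fahrenheit2celsius(f):
    """Plain Python logic; runs locally with no server involved."""
    return (f - 32) * 5 / 9


def register_on_server(host="localhost", port=6274):
    """Register fahrenheit2celsius as a UDF on a running HeavyDB server.

    Assumption: rbc-project is installed and provides RemoteHeavyDB.
    """
    from rbc.heavydb import RemoteHeavyDB

    heavydb = RemoteHeavyDB(user="admin", password="HyperInteractive",
                            host=host, port=port)
    # The signature string declares a UDF that takes and returns a double.
    udf = heavydb("double(double)")(fahrenheit2celsius)
    # Assumption: register() sends the pending definitions to the server.
    heavydb.register()
    return udf


# Local check of the function body, independent of HeavyDB.
print(fahrenheit2celsius(212.0))  # 100.0
```

Once registered, the function should be callable by name from SQL, for example in a `SELECT` over a numeric column. Because registrations are not persisted, a helper like `register_on_server` would need to be run again after each server restart.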

{% hint style="info" %}
[Ibis](https://ibis-project.org/docs/index.html) is an ORM that supports defining UDFs in C++ for some types of databases. However, it doesn’t provide a Python interface.
{% endhint %}

## Ibis

Ibis is a productivity API for working in Python and analyzing data in remote SQL-based data stores such as HeavyDB. Inspired by the pandas toolkit for data analysis, Ibis provides a Pythonic API that compiles to SQL. Combined with HeavyDB scale and speed, Ibis offers a familiar but more powerful method for analyzing very large datasets "in-place."

Ibis supports multiple SQL database backends, and also supports pandas as a native backend. Combined with Altair, this integration allows you to explore multiple datasets across different data sources.

![](/files/MuX6hLt0ue4jvgXDyLSq)

## Altair

[Altair](https://altair-viz.github.io/) is another key component of the HEAVY.AI data science foundation. Building on the same [Vega](https://vega.github.io/) data visualization engine used by Immerse for geospatial charts, Altair provides a Pythonic API over [Vega-Lite](https://vega.github.io/vega-lite/docs/), a subset of the full Vega specification for declarative charting based on the "Grammar of Graphics" paradigm. The HEAVY.AI data science foundation goes further and includes interface code that lets Altair transparently use Ibis expressions instead of pandas data frames. This allows data visualization over much larger datasets in HEAVY.AI without writing SQL code.

![](/files/GocIpdIJluz1J1to4Ovn)

## NVIDIA RAPIDS <a href="#altair" id="altair"></a>

The [NVIDIA RAPIDS](https://rapids.ai/) toolkit is a collection of foundational libraries for GPU-accelerated data science and machine learning. It includes popular algorithms for clustering, classification, and linear models, as well as a GPU-based dataframe ([cudf](https://github.com/rapidsai/cudf)). HEAVY.AI allows configurable output to cudf from any query (including via Ibis or `heavyai`), so you can quickly run machine-learning algorithms on top of query results from HEAVY.AI.

![](/files/D7iC1tXwljK6wSfxhQiu)

## Other Tools and Utilities <a href="#altair-1" id="altair-1"></a>

In addition, the data science foundation Docker container includes Facebook's [Prophet](https://facebook.github.io/prophet/) library for forecasting, and [Prefect](https://www.prefect.io/), a lightweight but powerful workflow engine that enables you to build and manage workflows in Python.

