Querying data with SQL (DataGrip)

Querying data using SQL is a fundamental use of any data lake. Lentiq is compatible with most JDBC/ODBC-compatible tools and uses Apache Spark's query engine.

(Diagram: the SparkSQL JDBC connector)

The data is stored in Parquet format in object storage, and the schema is stored in a metastore database that is linked to Lentiq's metadata management system.

The query engine is SparkSQL, which uses Spark's in-memory mechanisms and query planner to execute SQL queries on the data.
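
For illustration, the schema and storage location of a registered table can be inspected from a notebook. This is a minimal sketch, assuming an existing SparkSession named spark and a table that has already been registered (haireyecolor is used as an example later in this guide):

# Sketch only: assumes a Jupyter notebook with an existing SparkSession `spark`
# and an already registered table. In the output, "Provider" shows the parquet
# format and "Location" points at the backing files in object storage.
spark.sql("DESCRIBE FORMATTED haireyecolor").show(truncate=False)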

1. Deploy the Spark SQL Application

From Lentiq's left-hand application panel, click the SparkSQL icon.

Click Create Spark SQL.

2. Download JDBC drivers

SparkSQL is compatible with Apache Hive's JDBC connector version 1.x. The connector also has a hadoop-core dependency that is not bundled with it, so download both JARs:

mkdir ~/jdbc-drivers #you can put these anywhere 
cd ~/jdbc-drivers
curl -O http://central.maven.org/maven2/org/apache/hive/hive-jdbc/1.2.1/hive-jdbc-1.2.1-standalone.jar
curl -O http://central.maven.org/maven2/org/apache/hadoop/hadoop-core/1.2.1/hadoop-core-1.2.1.jar

3. Configure your BI tool to use the JDBC drivers

The JDBC connectors should work with all JDBC-compatible clients. For the purpose of this demonstration we're going to use JetBrains' excellent DataGrip.

  1. Click on the "+" sign and select "Driver":

  1. Click on the "+" sign from "Driver files" and add both jars

  3. Rename the driver to "Hive 1.2.1"

  4. Change the class to "org.apache.hive.jdbc.HiveDriver"

  5. On the Options tab, select the Apache Spark option in both the Dialect and Icon dropdowns.

  6. Click OK

4. Create a database connection

  1. Click the SparkSQL application's Edit button and copy the JDBC URL. It should be of the form jdbc:hive2://34.66.161.32:10000

  2. In the same dialog, on the Firewall tab, make sure the IP of your current location is whitelisted.

  3. In DataGrip, click the "+" sign and add a Data Source by selecting the newly added Hive 1.2.1.

  4. Paste the URL in the "URL" field. Click Test Connection.

  5. If the test was successful, click OK. If not, check your firewall settings at step 2.

You have now configured a data source.
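
If you want to sanity-check the endpoint outside DataGrip, a short script along the following lines can be used. This is only a sketch: it assumes the third-party PyHive package (which is not part of Lentiq's tooling) and reuses the example host and port from the JDBC URL above.

# Optional connectivity check outside DataGrip; a sketch assuming the third-party
# PyHive package is installed (pip install 'pyhive[hive]').
# Host and port are taken from the example URL jdbc:hive2://34.66.161.32:10000.
from pyhive import hive

conn = hive.connect(host="34.66.161.32", port=10000)
cursor = conn.cursor()
cursor.execute("SHOW TABLES")
print(cursor.fetchall())
conn.close()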

5. Query the datasets

  1. Execute some queries on the connection:
SHOW TABLES

SELECT * FROM haireyecolor

Performance considerations

SparkSQL scales horizontally, so if performance is not satisfactory, add more workers from SparkSQL's Configuration tab. Depending on your use case, you might also need to add more RAM to support more complex joins.
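
Before scaling the cluster, it can help to look at the physical plan of a slow query. The snippet below is a sketch, assuming a notebook with an existing SparkSession named spark and a registered table; the table name is only an example.

# Sketch only: inspect the physical plan of a query from a notebook.
# Assumes an existing SparkSession `spark`; the table name is an example.
# Shuffles and join strategies shown in the plan indicate whether extra
# workers or RAM are likely to help.
spark.sql("SELECT * FROM haireyecolor").explain()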

Creating tables

Tables are created either through an import process using a Reusable Code Block or via a Jupyter notebook. In both cases the table needs to be "registered" in the metastore by calling Spark's saveAsTable() function.
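
For instance, from a notebook a DataFrame can be registered along these lines. This is a minimal sketch, assuming an existing SparkSession named spark; the input path and table name are only examples.

# Sketch only: register a DataFrame as a metastore table from a Jupyter notebook.
# Assumes an existing SparkSession `spark`; input path and table name are examples.
df = spark.read.csv("/datasets/haireyecolor.csv", header=True, inferSchema=True)

# saveAsTable() writes the data and registers the table in the metastore,
# making it queryable from SparkSQL clients such as DataGrip.
df.write.format("parquet").mode("overwrite").saveAsTable("haireyecolor")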

Another option is to create tables directly from external files (such as Parquet or CSV) using the external SQL tool.

For example, to create a new table, execute a CREATE TABLE ... LOCATION statement:

CREATE TABLE IF NOT EXISTS iris_dataset (`sl` DOUBLE, `sw` DOUBLE, `pl` DOUBLE, `pw` DOUBLE)
USING PARQUET
LOCATION '/datasets/iris-dataset.parquet'

To query the newly created table:

SELECT * FROM iris_dataset
