    Lentiq applications

    Lentiq provides a set of curated applications and application clusters that can be deployed automatically in a project. Applications are supervised and are restarted automatically on failure. All curated applications are deployed in containers and draw on the pool of resources allocated to a project through the project quota property.

    Application clusters can typically be:

    • scaled horizontally and vertically, independently of one another, depending on each application's architecture
    • provisioned and scaled to zero at any point in time, to minimize costs when applications are not in use
    • run as multiple instances of the same application within a data pool project, so that each member of your team has complete freedom
    • started instantly, with a very low boot time

    By default, applications are easily interconnected. A user can connect a Jupyter notebook to a specific Spark cluster, perform data analysis at scale, and shut the Spark cluster down when it is no longer needed. In addition, users can interact with our data and metadata management layer through Spark by creating tables: SQL-based representations of files stored in the data lake. Once a table is created in a project, it is automatically exposed in our Table Browser, where users can add documentation around it.
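
    As a minimal sketch of that workflow from a notebook (the storage path and table name below are illustrative, not Lentiq defaults):

    ```python
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("table-example").getOrCreate()

    # Read raw files from project storage; the bucket path is a placeholder.
    df = spark.read.csv("s3a://my-data-pool/raw/events.csv",
                        header=True, inferSchema=True)

    # Persist the DataFrame as a SQL table; once created, it appears
    # in the Table Browser.
    df.write.saveAsTable("events")
    ```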

    On-premises or cloud-based BI and visualization tools can easily be connected to data stored in the data lake through the Spark Thrift Server-based JDBC connector.

    Jupyter

    Jupyter is the de facto notebook technology for data scientists. Lentiq's Jupyter is integrated with the Spark engine as well as standard tools such as NumPy, Ray, Dash, Seaborn, scikit-learn, SciPy, and Matplotlib, and supports the Python and Scala programming languages.
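
    The bundled Python stack is available directly in a notebook cell; this sketch assumes only the default libraries listed above:

    ```python
    import numpy as np
    import matplotlib.pyplot as plt

    # Plot a simple curve using the preinstalled NumPy and Matplotlib.
    x = np.linspace(0, 10, 200)
    plt.plot(x, np.sin(x))
    plt.title("sin(x)")
    plt.show()
    ```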

    Apache Spark

    Apache Spark is offered as a pay-per-use, fully managed, large-scale, in-memory data processing service capable of machine learning and graph processing. In each project, one can provision multiple independent, smaller clusters for separate jobs, which simplifies management in multi-tenant environments.
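
    A notebook attaches to a specific cluster through an ordinary SparkSession; the master URL below is a placeholder for the endpoint Lentiq shows for your cluster:

    ```python
    from pyspark.sql import SparkSession

    # Hypothetical master URL; substitute your own cluster's endpoint.
    spark = (SparkSession.builder
             .master("spark://spark-cluster-1:7077")
             .appName("my-analysis")
             .getOrCreate())

    # Quick sanity check that the cluster is reachable.
    print(spark.range(1000).count())
    ```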

    SparkSQL

    SparkSQL is a service that provides industry-standard JDBC and ODBC connectivity, giving business intelligence tools access to data coming from a variety of sources and Spark programs. It allows users to seamlessly connect tools such as Tableau, Looker, QlikView, Zoomdata, or Power BI.
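
    The same endpoint can also be queried programmatically. This sketch uses PyHive, which speaks the HiveServer2/Thrift protocol exposed by the Spark Thrift Server; the host, port, and table name are placeholders:

    ```python
    from pyhive import hive

    # Placeholder endpoint; use the SparkSQL service address from your project.
    conn = hive.connect(host="sparksql.example.com", port=10000)
    cur = conn.cursor()
    cur.execute("SELECT COUNT(*) FROM events")
    print(cur.fetchone())
    ```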

    Kafka

    Kafka is a fast, scalable queuing system designed as an intermediate layer between producers and consumers of data. It can be used to bring data into Lentiq by connecting it to Spark Streaming jobs.
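
    A minimal sketch of such an ingestion job using Spark Structured Streaming; the broker address, topic, and storage paths are placeholders, and the job assumes the spark-sql-kafka package is available:

    ```python
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("kafka-ingest").getOrCreate()

    # Subscribe to a Kafka topic; broker and topic names are placeholders.
    stream = (spark.readStream
              .format("kafka")
              .option("kafka.bootstrap.servers", "kafka-0:9092")
              .option("subscribe", "events")
              .load())

    # Land the raw messages in the data lake as Parquet files.
    (stream.selectExpr("CAST(value AS STRING) AS value")
     .writeStream
     .format("parquet")
     .option("path", "s3a://my-data-pool/raw/events")
     .option("checkpointLocation", "s3a://my-data-pool/checkpoints/events")
     .start())
    ```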

    SFTPProxy

    SFTPProxy is a service that allows you to easily import high-volume data into a project within the data lake.
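
    From a client machine, the import is an ordinary SFTP transfer. This sketch uses the paramiko library, with a placeholder hostname, credentials, and paths standing in for the ones issued for your project:

    ```python
    import paramiko

    # Placeholder host and credentials; use the ones provided by SFTPProxy.
    transport = paramiko.Transport(("sftp.example.com", 22))
    transport.connect(username="project-user", password="secret")
    sftp = paramiko.SFTPClient.from_transport(transport)

    # Upload a local file into the project's storage area.
    sftp.put("data/events.csv", "/uploads/events.csv")

    sftp.close()
    transport.close()
    ```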

    PostgreSQL

    PostgreSQL is an open source relational database designed to store structured data.
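
    Connecting from Python follows the usual pattern; this sketch uses psycopg2 with placeholder connection details for an instance deployed in a project:

    ```python
    import psycopg2

    # Placeholder connection details; use your project's PostgreSQL endpoint.
    conn = psycopg2.connect(host="postgresql.example.com", dbname="analytics",
                            user="project-user", password="secret")
    with conn.cursor() as cur:
        cur.execute("SELECT version()")
        print(cur.fetchone())
    conn.close()
    ```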

    Streamsets

    StreamSets is an open source data ingestion and transformation engine that streamlines data integration tasks.
