Documentation
Migrating from Hadoop

Migrating from Hadoop to a cloud-native solution is relatively straightforward, but there are a few fundamental differences. Not every element of the Hadoop ecosystem needs an equivalent in Lentiq, and Hadoop does not include data-science-specific elements such as notebooks, reusable code blocks, model serving, or cloud provisioning.

| Functionality | Hadoop | Lentiq |
| --- | --- | --- |
| Data processing engine | Spark/MapReduce | Spark, Ray |
| Data storage | HDFS | Object storage (via Lentiq's abstraction layer) |
| Resource manager | YARN | Kubernetes (abstracted by Lentiq) |
| SQL engine | Apache Hive, Impala | Presto |
| Queue | Apache Kafka | Apache Kafka |
| Import, ETL | Sqoop, Spark | Spark, StreamSets |
| Workflow scheduler | Oozie | Lentiq Workflow |
| UI | HUE, Ambari | Lentiq's UI |
| Command line interface | hdfs dfs CLI | Lentiq bdl CLI |
| Data protection | HDFS replication & EC | Cloud provider's object storage durability SLA |
| Data locality | HDFS collocated with YARN | Compute nodes separate from storage |
| Authentication | Kerberos | API key & JWT |
| Authorization | LDAP + HDFS ACLs | Object storage ACLs, Lentiq permissions |
| Schema sharing | Hive Metastore | Lentiq data management |
| Data catalog | Apache Atlas | Lentiq Data Store |
| Scripting | Pig | Lentiq LambdaBook |
| REST API gateway | Apache Knox | Lentiq's firewall management |
| 3rd-party application provisioning | Hadoop distribution specific | Lentiq Applications |
| Federation support | HDFS Federation | Lentiq Interconnected Data Pools |

Of course, this is not an apples-to-apples comparison: Hadoop is a downloadable application stack, whereas Lentiq is a SaaS offering. Both, however, provide a secure and scalable data storage and processing environment, and hence serve the same function.

Performance & Scalability considerations

Both Hadoop and Lentiq offer a similar performance profile for most datasets, thanks to increasing RAM availability and the in-memory processing capabilities of Apache Spark. Specific use cases such as deep learning are much better supported on GPU-enabled Lentiq clusters than they are in Hadoop.

Hadoop's data locality, with its sophisticated task-to-data-node scheduling and short-circuit read mechanisms, enables better per-node performance, especially on large datasets. In Lentiq, data locality is sacrificed for the ability to scale the compute and storage clusters independently. This removes the need to over-provision compute or storage resources and enables a comparable performance profile at a lower cost.

Lentiq's ability to link multiple data pools together is similar in concept to HDFS Federation, but much more complex. For performance and security reasons, a separate data pool should be created for each team working on its own data. Data pools can be provisioned within the same region to avoid cross-region traffic charges.

Importing structured data

In Hadoop, structured data was mostly imported via Sqoop and stored in sequence files. In Lentiq, Spark can connect via JDBC to virtually any data source, so a dedicated import tool is no longer needed. And because Lentiq makes it easy to turn a notebook into a schedulable "task", Spark can be used not just to import the data but also to pre-process and analyze it.
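A JDBC import like the one described above boils down to a handful of reader options. The sketch below only builds that option dict (the connection URL, table, and credentials are illustrative, not Lentiq-specific); the commented lines show how it would feed Spark's standard JDBC reader inside a notebook with a live `SparkSession`:

```python
def jdbc_read_options(url, table, user, password,
                      partition_column=None, lower=None, upper=None,
                      num_partitions=None):
    """Build the option dict for Spark's JDBC reader (spark.read.format('jdbc'))."""
    opts = {
        "url": url,          # e.g. "jdbc:postgresql://db-host:5432/sales" (illustrative)
        "dbtable": table,
        "user": user,
        "password": password,
    }
    if partition_column is not None:
        # Parallel read: Spark splits [lowerBound, upperBound] on the
        # partition column into numPartitions concurrent scans.
        opts.update({
            "partitionColumn": partition_column,
            "lowerBound": str(lower),
            "upperBound": str(upper),
            "numPartitions": str(num_partitions),
        })
    return opts

# In a notebook with a live SparkSession `spark`, the import itself would be:
# df = (spark.read.format("jdbc")
#           .options(**jdbc_read_options("jdbc:postgresql://db-host:5432/sales",
#                                        "orders", "etl_user", "secret",
#                                        partition_column="order_id",
#                                        lower=1, upper=1_000_000, num_partitions=8))
#           .load())
# df.write.parquet("orders/")  # lands in the data pool's object storage
```

Once the notebook is scheduled as a task, the same code doubles as a recurring import job.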

More complicated import strategies, such as incremental imports, should be implemented with the more sophisticated StreamSets application.
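For simple cases, an incremental pull can also be expressed through Spark's JDBC reader by pushing a watermark predicate down as a subquery in the `dbtable` option. The table and column names below are illustrative, not from the source:

```python
def incremental_dbtable(table, watermark_column, last_seen):
    """Subquery for the JDBC 'dbtable' option that selects only rows
    newer than the last value seen by the previous import run."""
    return (f"(SELECT * FROM {table} "
            f"WHERE {watermark_column} > '{last_seen}') AS incremental")

# e.g. .option("dbtable", incremental_dbtable("orders", "updated_at", "2019-01-01"))
# after each run, persist the new max(updated_at) as the next watermark
```

Anything beyond this single-watermark pattern (late-arriving rows, deletes, schema drift) is where StreamSets earns its keep.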

Importing unstructured data

In Hadoop, data is uploaded to HDFS either directly through the CLI or via the WebHDFS HTTP interface. Lentiq provides an SFTP proxy service that allows direct ingestion with any SFTP-compatible client, so no dedicated Lentiq client is needed.

Find out more about this in the upload data getting started guide.

Managing data & permissions

In Hadoop, data is typically managed through the HDFS CLI. Lentiq, by contrast, provides a Data Management view that can be used to manage both structured (tables) and unstructured data.

Data and application permissions are allocated on a per-project, per-data-pool, and per-data-lake basis. A project resembles Hadoop's group concept, with multiple users belonging to it. A data pool aggregates multiple projects, and multiple data pools form a data lake.
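The scoping just described is a simple containment hierarchy. A minimal sketch, with entirely illustrative names, just to make the nesting concrete:

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class Project:
    name: str
    users: List[str] = field(default_factory=list)  # like a Hadoop group

@dataclass
class DataPool:
    name: str
    projects: List[Project] = field(default_factory=list)

@dataclass
class DataLake:
    name: str
    pools: List[DataPool] = field(default_factory=list)

# One lake -> pools -> projects -> users; permissions attach at each level.
lake = DataLake("acme", [
    DataPool("analytics", [Project("churn-model", ["ana", "bob"])]),
])
```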

Find out more about this in the Data Management User Guide.

Managing compute resources for processing

In Hadoop, tasks are submitted to YARN, which handles resource allocation as well as task isolation and resource-consumption enforcement. While YARN supports cgroup-based isolation and dynamic resource pools, Kubernetes's container isolation and resource management is far more sophisticated and flexible.
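Lentiq abstracts Kubernetes away, but as an illustration of the isolation primitives underneath, Kubernetes declares a resource envelope per container; every value and name in this fragment is illustrative:

```yaml
# Per-container resource envelope enforced by Kubernetes (via cgroups).
apiVersion: v1
kind: Pod
metadata:
  name: spark-worker        # illustrative name
spec:
  containers:
    - name: worker
      image: spark:2.4      # illustrative image
      resources:
        requests:           # guaranteed share, used by the scheduler
          cpu: "2"
          memory: 8Gi
        limits:             # hard cap enforced at runtime
          cpu: "4"
          memory: 16Gi
```

Requests drive bin-packing across nodes while limits cap runtime usage, which is the split YARN approximates with its resource pools.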

Managing compute resources for applications

In some Hadoop distributions, applications from the Hadoop ecosystem can be deployed through the distribution's UI. In Lentiq, the equivalent mechanisms (applications, notebooks, workflows, and reusable code blocks) share the same compute cluster, which improves resource-usage efficiency.

For more details, visit the Lentiq Applications page.

Copyright © 2019 Lentiq