Migrating from Hadoop
Migrating from Hadoop to a cloud-native solution is relatively straightforward, but there are a few fundamental differences. Not all elements of the Hadoop ecosystem need an equivalent in Lentiq, and, conversely, Hadoop does not include data-science-specific elements such as notebooks, reusable code blocks, model serving, and cloud provisioning.
| Functionality | Hadoop | Lentiq |
|---|---|---|
| Data processing engine | Spark/MapReduce | Spark, Ray |
| Data storage | HDFS | Object storage (via Lentiq's abstraction layer) |
| Resource manager | YARN | Kubernetes (abstracted by Lentiq) |
| SQL engine | Apache Hive, Impala | Presto |
| Queue | Apache Kafka | Apache Kafka |
| Import, ETL | Sqoop, Spark | Spark, StreamSets |
| Workflow scheduler | Oozie | Lentiq Workflow |
| UI | HUE, Ambari | Lentiq's UI |
| Command line interface | hdfs dfs CLI | Lentiq bdl CLI |
| Data protection | HDFS replication & erasure coding | Cloud provider's object storage durability SLA |
| Data locality | HDFS co-located with YARN | Compute nodes separate from storage |
| Authentication | Kerberos | API key & JWT |
| Authorization | LDAP + HDFS ACLs | Object storage ACLs, Lentiq permissions |
| Schema sharing | Hive Metastore | Lentiq data management |
| Data catalog | Apache Atlas | Lentiq Data Store |
| Scripting | Pig | Lentiq LambdaBook |
| REST API gateway | Apache Knox | Lentiq's firewall management |
| 3rd-party application provisioning | Hadoop distribution specific | Lentiq Applications |
| Federation support | HDFS Federation | Lentiq Interconnected Data Pools |
Of course, this is not an apples-to-apples comparison: Hadoop is a downloadable application stack, whereas Lentiq is a SaaS offering. Both, however, provide a secure, scalable data storage and processing environment, and hence serve the same function.
Performance & scalability considerations
Both Hadoop and Lentiq offer a similar performance profile for most workloads, thanks to increasing RAM availability and Apache Spark's in-memory processing capabilities. Specific use cases such as deep learning are much better supported on GPU-enabled Lentiq clusters than on Hadoop.
Hadoop's data locality, with its sophisticated task-to-data-node scheduling and short-circuit read mechanisms, enables better per-node performance, especially on large datasets. In Lentiq, data locality is sacrificed for the ability to scale the compute and storage clusters separately. This removes the need to over-provision compute or storage resources and enables comparable performance profiles at a lower cost.
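In practice, the main code-level consequence of this architecture is the storage URI: jobs read from object storage paths instead of HDFS paths. The bucket name and the `s3a://` scheme in the PySpark sketch below are illustrative assumptions; the actual URI comes from Lentiq's storage abstraction layer.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("locality-example").getOrCreate()

# On Hadoop, paths typically point at HDFS, and YARN schedules tasks
# close to the data blocks:
# df = spark.read.parquet("hdfs:///warehouse/events/")

# On Lentiq, data lives in object storage; the bucket name and s3a://
# scheme here are placeholders -- use the URI your data pool exposes.
df = spark.read.parquet("s3a://my-datapool-bucket/events/")
df.groupBy("event_type").count().show()
```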
Lentiq's ability to link multiple data pools together is similar in concept to HDFS Federation, but considerably more sophisticated. For performance and security reasons, create a separate data pool for each team working on its own data. Data pools can be provisioned within the same region to avoid cross-region traffic charges.
Importing structured data
In Hadoop, structured data was mostly imported via Sqoop and stored in sequence files. In Lentiq, Spark can connect via JDBC to virtually any data source, so a dedicated import tool is no longer needed. Also, because a notebook can easily be turned into a schedulable "task", Spark can be used not just to import the data but also to pre-process and analyze it.
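For example, a typical JDBC import in a notebook might look like the following sketch. The connection URL, credentials, and table name are placeholders, and the source database's JDBC driver must be available on the Spark classpath.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("jdbc-import").getOrCreate()

# Hostname, credentials, and table are placeholders for your source.
orders = (
    spark.read.format("jdbc")
    .option("url", "jdbc:postgresql://db.example.com:5432/sales")
    .option("dbtable", "public.orders")
    .option("user", "reader")
    .option("password", "secret")
    # Partitioning options parallelize the read across executors.
    .option("partitionColumn", "order_id")
    .option("lowerBound", "1")
    .option("upperBound", "1000000")
    .option("numPartitions", "8")
    .load()
)

# Pre-process in the same job, then persist for downstream analysis;
# the output bucket and path are illustrative.
orders.filter("status = 'COMPLETE'").write.mode("overwrite").parquet(
    "s3a://my-datapool-bucket/imports/orders/"
)
```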
More complicated import strategies, such as incremental imports, are better handled by the more sophisticated StreamSets application.
Importing unstructured data
In Hadoop, data is uploaded to HDFS either directly using the CLI or via the WebHDFS HTTP interface. Lentiq provides an SFTPproxy service that allows direct ingestion with any SFTP-compatible client, without the need for a dedicated tool.
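Because any SFTP client works, ingestion can also be scripted. The sketch below uses Python's paramiko library; the endpoint, credentials, and remote path are placeholder assumptions, so substitute the values shown in your data pool's settings.

```python
import paramiko

# Endpoint and credentials below are placeholders for your data pool's
# SFTPproxy settings.
transport = paramiko.Transport(("sftp.example-lentiq-endpoint.com", 22))
transport.connect(username="my-project", password="my-api-key")

# Upload a local file; the remote path is illustrative.
sftp = paramiko.SFTPClient.from_transport(transport)
sftp.put("local/events.csv", "/uploads/events.csv")
sftp.close()
transport.close()
```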
Find out more about this in the upload data getting started guide.
Managing data & permissions
In Hadoop, data is typically managed using the HDFS CLI. Lentiq, by contrast, provides a Data Management view that can be used to manage both structured (tables) and unstructured data.
Data and application permissions are allocated on a per-project, per-data-pool, and per-data-lake basis. A project resembles the group concept in Hadoop: multiple users can be part of it. A data pool aggregates multiple projects, and multiple data pools form a data lake.
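To make the containment relationships concrete, the sketch below models them as plain Python classes; the class names are purely illustrative and not part of any Lentiq API.

```python
from dataclasses import dataclass, field

# Purely illustrative model of the permission hierarchy described above;
# these classes are not a Lentiq API.
@dataclass
class Project:      # resembles a Hadoop group: a team of users
    name: str
    users: list[str] = field(default_factory=list)

@dataclass
class DataPool:     # aggregates multiple projects
    name: str
    projects: list[Project] = field(default_factory=list)

@dataclass
class DataLake:     # formed by multiple (interconnected) data pools
    name: str
    pools: list[DataPool] = field(default_factory=list)

lake = DataLake("acme", [DataPool("analytics", [Project("churn", ["ana", "bob"])])])
```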
Find out more about this in the Data Management User Guide.
Managing compute resources for processing
In Hadoop, tasks are submitted to YARN, which handles resource allocation as well as task isolation and resource consumption enforcement. While YARN supports cgroup-based isolation and dynamic resource pools, Kubernetes's container isolation and resource management are far more sophisticated and flexible.
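At the Spark level, this is driven by the usual executor settings, which on a Kubernetes-backed cluster translate into container resource requests for the executor pods. A minimal sketch follows; the exact knobs Lentiq exposes may differ, and the numbers are arbitrary examples.

```python
from pyspark.sql import SparkSession

# Illustrative resource settings; on a Kubernetes-backed cluster these
# become resource requests on the executor containers.
spark = (
    SparkSession.builder.appName("resource-managed-job")
    .config("spark.executor.instances", "4")
    .config("spark.executor.cores", "2")
    .config("spark.executor.memory", "4g")
    .getOrCreate()
)
```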
Managing compute resources for applications
In some Hadoop distributions, applications that are part of the Hadoop ecosystem can be deployed using the provided UI. The equivalent mechanisms in Lentiq (applications, notebooks, workflows, and reusable code blocks) share the same compute cluster, which improves resource usage efficiency.
For more details, see Lentiq Applications.