Migrating from Hadoop
Migrating from Hadoop to a cloud-native solution is relatively straightforward, but there are a few fundamental differences. Not all elements of the Hadoop ecosystem need an equivalent in Lentiq, and Hadoop does not include data-science-specific elements such as notebooks, reusable code blocks, model serving, and cloud provisioning.
| Functionality | Hadoop | Lentiq |
|---|---|---|
| Data processing engine | Spark/MapReduce | Spark, Ray |
| Data storage | HDFS | Object storage (via Lentiq's abstraction layer) |
| Resource manager | YARN | Kubernetes (abstracted by Lentiq) |
| SQL engine | Apache Hive, Impala | Presto |
| Queue | Apache Kafka | Apache Kafka |
| Import, ETL | Sqoop, Spark | Spark, StreamSets |
| Workflow scheduler | Oozie | Lentiq Workflow |
| UI | HUE, Ambari | Lentiq's UI |
| Command line interface | hdfs dfs CLI | Lentiq bdl CLI |
| Data protection | HDFS replication & erasure coding | Cloud provider's object storage durability SLA |
| Data locality | HDFS collocated with YARN | Compute nodes separate from storage |
| Authentication | Kerberos | API key & JWT |
| Authorization | LDAP + HDFS ACL | Object storage ACL, Lentiq permissions |
| Schema sharing | Hive Metastore | Lentiq data management |
| Data catalog | Apache Atlas | Lentiq Data Store |
| Scripting | Pig | Lentiq LambdaBook |
| REST API gateway | Apache Knox | Lentiq's firewall management |
| 3rd party application provisioning | Hadoop distribution specific | Lentiq Applications |
| Federation support | HDFS Federation | Lentiq Interconnected Data Pools |
Of course, this is not an apples-to-apples comparison: Hadoop is a downloadable application stack, whereas Lentiq is a SaaS offering. Both, however, provide a secure and scalable data storage and processing environment, and hence serve the same function.
Performance & Scalability considerations
Both Hadoop and Lentiq offer a similar performance profile for most datasets, owing to increasing RAM availability and the in-memory processing capabilities of Apache Spark. Specific use cases such as deep learning are much better supported in GPU-enabled Lentiq clusters than in Hadoop.
Hadoop's data locality, with its sophisticated task-to-data-node scheduling and short-circuit read mechanisms, enables better per-node performance, especially on large datasets. In Lentiq, data locality is sacrificed for the ability to scale the compute and storage clusters independently. This removes the need to over-provision compute or storage resources and enables comparable performance profiles at a lower cost.
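Because compute and storage are decoupled, repeated reads of the same dataset travel over the network to object storage. Spark's in-memory caching offsets much of that cost; a minimal PySpark sketch (the bucket path and column name are hypothetical):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("cache-demo").getOrCreate()

# The first read pulls the data over the network from object storage.
events = spark.read.parquet("s3a://example-data-pool/events/")  # hypothetical path

# cache() pins the dataset in executor memory, so later actions
# are served from RAM instead of remote storage.
events.cache()
events.count()  # materializes the cache

events.groupBy("day").count().show()  # runs against the in-memory copy
```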
Lentiq's ability to link multiple data pools together is similar in concept to HDFS federation, though considerably more sophisticated. For performance and security reasons, each team working on its own data should get a separate data pool. Data pools can be provisioned within the same region to avoid cross-region traffic charges.
Importing structured data
While data in Hadoop was mostly imported via Sqoop and stored in sequence files, in Lentiq Spark can connect via JDBC to virtually any data source, so a dedicated import tool is no longer needed. Also, since Lentiq makes it easy to turn a notebook into a schedulable "task", Spark can be used not only to import the data but also to pre-process and analyze it.
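As an illustration, assuming a reachable PostgreSQL database (the connection URL, credentials and paths below are placeholders, and the matching JDBC driver must be on Spark's classpath), a notebook cell could import and persist a table like this:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("jdbc-import").getOrCreate()

# Pull a table from an external relational database over JDBC.
orders = (
    spark.read.format("jdbc")
    .option("url", "jdbc:postgresql://db.example.com:5432/shop")  # placeholder URL
    .option("dbtable", "public.orders")
    .option("user", "reader")       # placeholder credentials
    .option("password", "secret")
    .load()
)

# Pre-process in the same job, then persist to object storage as Parquet.
recent = orders.filter("order_date >= date '2019-01-01'")
recent.write.mode("overwrite").parquet("s3a://example-data-pool/orders/")  # placeholder path
```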
More complicated import strategies, such as incremental imports, are better handled by the more specialized StreamSets application.
Importing unstructured data
In Hadoop, data is uploaded to HDFS either directly through the CLI or via the WebHDFS HTTP interface. Lentiq provides an SFTP proxy service that allows direct ingestion with any SFTP-compatible client, so no dedicated upload tool is needed.
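For illustration, a minimal sketch using the Python paramiko library (the endpoint, credentials and paths are placeholders; any command-line or graphical SFTP client works the same way):

```python
import paramiko

# Open a connection to the data pool's SFTP proxy endpoint (placeholder values).
transport = paramiko.Transport(("sftp.example-data-pool.example.com", 22))
transport.connect(username="project-user", password="project-password")

# Upload a local file straight into the data pool.
sftp = paramiko.SFTPClient.from_transport(transport)
sftp.put("local/clickstream.json", "/uploads/clickstream.json")

sftp.close()
transport.close()
```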
Find out more about this in the upload data getting started guide.
Managing data & permissions
In Hadoop, data is typically managed through the HDFS CLI. Lentiq instead provides a Data Management view that can be used to manage both structured (tables) and unstructured data.
Data & application permissions are allocated on a per-project, per-data-pool and per-data-lake basis. A project resembles Hadoop's group concept: multiple users can be members of it. A data pool aggregates multiple projects, and multiple data pools form a data lake.
Find out more about this in the Data Management User Guide.
Managing compute resources for processing
In Hadoop, tasks are submitted to YARN, which handles resource allocation as well as task isolation and resource-consumption enforcement. While YARN supports cgroup-based isolation and dynamic resource pools, Kubernetes's container isolation and resource management are far more sophisticated and flexible.
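To make this concrete, a Spark job's resource footprint is declared in its configuration and translated by Spark on Kubernetes into pod resource requests and limits, which the kubelet enforces via container isolation. A hedged sketch, assuming direct access to the session configuration (Lentiq may expose these settings through its own UI instead):

```python
from pyspark.sql import SparkSession

# Executor sizing: Spark on Kubernetes turns these settings into pod
# resource requests/limits that the scheduler uses for placement and
# the kubelet enforces through container isolation (cgroups).
spark = (
    SparkSession.builder
    .appName("resource-managed-job")
    .config("spark.executor.instances", "4")               # number of executor pods
    .config("spark.executor.memory", "4g")                 # memory per executor
    .config("spark.executor.cores", "2")                   # CPU request per executor
    .config("spark.kubernetes.executor.limit.cores", "2")  # hard CPU cap per pod
    .getOrCreate()
)
```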
Managing compute resources for applications
In some Hadoop distributions, applications that are part of the Hadoop ecosystem can be deployed through the provided UI. In Lentiq, the equivalent mechanisms (applications, notebooks, workflows and reusable code blocks) share the same compute cluster, which improves resource-usage efficiency.
For more details, see Lentiq Applications.