A data lake is a collection of data pools that belong to the same organization. The pools are managed independently, but can communicate and share data and notebooks with one another.
A data pool is a part of the data lake associated with an organization. It consists of a Kubernetes cluster that abstracts away the infrastructure layer and offers scalability, stability, and portability for data, applications, and code.
A data pool can be deployed with a specific cloud provider, in a specific region of that provider, and can collaborate with other data pools inside the same organization through the data sharing mechanism.
Data pools allow the administrator to distribute resources across data pool projects. In this way, budgets are administered and coordinated separately, and costs scale more predictably per individual project.
data pool project
A data pool project is an isolated pool of resources and data administered by the team that has access to the project. Each project has a quota set by the data pool administrator or project owner; from that quota, users with access to the project can consume memory and CPU resources for the applications needed for the application stack of their choice.
When the project quota is exceeded, no new applications can be created until some resources are freed or the quota is increased by the project manager or data pool administrator. This ensures the project cannot drive up additional costs by consuming unnecessary resources across various environments.
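The quota behavior above can be sketched in a few lines; the platform's actual admission logic is not described here, so `ProjectQuota` and `request_application` are hypothetical names used purely for illustration:

```python
# Illustrative sketch only: the real quota API is not public, so these
# names and fields are assumptions, not the product's interface.
from dataclasses import dataclass


@dataclass
class ProjectQuota:
    cpu_limit: float      # total CPU cores allocated to the project
    mem_limit_gb: float   # total memory (GiB) allocated to the project
    cpu_used: float = 0.0
    mem_used_gb: float = 0.0

    def request_application(self, cpu: float, mem_gb: float) -> bool:
        """Admit a new application only if it fits the remaining quota."""
        if (self.cpu_used + cpu > self.cpu_limit
                or self.mem_used_gb + mem_gb > self.mem_limit_gb):
            # Quota exceeded: free resources or request a larger quota.
            return False
        self.cpu_used += cpu
        self.mem_used_gb += mem_gb
        return True
```

For example, a project with a 4-core quota admits a 2-core application, but then rejects a 3-core one until resources are freed or the quota is raised.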
Data uploaded in a project is visible only in the context of the project. To make it visible across the entire organization, files need to be published.
In terms of storage, each data pool project has an object storage bucket associated with it, and all project data is isolated in that bucket. Only when data is shared is it replicated into a common object storage bucket as well, to allow read access from the entire data lake while keeping data replication to a minimum.
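As a toy model of this isolation (plain dicts stand in for object storage buckets, and `publish_file` is an assumed name, not the product's storage API), publishing replicates a single object from the project bucket into the common bucket:

```python
# Toy model: dicts stand in for object storage buckets.
project_bucket = {"sales/q1.parquet": b"raw rows"}
shared_bucket = {}  # common bucket, readable by the whole data lake


def publish_file(key: str) -> None:
    """Replicate a project object into the shared bucket on publish.
    Data is copied only at publish time, keeping replication minimal."""
    shared_bucket[key] = project_bucket[key]


publish_file("sales/q1.parquet")
```

Unpublished objects never leave the project bucket, so they stay visible only within the project.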
data pool administrator
The data pool administrator is the person who can create data pools inside an organization. The data pool administrator can also create data pool projects, assign a quota to each project, and assign one or more project administrators to each project.
The data pool administrator has permission to set data pool wide firewall rules that apply to all underlying data pool projects.
project administrator
The project administrator is the person responsible for overseeing resource consumption within a project. The project administrator or the data pool administrator can add more users to a data pool project. Added users can create applications, or scale clusters under their management, within the project resource quotas.
The project administrator has permission to set project wide firewall rules that apply to all applications running inside the project.
user
The user has access to the applications and data residing in the project. In addition, users can access data and notebooks that colleagues from other projects have shared with the rest of the organization.
The user can add new applications as long as the resource requirements configured for the application do not exceed the resource quota allocated to the project.
A directory is an object living inside an object storage bucket, be it the bucket associated with a specific project, or a shared bucket that is available in the entire organization.
A file is an object living inside an object storage bucket, be it the bucket associated with a specific project, or a shared bucket that is available in the entire organization.
A table is created from within Spark and is attached to the file from which it was created. Once created from Spark, tables are visible at the project level. When a file that has a table attached to it is published, the table is published as well and becomes visible across the organization.
An attachment is a detailed description of a particular object in object storage, be it a file or a directory. The attachment can be any type of file, and in our design it represents enriched documentation of the data stored in the data lake.
When a file is published, its attachments also become available to the rest of the organization.
Metadata represents additional information entered by users to describe the directories, files, tables, and columns stored through our applications.
Metadata information is represented by detailed descriptions, attachments (notebooks, or more standardized files) and tags that increase data explainability and help users understand the scope of a particular dataset.
At the project level, metadata is optional but recommended. When publishing, however, it becomes mandatory, in order to democratize access to data for the rest of the organization and ensure a high level of consumption afterwards.
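A publish-time metadata gate might look like the following sketch; the actual metadata schema is not specified here, so the minimal rule (a description plus at least one tag) and the name `can_publish` are assumptions:

```python
# Hedged sketch: the platform's metadata schema is an assumption here.
def can_publish(metadata: dict) -> bool:
    """At the project level metadata is optional; publishing requires it.
    This toy check demands a non-empty description and at least one tag."""
    return bool(metadata.get("description")) and bool(metadata.get("tags"))
```

A dataset with `{"description": "Quarterly sales", "tags": ["sales"]}` would pass the gate, while one with empty metadata would be rejected at publish time.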
Applications refers to the list of curated applications displayed in the left side menu, which users can order and manage. Applications are carefully tuned for the most frequent use cases, and users can tweak the CPU and memory allocation of each one.
Once selected and configured, applications are readily available in seconds. Once started, they can be scaled horizontally and vertically at any time, as well as scaled to zero to minimize resource waste when they are not needed.
Users added to a project by a project administrator can also be granted permission to publish. This setting has to be configured by the project administrator; to reduce the possibility of sharing sensitive information across the entire organization, we recommend that only a few users on a project have such a role.
When a published dataset is modified or a new version is published, all its subscribers are notified of the change.
All users that are part of a project have read access to all data that was published by their colleagues.
Users who prefer to change a shared dataset can create a copy of it in their own project, so that the published version is not modified.
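In the same toy bucket model as before (dicts stand in for buckets; `copy_to_project` is an assumed name, not the product's API), copying before editing keeps the published version intact:

```python
# Toy sketch: copy a shared dataset into the user's project bucket
# so that local edits never touch the published version.
shared_bucket = {"hr/employees.csv": b"name,dept\n"}
project_bucket = {}


def copy_to_project(key: str) -> None:
    """Clone a published object into the user's own project bucket."""
    project_bucket[key] = shared_bucket[key]


copy_to_project("hr/employees.csv")
project_bucket["hr/employees.csv"] += b"ana,eng\n"  # edit the local copy only
```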
Workflows are DAGs (directed acyclic graphs) of tasks that can be scheduled. A workflow can also be run once, for testing or debugging purposes. In a workflow, users can define complex data ingestion, processing, analytics, and machine learning pipelines. A workflow runs in a particular project and consumes hardware resources from that project.
Tasks are the building blocks of a workflow. They can be created from our Reusable Code Blocks, which can be viewed as task templates (with parameters and a Docker image that is built automatically). Each task has resource requirements that the user can configure to obtain the desired performance.
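The DAG structure can be sketched as follows; the task names, per-task resource fields, and the dict encoding are illustrative assumptions, not the product's workflow API. The standard-library `graphlib` module is used to derive a valid execution order from the dependencies:

```python
# Illustrative only: a workflow as a DAG of tasks with resource requirements.
from graphlib import TopologicalSorter

# Each task lists the tasks it depends on, plus its resource requirements.
workflow = {
    "ingest": {"deps": [], "cpu": 1, "mem_gb": 2},
    "clean":  {"deps": ["ingest"], "cpu": 2, "mem_gb": 4},
    "train":  {"deps": ["clean"], "cpu": 4, "mem_gb": 8},
    "report": {"deps": ["clean"], "cpu": 1, "mem_gb": 1},
}

# A valid schedule runs every task after all of its dependencies.
order = list(
    TopologicalSorter({t: spec["deps"] for t, spec in workflow.items()}).static_order()
)
```

Here `ingest` always runs before `clean`, which in turn runs before both `train` and `report`; each task's `cpu`/`mem_gb` request is drawn from the project quota while it runs.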
reusable code blocks
Reusable Code Blocks are execution units that package a specific piece of logic within a larger analytics or machine learning flow (such as training, ingestion, data anonymization, data deduplication, or model serving). They can be created from a Docker image provided by the user (which can package a custom application) or from a published notebook alongside its parameters. Once created, reusable code blocks are available to the rest of the data lake and can be linked together in various workflows.
Notebooks that are developed inside a project's Jupyter Notebook can be shared with the rest of the data lake through the publishing mechanism found inside the notebook interface. Once published, they can be used as a source for reusable code blocks.