Article | Cloud migration, Cloud native, Data analytics

Building a serverless data lake on AWS

A serverless data lake is a popular way of storing and analyzing data in a single repository and features autonomous maintenance and architectural flexibility for diverse kinds of data, which in turn accelerates integration with analytics engine and improves time to insights.

23 Feb 2021

This article was contributed to Inform by a member of TM Forum .

Build your cloud native data lake 30-40% faster

All DSPs require a high-quality data storage and analytics solution which offers more agility than conventional systems. A serverless data lake is a popular way of storing and analyzing data in a single repository. It features autonomous maintenance and architectural flexibility for diverse kinds of data, which in turn accelerates integration with analytics engine and improves time to insights.

Inefficient data lakes

DSP data lakes today do not seem to meet expectations due to reasons such as:

The data lake becoming a “data swamp”
Lack of business impact
Increasing complexities in data pipeline replication

Parameters to build an efficient AWS serverless data lake

Listed below are a few critical parameters which can help DSPs mitigate these challenges and implement a scalable and flexible data lake enabled by serverless technologies. 1. Data architecture workshop – A consultant facilitated a ‘business value first’ approach at the workshop where the DSP is guided towards architected principles including:

Identifying the highest priority business cases with strategic cost/benefit analysis and incorporating the elements into architecture and design.
Building flexibility into the design to address the changing needs of the business, IT, and other functions as requirements and use cases evolve.
Constructing a long-term architecture blueprint with serverless technology elements, integrations, performance and scalability.

Identifying and prioritizing the use cases that provide business units with maximum value from data lake is a critical success factor.

2. Interface control template – Interface control templates assist firms in integrating the numerous interoperating services that come together in a modern data platform. The templates capture the interactions between the services and how they work in tandem with each other. Having an interface control template is critical to establish an event-driven orchestration in the data lake as it provides a standard way of integration across the services. The pre-defined integration procedures further help in reducing any rework needed for data lake implementation. 3. Infrastructure as code (IaC) – The effort of building a pipeline can be reduced using event-driven, infrastructure-as-code applications for diverse source data systems – NOC, CMS, enterprise data. This is because IaC makes use of reusable templates. IaC is a paradigm for provisioning and managing serverless applications through a series of software using a cloud configuration orchestrator. Using the orchestrator helps to spawn new pipelines with the necessary Amazon Web Services (AWS) resources. Instead of recreating the entire pipeline for minor changes, resources can be effectively spawned by changing a few configuration files in the orchestrator.

Fig 1: Infrastructure as Code (IaC) to execute data pipeline engineering sprints

Infrastructure as Code (IaC) techniques can make pipeline build & replication process 60% faster.

4. Data cataloging approach - DSPs today are moving towards more generalized data lakes where raw data is gathered from various sources and stored in its native format. Due to a lack of governance, descriptive metadata, and a mechanism to maintain it, the data lake becomes a convoluted data swamp. To avoid data swamps, DSPs could follow the steps below: 5. Event-driven orchestration - DSPs usually have multiple data pipelines and data sources for different use cases. Managing the complexity of multiple pipelines is key to managing data lakes. Event-driven orchestration using AWS step function enables serverless queries and serverless polling, as it has features to poll for the ‘extract, transform, load’ jobs and triggers any necessary next steps upon completion. The step functions orchestrate the sequence in which the job needs to be run, and assist in automating the end-to-end workflow of the data pipeline. Event-driven orchestration in a serverless data lake ensures end-to-end automation of data flows and data transformations, which improves the total data processing time by 40%.

Metadata: Create metadata using a ‘glue crawler’ which helps to populate the AWS Glue Data Catalog and crawl multiple data stores in a single run.
Data catalog: Create a data catalog of where the data lies and the path traveled by the data. This is used for lineage tracking.
Single hierarchy structure: Due to multiple handoffs and stakeholders, hierarchies in the data lake end up inconsistent and confused. Hence, ensure that a single hierarchy and set naming conventions are used across the data sources to reduce complexity. As Amazon S3 is a global namespace, we recommend firms explicitly choose specific names or locations for the datasets to be stored and transformed.
Data governance: Since data from different sources are brought to a centralized data lake, we recommend that any identity and access management procedures hold stringent access rules.

Fig 2. Event-driven orchestration to automate data transfer and transformation

Conclusion Serverless technologies built natively on cloud, benefit from cloud-scale innovation and provide flexible architectures for a variety of use cases beyond just data lakes. By implementing the best practices elaborated in this insight, a leading DSP in Latin America was able to realize benefits such as:

Accelerated time to build a fully operational cloud native data lake by 30-40%
Quicker launch of analytics and machine leaning use cases, with a 50-60% reduction in machine learning integration efforts
Significantly less costs than infrastructure as a service or on-premise data lake infrastructure