Building a serverless data lake on AWS
This article was contributed to Inform by a member of TM Forum.
All DSPs require a high-quality data storage and analytics solution which offers more agility than conventional systems. A serverless data lake is a popular way of storing and analyzing data in a single repository. It features autonomous maintenance and architectural flexibility for diverse kinds of data, which in turn accelerates integration with analytics engines and improves time to insights.
Today, however, many DSP data lakes fail to meet expectations for reasons such as:
Listed below are a few critical parameters which can help DSPs mitigate these challenges and implement a scalable, flexible data lake enabled by serverless technologies.

1. Data architecture workshop – In a consultant-facilitated workshop, a 'business value first' approach guides the DSP towards architecture principles including:
Identifying and prioritizing the use cases that deliver maximum business value from the data lake is a critical success factor.
2. Interface control template – Interface control templates help firms integrate the many interoperating services that make up a modern data platform. The templates capture how the services interact and work in tandem with each other. An interface control template is critical to establishing event-driven orchestration in the data lake, as it provides a standard way of integrating across services. The pre-defined integration procedures further reduce the rework needed during data lake implementation.

3. Infrastructure as code (IaC) – IaC is a paradigm for provisioning and managing serverless applications through configuration files processed by a cloud configuration orchestrator. Because IaC relies on reusable templates, it reduces the effort of building event-driven pipelines for diverse source data systems – NOC, CMS, enterprise data. The orchestrator can spawn new pipelines with the necessary Amazon Web Services (AWS) resources; instead of recreating an entire pipeline for minor changes, resources can be provisioned by changing a few configuration files.
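A minimal sketch of the reusable-template idea behind IaC: one small function expands a source-system entry from a configuration dictionary into the set of AWS resource names a pipeline template would provision. All names here (the source systems, the bucket/job naming scheme) are hypothetical and for illustration only; a real setup would feed such a config into an orchestrator such as AWS CloudFormation.

```python
# Illustrative sketch: generating per-source pipeline resource names from a
# small configuration dictionary, in the spirit of reusable IaC templates.
# Naming scheme and source-system names are hypothetical assumptions.

def render_pipeline(source: str, env: str = "prod") -> dict:
    """Expand one source-system entry into the resource names a pipeline
    template would provision (S3 landing zone, Glue ETL job, Step
    Functions state machine)."""
    prefix = f"{env}-{source.lower()}"
    return {
        "landing_bucket": f"{prefix}-landing",
        "glue_job": f"{prefix}-etl",
        "state_machine": f"{prefix}-orchestrator",
    }

# Spawning pipelines for diverse source systems (NOC, CMS, enterprise data)
# then becomes a configuration change rather than a rebuild:
sources = ["NOC", "CMS", "EnterpriseData"]
pipelines = {s: render_pipeline(s) for s in sources}
```

Adding a new source system, or retargeting an environment, is then a one-line configuration change rather than a hand-built pipeline.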
Infrastructure as code (IaC) techniques can make the pipeline build and replication process 60% faster.
4. Data cataloging approach – DSPs today are moving towards more generalized data lakes where raw data is gathered from various sources and stored in its native format. Without governance, descriptive metadata, and a mechanism to maintain it, the data lake degenerates into a convoluted data swamp. To avoid data swamps, DSPs could follow the steps below:

5. Event-driven orchestration – DSPs usually run multiple data pipelines and data sources for different use cases, and managing the complexity of these pipelines is key to managing the data lake. Event-driven orchestration using AWS Step Functions enables serverless queries and serverless polling: it can poll 'extract, transform, load' (ETL) jobs and trigger the necessary next steps upon completion. Step Functions orchestrates the sequence in which jobs need to run and helps automate the end-to-end workflow of the data pipeline. Event-driven orchestration in a serverless data lake ensures end-to-end automation of data flows and data transformations, which improves total data processing time by 40%.
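The poll-and-trigger pattern described above can be sketched as an Amazon States Language (ASL) definition: start a Glue ETL job, wait, check its status, and either loop back or trigger the downstream step on success. The job name is a hypothetical placeholder, and a real workflow might instead use Step Functions' synchronous `glue:startJobRun.sync` integration; this sketch spells out the polling loop the article describes.

```python
import json

# Illustrative sketch of an Amazon States Language state machine for
# event-driven orchestration: start a Glue ETL job, poll its status,
# and trigger the next step on completion. The job name is a
# hypothetical placeholder.
etl_workflow = {
    "Comment": "Poll an ETL job and trigger the next step on success",
    "StartAt": "StartEtlJob",
    "States": {
        "StartEtlJob": {
            "Type": "Task",
            "Resource": "arn:aws:states:::glue:startJobRun",
            "Parameters": {"JobName": "prod-noc-etl"},
            "Next": "WaitForJob",
        },
        # Wait between polls so the workflow stays serverless and cheap.
        "WaitForJob": {"Type": "Wait", "Seconds": 60, "Next": "GetJobStatus"},
        "GetJobStatus": {
            "Type": "Task",
            "Resource": "arn:aws:states:::aws-sdk:glue:getJobRun",
            "Parameters": {"JobName": "prod-noc-etl", "RunId.$": "$.JobRunId"},
            "Next": "EvaluateJobStatus",
        },
        # Branch: loop back to the wait state until the job has succeeded.
        "EvaluateJobStatus": {
            "Type": "Choice",
            "Choices": [
                {
                    "Variable": "$.JobRun.JobRunState",
                    "StringEquals": "SUCCEEDED",
                    "Next": "TriggerNextStep",
                }
            ],
            "Default": "WaitForJob",
        },
        "TriggerNextStep": {"Type": "Succeed"},
    },
}

definition_json = json.dumps(etl_workflow, indent=2)
```

Because each state transition is event-driven, no compute sits idle between ETL stages; the state machine itself encodes the sequence in which the jobs must run.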
Conclusion

Serverless technologies, built natively on the cloud, benefit from cloud-scale innovation and provide flexible architectures for a variety of use cases beyond data lakes. By implementing the best practices elaborated in this insight, a leading DSP in Latin America was able to realize benefits such as: