On February 24, Chandan Vijay, Senior Director of Products at Cognizant, presented a Simplilearn expert webinar on the emergence of data lakes and their impact on the digital economy. As a 20-year veteran of the technology industry, Chandan is an outcome-driven digital and technology leader in data and analytics.
What is a Data Lake?
Chandan described data lakes along two dimensions:
1. Data at Scale
A data lake is a data repository that hosts a large amount of data and that preferably leverages cloud and big data technologies.
2. Variety of Data
A data lake hosts data that is structured, semi-structured, and unstructured.
Chandan also listed what differentiates a data lake from other database technologies such as data warehouses or data marts:
Unprocessed vs. processed. A data lake can store unprocessed data for later processing as needed, while other databases may require pre-processing of the raw data.
All types vs. structured data. A data lake can store any format of data, while other databases may impose a requirement for data to be structured within the database.
Lowest level of granularity vs. summary or aggregated. A data lake is suited to storing the most granular data in close to its raw form, while other databases may want the quantity of data to be reduced through summarization or aggregation.
Low latency vs. high latency. A data lake accepts data as it is created, while other database types may impose a processing schedule for accepting processed data.
Big data analytics vs. traditional reporting. A data lake can support the widest variety of data uses, including big data analytics, real-time analytics, and machine learning, while other databases may be optimized for dashboards, metrics, and other traditional reporting.
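The "unprocessed vs. processed" and "all types vs. structured" contrasts above come down to schema-on-read: a data lake stores records as-is and interprets them only when they are consumed. Here is a minimal, hypothetical sketch of that idea using only the Python standard library; the records and the format-detection rules are invented for illustration.

```python
import csv
import io
import json

# A data lake accepts raw records as-is; no schema is enforced on write.
# Here the "lake" is just a list of raw strings in different formats.
raw_lake = [
    '{"order_id": 1, "amount": 250.0}',          # semi-structured JSON
    'order_id,amount\n2,99.5',                   # structured CSV
    'free-text note: customer asked for refund'  # unstructured text
]

def read_with_schema(record: str) -> dict:
    """Schema-on-read: interpret each raw record only at consumption time."""
    if record.lstrip().startswith('{'):
        return json.loads(record)                         # parse as JSON
    if '\n' in record and ',' in record.split('\n')[0]:
        return next(csv.DictReader(io.StringIO(record)))  # parse as CSV
    return {'note': record}                               # keep text as-is

parsed = [read_with_schema(r) for r in raw_lake]
```

A schema-enforcing database would reject the third record outright; the lake keeps it in raw form until some consumer decides how to interpret it.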
A data lake encompasses data storage and processing technologies, processes for moving data into and out of the lake, and policies for how data is handled and protected in the lake. Figure 1 summarizes these key elements.
Figure 1: Elements of a Data Lake
The Lifecycle of Data in a Data Lake
Data in a data lake has a defined lifecycle, going through processes to enter the lake and other processes to be accessed for use. The sequence of processes is:
1. Creating the inbound pathway for data, whether from original data sources or other systems.
2. Conditioning the data and introducing it into the lake, which often includes mirroring data from other systems and streaming real-time inbound data.
3. Combining data from various sources into a meaningful storage scheme.
4. Making the data available in a form that each data-consuming application can use.
5. Transferring data in its transformed state from the data lake to the data-consuming applications.
6. Delivering the data to its point of use.
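The lifecycle above can be sketched as a simple pipeline. This is a toy Python illustration; the function names, stages, and data are invented stand-ins, not any particular data lake product's API.

```python
# Hypothetical sketch of the data-lake lifecycle described above.
# Each stage mirrors a step in the sequence; the bodies are toy stand-ins.

def ingest(sources):
    """Create the inbound pathway and condition incoming data."""
    return [{'source': name, 'value': value} for name, value in sources]

def combine(records):
    """Combine data from various sources into one storage scheme."""
    lake = {}
    for rec in records:
        lake.setdefault(rec['source'], []).append(rec['value'])
    return lake

def transform_for(lake, consumer):
    """Make the data available in a form the consuming application can use."""
    if consumer == 'reporting':
        return {src: sum(values) for src, values in lake.items()}
    return lake  # other consumers may take the granular data as-is

def deliver(view):
    """Transfer the transformed data to its point of use."""
    return view

sources = [('sales', 100), ('sales', 250), ('web', 30)]
report = deliver(transform_for(combine(ingest(sources)), 'reporting'))
```

Note that the granular records stay in the lake; only the reporting consumer receives an aggregated view, which reflects the "lowest level of granularity vs. summary" distinction above.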
The processing in each step can be done in either batch mode or streaming mode. Batch processing deals with large batches of data and happens on a defined schedule, either at set time intervals or at set trigger events. It thus has high latency, with delays sometimes measured in hours or days. Stream processing happens in real time on micro-batches of data, with low latency measured in milliseconds to seconds. Because data lakes are configured to deal with highly granular data, they are well-suited for stream processing.
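The batch/stream contrast can be made concrete with a small sketch. This is an assumption-laden toy, not a real processing engine: batch mode produces one result after all data has accumulated, while stream mode emits partial results after each micro-batch.

```python
from typing import Iterable, Iterator, List

def batch_process(records: List[int]) -> int:
    """Batch mode: one pass over all accumulated data, run on a schedule."""
    return sum(records)

def stream_process(records: Iterable[int],
                   micro_batch_size: int = 2) -> Iterator[int]:
    """Stream mode: emit a running total after each small micro-batch."""
    total, buffer = 0, []
    for rec in records:
        buffer.append(rec)
        if len(buffer) == micro_batch_size:
            total += sum(buffer)
            buffer.clear()
            yield total  # low-latency partial result
    if buffer:           # flush any leftover records
        yield total + sum(buffer)

data = [5, 10, 15, 20, 25]
batch_result = batch_process(data)           # one result, after all data arrives
stream_results = list(stream_process(data))  # partial results as data arrives
```

Both modes converge on the same final answer; the difference is that the streaming consumer sees intermediate results long before the batch job would have run at all.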
Digital Transformation Through Data Lakes
The traditional view of databases has been to analyze an existing business process, determine what data it uses and how it uses that data, and design and build a database, data warehouse, or data mart to support that process. This focus on the business process's operational efficiency tends to lock the organization into the process and make it difficult to evolve or replace the process as business needs change, since the very structure of the organization's data reflects the existing process.
By contrast, a data lake manages the organization's data independently of the processes that use that data. A process imposes only two requirements: that the data lake contain the data the process needs, and that a transformation exist to make that data compatible with the process. At the same time, the data lake doesn't impose constraints or restrictions on any new processes that can be built to use the organization's data. In the current climate of rapid digital transformation driven by technology advances and the social changes imposed by the COVID-19 pandemic (work from home, social distancing, non-contact product delivery), this flexibility is a big advantage for organizations that need to be agile in adapting their processes to new requirements.
With a data lake in place, the data becomes available to a wider range of the organization’s internal and external data consumers. What’s even more interesting is that this expanded range means that different data consumers may end up communicating with each other about what data each uses and how, and this can create new opportunities for using the organization’s data in unexpected and highly innovative ways.
Analytics Through Data Lakes
Data analytics generally falls into four broad categories:
- Descriptive: “where”
- Diagnostic: “why”
- Predictive: “what and when”
- Prescriptive: “what, when, and simulation”
Descriptive analytics looks at where in the organization things have happened. It is traditional reporting of events and facts, and traditional data management methods are well-suited to this task.
Diagnostic analytics adds a layer of meaning by looking at why certain events or results happened. Data lakes help with this type of analytics because the data from disparate systems or processes are together in one place for analysis. It’s therefore much easier to explore connections between different events and results to model explanations for the results of interest.
Predictive analytics can also exploit the deep history stored in a data lake to build better models of how businesses operate and what results they achieve. The data lake's granularity and length of history allow you to test predictive models more extensively against historical data to create the model that best fits the observed data.
Prescriptive analytics takes predictive analytics to another level. Using predictive analytics in a "what if" exploration, you can create simulations to examine the effects of different policy or procedural changes. For example, you can draw on your historical data to build a prescriptive model of accounts payable to find the optimal number of days to pay invoices.
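The accounts-payable example can be illustrated with a tiny "what if" simulation. Everything here is an invented assumption for illustration: the discount window, holding benefit, penalty rates, and invoice amount are toy numbers, not derived from any real historical data.

```python
# Hypothetical "what if" simulation for accounts payable, echoing the
# prescriptive-analytics example above. All rates below are invented.

def cost_of_paying_on_day(day: int, invoice: float) -> float:
    """Toy cost model: a 2% early-payment discount applies through day 10,
    holding cash earns a small daily benefit, and paying after day 30
    incurs a 5% late penalty."""
    discount = 0.02 * invoice if day <= 10 else 0.0
    holding_benefit = 0.0008 * day * invoice
    penalty = 0.05 * invoice if day > 30 else 0.0
    return invoice - discount - holding_benefit + penalty

def best_payment_day(invoice: float, horizon: int = 45) -> int:
    """What-if exploration: simulate every candidate payment day and keep
    the one with the lowest net cost."""
    return min(range(horizon + 1),
               key=lambda d: cost_of_paying_on_day(d, invoice))

optimal_day = best_payment_day(1000.0)
```

Under these toy assumptions the simulation lands on the last day of the discount window, balancing the discount against the benefit of holding cash longer. A real prescriptive model would fit the cost function to the granular payment history stored in the lake rather than hard-coding it.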
Get broad exposure to key technologies and skills used in data analytics and data science, including statistics with the Post Graduate Program in Data Analytics.
The Audience’s Questions - and Yours
Chandan took many questions from the webinar’s live audience. You can see the entire event, including the Q&A, in the video above.
Simplilearn offers many courses and programs in data science to take your career to new heights. Data lakes are just one of the data engineering topics included in the Post Graduate Program in Data Engineering with Purdue University. If your interest in data science goes to data architecture and advanced applications in AI and machine learning, you should consider the Post Graduate Program in Data Science with Purdue University.