What is a Data Lake

Data Lakes: A place for all your information

Sep 06, 2023

Companies are collecting more and more data every day, which comes in all formats and is not necessarily well structured. To store these vast amounts of data, companies are building data lakes. In this article, let's understand what data lakes are.

Before we get to the concept of a data lake, let’s first make sure we understand a few concepts about how companies store data.

Production Database

For any application, such as accounting or HR software, the core data is stored in a database. This is where the essential data lives: user profiles, customer data, orders, payment transactions, inventory, and so on. As people use the software, the application reads and writes this data through simple operations. This database is the lifeline of the software and is referred to as a production database.

Data Warehouse

Most companies need some form of reporting or analytical layer on top of this data. Sometimes reports are quite simple. For example, management wants to see a report of all the orders that were shipped the previous day. Such reports are easy to generate and can be created by running simple SQL queries on the production database.
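As a rough illustration of such a simple report, here is a minimal sketch using Python's built-in sqlite3 module. The `orders` table, its columns, and the sample rows are all hypothetical; a real production database would have its own schema and would usually run on a server database rather than SQLite.

```python
import sqlite3
from datetime import date, timedelta


def shipped_on(conn, day):
    """Return (order_id, customer) rows for orders shipped on the given day."""
    return conn.execute(
        "SELECT order_id, customer FROM orders "
        "WHERE shipped_date = ? ORDER BY order_id",
        (day.isoformat(),),
    ).fetchall()


# Hypothetical stand-in for a production database (in-memory, sample rows only).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE orders (order_id INTEGER PRIMARY KEY, customer TEXT, shipped_date TEXT)"
)
conn.executemany(
    "INSERT INTO orders VALUES (?, ?, ?)",
    [
        (1, "Asha", "2023-09-05"),
        (2, "Ben", "2023-09-05"),
        (3, "Chen", "2023-09-04"),
    ],
)

# Fixed "today" so the example is reproducible.
today = date(2023, 9, 6)
report = shipped_on(conn, today - timedelta(days=1))
print(report)
```

The query is a single filter on one table, which is why this kind of report can run directly against the production database.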

More often, companies need much more complex reporting, with data coming from multiple sources. For example, a company's sales data may come from an eCommerce platform or third-party retailers, its marketing data may sit in marketing platforms like Google Ads and Mailchimp, and its payment data may come from a payments processor such as Stripe.

In such a scenario, companies create a special place for data, called a data warehouse, where data is aggregated from all these sources, and stored specifically for analytical purposes.

In a data warehouse, companies can store vast amounts of historical data on which they can perform all kinds of analytics, such as trend analysis and forecasting. Unlike production databases, data warehouses are optimized for complex queries and data analysis. Data warehouses are a key component of Business Intelligence (BI). They provide the processed and organized data that BI tools need to create reports, dashboards, and visualizations.
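The idea of pulling several sources into one analytical store can be sketched as follows. This is only an illustration under stated assumptions: the three extracts, the `fact_*` table names, and the figures are made up, and an in-memory SQLite database stands in for a real warehouse product.

```python
import sqlite3

# Hypothetical daily extracts from three separate systems: (day, amount).
sales = [("2023-09-05", 120.0), ("2023-09-05", 80.0), ("2023-09-06", 50.0)]
ad_spend = [("2023-09-05", 40.0), ("2023-09-06", 10.0)]
payments = [("2023-09-05", 195.0), ("2023-09-06", 50.0)]

# In-memory SQLite plays the role of the warehouse in this sketch.
warehouse = sqlite3.connect(":memory:")
warehouse.executescript(
    """
    CREATE TABLE fact_sales (day TEXT, amount REAL);
    CREATE TABLE fact_ad_spend (day TEXT, amount REAL);
    CREATE TABLE fact_payments (day TEXT, amount REAL);
    """
)
warehouse.executemany("INSERT INTO fact_sales VALUES (?, ?)", sales)
warehouse.executemany("INSERT INTO fact_ad_spend VALUES (?, ?)", ad_spend)
warehouse.executemany("INSERT INTO fact_payments VALUES (?, ?)", payments)

# One analytical query that spans all three sources.
daily = warehouse.execute(
    """
    SELECT s.day, s.revenue, a.spend, p.collected
    FROM (SELECT day, SUM(amount) AS revenue FROM fact_sales GROUP BY day) AS s
    JOIN (SELECT day, SUM(amount) AS spend FROM fact_ad_spend GROUP BY day) AS a
      ON a.day = s.day
    JOIN (SELECT day, SUM(amount) AS collected FROM fact_payments GROUP BY day) AS p
      ON p.day = s.day
    ORDER BY s.day
    """
).fetchall()

for row in daily:
    print(row)
```

Once the data sits side by side, a question like "how did ad spend compare with revenue each day" becomes a single query instead of three separate exports.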

So, a business has a production database (which handles its day-to-day operational needs) as well as a data warehouse (which is structured and used for reporting and analysis). What comes next? A data lake, you say?

A data lake is like a vast reservoir where you can store all kinds of data, structured or unstructured, for long-term use. Not all the data is used immediately; some is stored for special cases or future analysis.

Why have a Data Lake?

Companies generate two broad types of data: structured (like user data, order details, etc.) and unstructured or loosely structured (log data, website views and clicks, product usage data, etc.). A data lake can hold both types, making it a versatile tool in a company's data stack.

A data lake is a large-scale storage repository that holds raw data in its native format until it is needed. The data is simply stored in its raw, unprocessed form, without any transformation or cleansing. Only when specific data is needed is it retrieved, processed, and analysed. To keep that raw data searchable and usable, a data lake needs sound data governance: data is catalogued and tagged, and metadata is maintained.
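To make "store it raw, but keep it catalogued" concrete, here is a minimal sketch that uses a local temporary folder as a stand-in for object storage. The folder layout, the `catalog.jsonl` file, and the helper names `land_raw` and `find_by_tag` are assumptions made for illustration, not a standard.

```python
import json
import tempfile
from pathlib import Path


def land_raw(lake_root, source, day, filename, payload, tags):
    """Write the bytes unchanged and append a catalog entry describing them."""
    target = lake_root / "raw" / source / day / filename
    target.parent.mkdir(parents=True, exist_ok=True)
    target.write_bytes(payload)  # no parsing, cleansing, or transformation
    entry = {
        "path": target.relative_to(lake_root).as_posix(),
        "source": source,
        "day": day,
        "tags": tags,
        "bytes": len(payload),
    }
    with (lake_root / "catalog.jsonl").open("a", encoding="utf-8") as f:
        f.write(json.dumps(entry) + "\n")
    return entry


def find_by_tag(lake_root, tag):
    """Look up catalog entries by tag without opening the raw files."""
    with (lake_root / "catalog.jsonl").open(encoding="utf-8") as f:
        return [entry for entry in map(json.loads, f) if tag in entry["tags"]]


# A temporary directory plays the role of the lake in this sketch.
lake = Path(tempfile.mkdtemp())
click_log = b"GET /home 200\nGET /cart 200\n"
land_raw(lake, "web", "2023-09-05", "clicks.log", click_log, ["clickstream"])
land_raw(lake, "orders", "2023-09-05", "orders.csv", b"id,total\n1,120\n", ["sales"])

hits = find_by_tag(lake, "clickstream")
print(hits)
```

The raw files are never modified; the catalog is what lets someone find the right file months later without knowing the folder layout.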

Data Lakes vs. Data Warehouses

Data Lakes are often compared to Data Warehouses. A warehouse is ideal for structured data with predefined schemas, while a lake suits raw or unstructured data whose structure is not fixed in advance. This makes Data Lakes a good fit for data that is kept for special cases or future analysis rather than for day-to-day reporting.

The beauty of Data Lakes is the 'Schema-on-Read' concept. This means you can just throw in data without defining its structure beforehand. When you need to use it, that's when you structure it. This grants flexibility but also requires good data hygiene.
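Schema-on-read can be sketched in a few lines. In this illustration the raw events, the field names, and the `read_with_schema` helper are all hypothetical; the point is only that the structure is chosen by the reader at query time, not enforced when the data is written.

```python
import json

# Raw lines as they might sit in the lake: inconsistent fields, one bad record.
raw_lines = [
    '{"user": "u1", "event": "click", "ts": "2023-09-05T10:00:00", "price": "19.99"}',
    '{"user": "u2", "event": "view", "ts": "2023-09-05T10:01:00"}',
    "not json at all",
]

# Schema chosen at read time: field name -> (converter, default if missing).
event_schema = {
    "user": (str, None),
    "event": (str, None),
    "price": (float, 0.0),
}


def read_with_schema(lines, schema):
    """Apply a schema while reading; keep unparseable lines aside for review."""
    records, rejected = [], []
    for line in lines:
        try:
            doc = json.loads(line)
        except json.JSONDecodeError:
            rejected.append(line)
            continue
        if not isinstance(doc, dict):
            rejected.append(line)
            continue
        records.append(
            {
                field: convert(doc[field]) if field in doc else default
                for field, (convert, default) in schema.items()
            }
        )
    return records, rejected


records, rejected = read_with_schema(raw_lines, event_schema)
print(records)
print(rejected)
```

A different analysis could read the same raw lines with a different schema, for example one that keeps `ts` and ignores `price`. The rejected list is where the "good data hygiene" requirement shows up: nothing stopped the bad line from being written, so the reader has to deal with it.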

Data Lakes and Data Warehouses often work together. Companies typically start with a warehouse and then invest in a lake as their unstructured data grows. So it's not about choosing between them but about using each for what it does best.

Use Cases for Data Lakes

The huge amounts of data stored in data lakes can allow a company to perform a variety of analyses to improve its business. Some areas include big data analytics, artificial intelligence and machine learning, and even real-time analytics. Let’s look at some examples:

An e-commerce company can store and analyze massive volumes of transactional data, customer behaviour, product reviews, and clickstream data. This can help them identify sales trends, peak shopping periods, and customer preferences.
A hospital can use a data lake to store a variety of data such as medical records, lab reports, imaging data, and patient histories. It can train machine learning models on this data, which can then help predict disease patterns, support diagnostics, and personalize patient treatment plans.
Financial institutions can use data lakes to monitor and analyze transactions in real time. This can help them identify fraud as it happens and safeguard their customers.

Data Lake Solutions

Companies create data lakes using a variety of solutions. The popular choices for Data Lakes are often the native object storage products of major cloud providers, like AWS’s S3. Some of the solutions are mentioned below:

Amazon Web Services (AWS) offers S3 coupled with Lake Formation
Microsoft Azure Data Lake Storage
Google Cloud Storage
Cloudera Data Platform for organizations seeking hybrid solutions
Databricks' Delta Lake stands out with its approach to merging the strengths of data warehouses and lakes

To conclude, data lakes are a powerful, flexible solution for storing vast amounts of raw data that can be used for varied analytical tasks. They are key to unlocking valuable insights and making informed decisions in today’s data-driven business world.

Modern Software
