Azure Data Lake includes many of the tools that developers, data scientists, and analysts need to manage data of all sizes, shapes, and speeds and to run complex processing and analytics across platforms and languages. It removes the complexity of ingesting and storing all of your data and provides accelerators for batch, streaming, and interactive analytics. Azure Data Lake works with existing IT investments in identity, management, and security for simplified data management and governance. Data lakes created with Azure integrate seamlessly with operational stores and data warehouses, so you can extend your current data applications to them.
Azure Data Lake addresses many of the challenges that prevent you from maximizing the value of your data assets with a service that’s ready to meet your current and future business needs.
New to Azure Data Lake is the Data Lake Management Portal, which lets you set up, manage, and consume your Azure data lakes in one easy-to-use interface. The portal includes a complete data lake catalog and a dashboard for quickly navigating your data lake. It also provides access to the existing relational, log, and analytics tools, so you can switch between them while working with your data lake.
With Azure Data Lake, it’s easier to:
- Create, manage, and consume data lakes with hundreds of gigabytes, terabytes, or petabytes of data (available to be streamed to Azure HDInsight for Hadoop-based analytics). You can create hundreds of data lakes in a single day by integrating with Azure Blob Storage and Azure Table Storage.
- Access your data lakes from various mobile devices and other environments (note that connecting a mobile device to Azure Data Lake should be an optional step).
- Set up Cloud Data Lakes and Azure Data Lake Storage to work with Azure Blob Storage and Azure Table Storage. The migration is intended to be simpler than with comparable services on other clouds, such as AWS.
- Make REST API calls from the Data Lake Management Portal to your Azure HDInsight cluster. The REST API also lets you integrate data lakes with Azure HDInsight so that a single Azure Data Lake serves Hadoop-based analytics across the organization.
- Extend on-premises data services so that they ingest data seamlessly into Azure Data Lake through a publish/subscribe messaging service such as Azure Event Hubs.
- Use Microsoft machine learning services with the data in your Azure Data Lake.
- Interoperate with legacy on-premises Hadoop clusters to extend the capabilities of data lakes, including integration with the Hortonworks Hadoop distribution, so you can run interactive machine learning, collaborative analytics, and high-speed ingestion.
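As a rough illustration of the REST calls to an HDInsight cluster mentioned in the list above, the sketch below only builds the URL of a WebHCat (Templeton) endpoint, which HDInsight clusters expose under `/templeton/v1/`. The cluster name `mycluster` and the helper `webhcat_url` are hypothetical, and a real request would also need cluster credentials, which are not shown.

```python
from urllib.parse import quote


def webhcat_url(cluster_name: str, resource: str = "status") -> str:
    """Build a WebHCat (Templeton) REST URL for an HDInsight cluster.

    `cluster_name` is a placeholder; nothing is sent over the network here.
    """
    if not cluster_name:
        raise ValueError("cluster_name must not be empty")
    host = f"{quote(cluster_name, safe='')}.azurehdinsight.net"
    # Strip a leading slash so callers can pass "jobs" or "/jobs".
    return f"https://{host}/templeton/v1/{resource.lstrip('/')}"


status_url = webhcat_url("mycluster")
jobs_url = webhcat_url("mycluster", "/jobs")
print(status_url)
print(jobs_url)
```

Keeping URL construction in one small function makes it easy to reuse from whichever HTTP client you prefer.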
Azure Blob Storage and Azure Table Storage
Azure Blob Storage is a scalable object store that lets you create and store files of any size, structure, and complexity in the cloud. It offers indexing of stored files and efficient, durable storage, and it is a secure location for large quantities of unstructured data. Hadoop clusters running in Azure HDInsight can use Azure Blob Storage as their storage layer, and the data lake and the operational data warehouse can consume it through a single integration.
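Hadoop clusters typically address this storage through URIs. The sketch below shows the two common shapes, `wasbs://` for Blob Storage and `abfss://` for Data Lake Storage Gen2. The account, container, and path names are placeholders chosen for illustration, and the helper functions are hypothetical.

```python
def blob_uri(account: str, container: str, path: str = "") -> str:
    """URI a Hadoop job can use to read from Azure Blob Storage over TLS."""
    return f"wasbs://{container}@{account}.blob.core.windows.net/{path.lstrip('/')}"


def data_lake_uri(account: str, filesystem: str, path: str = "") -> str:
    """URI a Hadoop job can use to read from Data Lake Storage Gen2 over TLS."""
    return f"abfss://{filesystem}@{account}.dfs.core.windows.net/{path.lstrip('/')}"


# Placeholder names: "contosostore", "logs", and "raw" are not from the article.
print(blob_uri("contosostore", "logs", "2024/01/app.log"))
print(data_lake_uri("contosostore", "raw", "/sales/orders.csv"))
```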
Azure Table Storage is a cloud service for storing and querying structured, non-relational (NoSQL) data. Client tools such as Microsoft Excel and Microsoft Power BI can connect to it, and applications that need an always-online data store can consume it directly.
Azure Data Lake is at the heart of the Azure Data Insight data warehouse service. Azure Data Insight can serve as a primary data source for queries, a cache for backup and protection against inconsistency, and an engine for working with the data stored in Azure Blob Storage and Azure Table Storage. Azure Data Insight includes the full suite of tools for:
- Hadoop-based computing infrastructure
- Integrated data warehouse
- Data integration and analytics services
- Visualizations and reporting services
Available in the cloud, the Azure Data Insight service is a turnkey solution for large-scale data processing, analytics, and reporting.
In addition to Azure Data Lake and Azure HDInsight, the Azure Data Insight service includes the following built-in capabilities:
- Instant, persistent backups of Azure HDInsight clusters
- Auto-scaling to meet performance demands
- User-configurable and scale-out cluster support
- Discounted compute usage for development and test applications
- Logging for most applications and cluster storage
- Ability to pass log data to Azure Analyze
- Post-processing and dynamic analysis
- Integrated Data Science and Machine Learning services
Azure Stream Analytics allows you to prepare and analyze streaming data in real time, with streaming and batch integration as well as analytical integration with the Cloud Data Lake Service.
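Stream Analytics itself is driven by a SQL-like query language, but the core idea of windowed aggregation over a stream can be sketched in plain Python. The example below is not the service's API. It is a minimal illustration of a tumbling (fixed, non-overlapping) window average, and the `Event` type and sample readings are made up.

```python
from collections import defaultdict
from typing import Dict, Iterable, NamedTuple, Tuple


class Event(NamedTuple):
    timestamp: float  # seconds since an arbitrary start
    device: str
    value: float


def tumbling_window_average(
    events: Iterable[Event], window_seconds: float
) -> Dict[Tuple[float, str], float]:
    """Average `value` per device within fixed, non-overlapping windows."""
    totals = defaultdict(lambda: [0.0, 0])  # (window start, device) -> [sum, count]
    for event in events:
        # Every timestamp maps to exactly one window, identified by its start.
        window_start = (event.timestamp // window_seconds) * window_seconds
        bucket = totals[(window_start, event.device)]
        bucket[0] += event.value
        bucket[1] += 1
    return {key: total / count for key, (total, count) in totals.items()}


readings = [
    Event(1.0, "a", 10.0),
    Event(4.0, "a", 20.0),
    Event(11.0, "a", 30.0),
    Event(12.0, "b", 5.0),
]
averages = tumbling_window_average(readings, window_seconds=10.0)
print(averages)
```

In a real streaming job the events arrive continuously and each window is emitted once it closes. Here the whole list is processed at once to keep the sketch short.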
The Analytics Dashboard (the heart of the Analytics D&I solution) provides actionable insight into your data by recognizing and automatically highlighting unexpected patterns and phenomena.
Remote Data Protection
Azure Data Lake protects your data by encrypting, splitting, and distributing it on your behalf and by using TLS to protect all connections from your data center to the Internet. Data is encrypted both at rest and in transit, which keeps it unreadable to anyone who intercepts it.
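On the client side, the TLS part of this protection amounts to refusing unencrypted or weakly encrypted connections. The following minimal sketch uses Python's standard `ssl` module. The TLS 1.2 floor is an assumption chosen for illustration, not a requirement stated in this article.

```python
import ssl


def strict_tls_context() -> ssl.SSLContext:
    """Create a client-side TLS context that verifies the server certificate."""
    # create_default_context() enables certificate and hostname verification.
    context = ssl.create_default_context()
    # Assumed policy for this sketch: reject anything older than TLS 1.2.
    context.minimum_version = ssl.TLSVersion.TLSv1_2
    return context


context = strict_tls_context()
print(context.minimum_version, context.verify_mode, context.check_hostname)
```

A context like this can be passed to standard-library clients, for example `urllib.request.urlopen(url, context=context)`.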
Azure Data Lake provides data governance services such as stream aggregators, deep linking, and built-in security policies. It offers a unified, flexible, and scalable governance environment that fits a range of data and analytic models, whether on-premises or in the cloud, with all the benefits of centralized management.
Designed for any use case where you need to continuously replicate, process, analyze, and store large amounts of diverse and transient data, Azure Data Lake is both a very flexible SQL platform and a real-time data platform.
Azure Blob Storage
Azure Blob Storage provides scalable, cost-efficient storage of complex data. Azure Data Lake supports the following access modes:
- The Primary mode supports a single query and no more than 1 GB of data.
- The Enhanced mode supports a single query and more than 1 GB of data.
- Stream mode allows for updates of as much as one petabyte per day.
Azure Blob Storage also supports:
- Scalability up to 2 PB in a single cluster.
- Azure Blob snapshots for fast replication and disaster recovery, with queries able to run with or without a snapshot.
- Full customer quotas and policies.
- Data protection that helps meet many compliance and regulatory frameworks, such as HIPAA and GDPR.
- 99.9999% availability and high throughput for backups and replications.
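To put two of the figures in the list above into perspective, the short calculation below converts them into more familiar units. It takes the quoted numbers at face value rather than as verified service limits, and it assumes a decimal petabyte (10^15 bytes) and a 365-day year.

```python
SECONDS_PER_DAY = 24 * 60 * 60
SECONDS_PER_YEAR = 365 * SECONDS_PER_DAY
PETABYTE = 10**15  # assumption: decimal petabyte


def sustained_rate_gb_per_s(bytes_per_day: float) -> float:
    """Average transfer rate, in GB/s, needed to move `bytes_per_day` in a day."""
    return bytes_per_day / SECONDS_PER_DAY / 1e9


def downtime_seconds_per_year(availability_percent: float) -> float:
    """Maximum yearly downtime implied by an availability percentage."""
    return (1 - availability_percent / 100) * SECONDS_PER_YEAR


rate = sustained_rate_gb_per_s(1 * PETABYTE)
downtime = downtime_seconds_per_year(99.9999)
print(f"1 PB/day is about {rate:.1f} GB/s sustained")
print(f"99.9999% availability allows about {downtime:.1f} s of downtime per year")
```

In other words, one petabyte per day is roughly 11.6 GB/s around the clock, and 99.9999% availability leaves roughly half a minute of downtime per year.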
A query that uses the Primary access mode shares data with an Enhanced mode query if both require the same processing environment. Both modes also support running a new query without sharing data.
Metadata can be shared across all query modes through the shared metadata of the Aggregation Pipeline.
Azure MapReduce provides one of the most scalable and reliable parallel processing environments available today. A single vCPU unit can perform up to 500 million similar MapReduce jobs, and one machine can process 50 million map jobs simultaneously, averaging less than one minute per query.
Azure MapReduce supports both batch and stream processing and a range of programming models, including SQL.
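For readers new to the MapReduce model, the sketch below shows its three phases (map, shuffle, reduce) on a word-count problem. It runs in a single Python process and is only an illustration of the programming model, not of how a cluster schedules or distributes the work.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple


def map_phase(line: str) -> List[Tuple[str, int]]:
    """Map: emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs: Iterable[Tuple[str, int]]) -> Dict[str, List[int]]:
    """Shuffle: group all emitted values by key."""
    grouped: Dict[str, List[int]] = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key: str, values: List[int]) -> Tuple[str, int]:
    """Reduce: combine the values for one key into a single result."""
    return key, sum(values)


def word_count(lines: Iterable[str]) -> Dict[str, int]:
    mapped = [pair for line in lines for pair in map_phase(line)]
    return dict(reduce_phase(key, values) for key, values in shuffle(mapped).items())


counts = word_count(["big data", "Big lake"])
print(counts)
```

On a cluster, each map call and each reduce call can run on a different machine, which is what makes the model scale.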
Azure Storage Gateway provides a managed storage connection between your on-premises datacenter and Azure SQL Database.
Azure Data Lake collects, stores, and handles various data types, allowing you to use the many available tools to make sense of your data.
Depending on how you use it, Azure Data Lake offers flexible, usage-based pricing that starts at just $0.005/GB/day. You can begin with one-click integration with any of the existing SQL tools in the Azure Marketplace.
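As a quick worked example of what that starting rate means, the sketch below estimates a storage bill. It uses the $0.005/GB/day figure quoted above as its only input. Actual pricing depends on region, tier, and usage, so treat the result as an illustration rather than a quote.

```python
PRICE_PER_GB_DAY = 0.005  # US dollars, the starting rate quoted in this article


def storage_cost(gigabytes: float, days: int, price: float = PRICE_PER_GB_DAY) -> float:
    """Estimated cost in dollars of keeping `gigabytes` stored for `days` days."""
    if gigabytes < 0 or days < 0:
        raise ValueError("gigabytes and days must be non-negative")
    return round(gigabytes * days * price, 2)


# 1,000 GB kept for 30 days at the quoted starting rate.
monthly = storage_cost(1000, 30)
print(f"${monthly:.2f}")
```

At that rate, 1,000 GB stored for 30 days comes to $150.00.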
Azure Data Lake is the answer to the complexity of managing and processing large volumes of data. It allows you to run rich analytics with the speed and performance of SQL Server and the flexibility of Azure with all the benefits of a managed database service.
Based on Azure SQL Data Warehouse, Azure Data Lake provides the flexibility to use a diverse and complex data set in your real-time applications. The solution is easy to install and easy to maintain, and it has a proven track record of high reliability and uptime. Azure Data Lake provides strong support for data sharing and for migrating data from on-premises to the cloud, with a specific set of APIs to enable a wide variety of integration scenarios.
Azure Data Lake combines the best of SQL Server and Azure. It allows you to take advantage of Azure’s scale and performance and SQL Server’s flexibility.
Simplilearn offers courses to help you deepen your understanding of data lakes. The Data Engineering Certification Program in partnership with Purdue University covers managing Big Data, including Azure data lakes. You can add to your knowledge of Azure with our cloud computing programs, including the Post Graduate Program in Cloud Computing in collaboration with Caltech CTME.