1. What's a Data Lake? - Data Blog by Paing

![[Pasted image 20241120133812.png]]

- A Data Lake can be described as a central, shared place that brings together all of an organization's or business's data, machine learning, and analytics.
- A Data Lake can hold unstructured, semi-structured, and structured data.
- It can also scale no matter how large the data volume grows.
- It can break down data silos, for example at a media and telecom company that sells many products.
- Breaks data silos
    - To understand data silos, consider this example. Let's say a media and telecom company has multiple products: internet services, cable TV services, home security, and monitoring services. Customers can buy each of these products individually or as a bundle.
    - In reality, each of these products is handled by a completely different team or department in the organization. Each product uses its own storage and database systems to store the data it needs.
    - This makes it very hard to analyze the customer from a 360-degree perspective. You want to understand customer behavior across all of your product lines, but because each team manages its own product and stores data in its own system, data silos form.
    - A Data Lake solves this problem by bringing all the crucial enterprise data under one centralized system. This makes it easier for different teams within a company to collaborate.
- Schema on Read (the schema is needed only at read time)
    - Data Lakes use the Schema on Read technique: you keep writing data into the data lake in its original raw format and enforce a schema only when you read the data back.
    - This differs from Data Warehousing. In a data warehouse (just like in relational databases) you need to specify the schema first. Requiring a schema up front makes it very hard to write many different types of data at scale, because schemas generally evolve quickly.
- Enables Analytics Use Cases
    - The Data Lake democratizes data and is a cost-effective way to store all of an organization's data for later processing. Data analysts within the company get more visibility and easier access to the data they need, so they can focus on finding meaningful patterns in it.

## Differences between DL and DW

- Data Warehouse
    - Refined, structured, relational data
    - Schema designed at the beginning
    - Used mainly by business analysts
    - Used for historical analytics, visualizations, BI
    - Collects similar data from multiple sources
- Data Lake
    - Raw, unrefined data, including unstructured and non-relational data
    - Schema designed at the end, when the data is read
    - Used by data scientists, data developers, and business analysts
    - Used for predictive analytics, machine learning
    - Connects various types of data from a wide variety of sources

## Data Lake Elements

- Governance
    - A broad term covering management of the data lake as a whole. It includes availability, monitoring, and access control policies.
- Security
    - Secure your data using encryption at rest and in transit. Authentication, authorization, accounting, and data protection are key features.
- Quality
    - Ensure that the data in the data lake is up to date and consistent. Keep in mind _garbage in -> garbage out_.
- Catalog
    - A centralized metadata repository describing the actual data sets, with descriptions, purpose, schema, and lineage information. This helps break data silos.
- Audit
    - Track data access from every application and data resource. Build log trails with proactive monitoring and alerting capabilities.
- Lineage
    - Tracks the data's origin, where the data moves to, and what happens to it along the way.
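
To make the Schema on Read idea from earlier in this note more concrete, here is a minimal Python sketch. It is only an illustration, not how any particular data lake product works: the records, the `BILLING_SCHEMA` mapping, and the `read_with_schema` helper are all hypothetical names invented for this example. Raw records are kept exactly as they arrived, and a schema is applied only when a reader asks for the data.

```python
import json

# Raw events are stored exactly as they arrive; nothing is validated on write.
# These records are made up for illustration.
RAW_EVENTS = [
    '{"customer_id": "42", "product": "internet", "monthly_fee": "59.90"}',
    '{"customer_id": 7, "product": "cable_tv"}',
    '{"customer_id": "13", "product": "home_security", "monthly_fee": 25, "sensors": 4}',
]

# Hypothetical schema chosen by the reader: field name -> (converter, default).
BILLING_SCHEMA = {
    "customer_id": (int, None),
    "product": (str, None),
    "monthly_fee": (float, 0.0),
}


def read_with_schema(raw_lines, schema):
    """Parse raw JSON lines and coerce each record to the schema at read time."""
    for line in raw_lines:
        record = json.loads(line)
        row = {}
        for field, (convert, default) in schema.items():
            value = record.get(field, default)
            # Fields that are missing and have no default stay as None.
            row[field] = convert(value) if value is not None else None
        yield row


rows = list(read_with_schema(RAW_EVENTS, BILLING_SCHEMA))
for row in rows:
    print(row)
```

Because the schema lives with the reader, another team could read the same raw records with a different mapping (for example one that keeps `sensors`) without rewriting anything that is already stored.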