## Consideration 1: File Size
- Using a small number of larger files leads to better performance than a large number of small files: every file adds per-file overhead (opens, metadata lookups, scheduler tasks), so many small files slow down distributed readers. A compaction sketch follows below.
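A minimal sketch of one common remedy, small-file compaction, assuming pyarrow is available; the directory paths and size targets below are illustrative assumptions, not fixed recommendations:

```python
# Hypothetical sketch: compacting many small Parquet files into fewer,
# larger ones with pyarrow. Paths and size targets are assumptions.
import pyarrow.dataset as ds

# Treat all the small files under one directory as a single logical dataset
small_files = ds.dataset("data/events_small/", format="parquet")

# Rewrite as larger files: fewer files means less per-file open/metadata
# and task-scheduling overhead for downstream engines
ds.write_dataset(
    small_files,
    "data/events_compacted/",
    format="parquet",
    max_rows_per_file=5_000_000,   # aim for a few large files
    max_rows_per_group=1_000_000,  # row groups stay a manageable size
)
```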
## Consideration 2: Type of Storage (Row-Oriented vs. Column-Oriented)
### Row-Oriented Format
- The traditional and most widely used layout: data is stored and retrieved one row at a time, so a query that needs only a few fields still reads every field of each row.
- Example: with a table of 1,000 columns, reading just 10 of them still requires scanning entire rows (all 1,000 columns) and then filtering out the unneeded data (see the CSV sketch below). Because of this, row-oriented formats (such as CSV, TXT, Avro, or relational databases) are NOT efficient for operations that span entire datasets, which makes aggregations expensive.
- Individual records are easy to read and write, which is why relational databases typically use this layout for online transaction processing (OLTP).
- Typical compression mechanisms are less effective than in column-oriented stores, because adjacent values within a row belong to different columns and rarely repeat.
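To make the cost concrete, a small stdlib-only sketch (the file name and column names are hypothetical): even when only two columns are needed, a CSV reader must still read and parse every field of every row:

```python
# Sketch: row-oriented cost in a plain CSV. Even though only two columns
# are used, every line is read in full and all fields are parsed.
import csv

with open("wide_table.csv", newline="") as f:  # hypothetical wide file
    for row in csv.DictReader(f):       # parses ALL columns of each row
        print(row["id"], row["score"])  # only 2 columns actually needed
```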
### Column-Oriented Format
- Data is stored and retrieved column by column, so a query reads only the columns it actually needs.
- Example: with a table of 1,000 columns, reading 10 of them means the other 990 are never touched on disk. This leads to huge performance gains for analytical queries (see the column-pruning sketch after this list).
- Many big-data tools and query engines are optimized for column-oriented formats (and some NoSQL databases use related wide-column layouts), which leads to efficient analytical performance.
- Column-oriented formats achieve high compression ratios because each column holds values of a single type, often with few distinct values, which suits encodings such as run-length and dictionary encoding.
- Examples of column-oriented data formats: ORC, Parquet.
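A minimal sketch of column pruning with pyarrow (the file and column names are assumptions): only the requested columns are decoded, and the rest of the file is skipped:

```python
# Sketch: column pruning on a Parquet file with pyarrow.
import pyarrow.parquet as pq

# Only "id" and "score" are read and decoded; the remaining columns
# in the file are never touched on disk.
table = pq.read_table("wide_table.parquet", columns=["id", "score"])
print(table.num_rows, table.column_names)
```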
## Consideration 3: Schema Evolution
Questions to ask (a schema-evolution sketch follows the list):
1. Does the data format support schema evolution?
2. What is the impact on the existing data set if the new data set is received with an updated schema?
3. How easy or hard is it to update the schema (such as adding, removing, renaming, restructuring, or changing the data type of a field)?
4. Does your data need to be human-readable?
5. What is the impact of schema evolution over the file size and processing speed?
6. How would you store different versions of the schema?
7. How will different versions of the schema integrate with each other for processing?
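As a concrete illustration of questions 1-3, a minimal sketch of Avro schema evolution using the fastavro library (the `User` record and its field names are assumptions): a field added with a default keeps old files readable under the new schema:

```python
# Sketch: Avro schema evolution with fastavro. A field is added in v2
# with a default, so data written under v1 still resolves cleanly.
import io
import fastavro

schema_v1 = fastavro.parse_schema({
    "type": "record", "name": "User",
    "fields": [
        {"name": "id", "type": "long"},
        {"name": "name", "type": "string"},
    ],
})

schema_v2 = fastavro.parse_schema({
    "type": "record", "name": "User",
    "fields": [
        {"name": "id", "type": "long"},
        {"name": "name", "type": "string"},
        # new in v2: a default makes the change backward compatible
        {"name": "email", "type": ["null", "string"], "default": None},
    ],
})

# Write a record using the old (v1) writer schema
buf = io.BytesIO()
fastavro.writer(buf, schema_v1, [{"id": 1, "name": "Ada"}])
buf.seek(0)

# Read it back with the new (v2) reader schema; the default fills the gap
for record in fastavro.reader(buf, reader_schema=schema_v2):
    print(record)  # {'id': 1, 'name': 'Ada', 'email': None}
```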
## Consideration 4 & 5: Compression & Splittability
|**Compression Format**|**Splittable?**|**Read/Write Speed**|**Compression Level**|
|---|---|---|---|
|gzip (`.gz`)|No|Medium|Medium|
|bzip2 (`.bz2`)|Yes|Slow|High|
|LZO (`.lzo`)|Yes, if indexed|Fast|Average|
|Snappy (`.snappy`)|No*|Fast|Average|

(*) Snappy by itself is not splittable, but inside block-based container formats such as Parquet, ORC, or Avro the file remains splittable, because each block/row group is compressed independently.
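In block-based formats the codec is chosen at write time; a short sketch with pyarrow, where the file names and codec choices are illustrative assumptions:

```python
# Sketch: selecting a compression codec when writing Parquet with pyarrow.
import pyarrow as pa
import pyarrow.parquet as pq

table = pa.table({"user_id": [1, 2, 3], "clicks": [10, 42, 7]})

# Snappy: fast to read/write, moderate ratio (a common default)
pq.write_table(table, "events_snappy.parquet", compression="snappy")

# Gzip: slower, but smaller files (often used for colder data)
pq.write_table(table, "events_gzip.parquet", compression="gzip")
```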
- Newer formats and technologies: Avro, Parquet, ORC, Feather, Arrow ([Link1](https://medium.com/@a.anekpattanakij/big-data-file-formats-introduction-to-arvo-parquet-and-orc-file-d153f0f20b1e), [Link2](https://pnavaro.github.io/big-data/14-FileFormats.html))
- Data engineers need to weigh which file format best suits the workload by comparing the pros and cons of each.
![[Pasted image 20241120140230.png|600]]
- Comparison of data formats:
| | **PARQUET** | **AVRO** | **ORC** |
| --------------------------- | ----------- | -------- | ------- |
| **Schema Evolution** | Good | Best | Better |
| **Compression** | Better | Good | Best |
| **Splittability**           | Good        | Good     | Best    |
| **Row vs. Column Oriented** | Column | Row | Column |
| **Optimized For**           | Read-heavy  | Write-heavy | Read-heavy |
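To explore these trade-offs concretely, a sketch (assuming pyarrow for Parquet/ORC and fastavro for Avro; the data and paths are illustrative) that writes the same table in all three formats so file sizes, and with timing added, read/write speeds, can be measured locally:

```python
# Sketch: write identical data as Parquet, ORC, and Avro to compare
# the resulting file sizes on disk.
import os
import pyarrow as pa
import pyarrow.parquet as pq
import pyarrow.orc as orc
import fastavro

table = pa.table({
    "id": list(range(100_000)),
    "city": ["NYC", "SFO", "LAX", "SEA"] * 25_000,
})

pq.write_table(table, "sample.parquet")  # column-oriented
orc.write_table(table, "sample.orc")     # column-oriented

avro_schema = fastavro.parse_schema({
    "type": "record", "name": "Row",
    "fields": [
        {"name": "id", "type": "long"},
        {"name": "city", "type": "string"},
    ],
})
with open("sample.avro", "wb") as f:     # row-oriented
    fastavro.writer(f, avro_schema, table.to_pylist())

for path in ("sample.parquet", "sample.orc", "sample.avro"):
    print(path, os.path.getsize(path), "bytes")
```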
- Pros of each format:
![[Pasted image 20241120152708.png]]