Optimization
This topic provides a detailed examination of the optimization process. LeapLogic’s intelligent grammar engine identifies optimization opportunities at the schema and orchestration levels.
In this Topic:
- Schema
- Orchestration
Schema
A schema defines how data is organized within a database. Schema optimization is critical, and its impact grows with the complexity of the queries that run against the schema. LeapLogic interprets the input DDL files to retrieve schema-level information and generate corresponding recommendations for the target. If the target is Databricks, schema optimization is based on liquid clustering, which organizes data according to the columns specified in the Cluster By key.
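For illustration, the following sketch shows how a liquid-clustered Delta table might be declared on Databricks. The table and column names (sales_orders, customer_id, order_date) are hypothetical examples, not output of the tool; CLUSTER BY is the standard Databricks clause for specifying liquid clustering keys.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# CLUSTER BY declares the liquid clustering keys; Databricks then organizes
# the table's data files around these columns rather than fixed partitions.
spark.sql("""
    CREATE TABLE IF NOT EXISTS sales_orders (
        order_id    BIGINT,
        customer_id BIGINT,
        order_date  DATE,
        amount      DECIMAL(10, 2)
    )
    USING DELTA
    CLUSTER BY (customer_id, order_date)
""")
```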
Schema optimization opportunities include:
- Increasing parallelism
- Optimizing joins and Update/Delete/Merge operations
The schema optimization table displays the following fields:
- Entity Name: Name of the entity.
- Cluster By: The column name specified here is used to organize data within the table.
- Frequency of Use: Displays how frequently the table is used.
- Primary Key: Identifies a unique row in the table; it does not accept NULL values.
- Tune Based on Workload: If a Delta table experiences frequent MERGE, UPDATE, or DELETE operations, this recommendation tunes file sizes for rewrites by setting the table property delta.tuneFileSizesForRewrites, as shown in the sketch after this list.
- Tune Based on Table Sizes: Configures the target file size based on the source table size. For example, a 1 TB table maps to a 256 MB target file size, a 10 TB table to 1 GB, and so on.
- Compaction (bin-packing): Delta Lake on Databricks improves the speed of read queries from a table by coalescing small files into larger ones.
- Data Source: Provides data source details.
- Entity Type: Displays the type of the entity.
- Transactional: Displays the number of data flow components used to perform sorting, merging, data cleansing, and so on.
- Data Volume: Displays the quantity of data.
- Unique Key: Identifies records in a table.
- Details: Provides additional details about the entities.
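The file-size and compaction recommendations above correspond to standard Delta Lake table properties and the OPTIMIZE command. The sketch below assumes a hypothetical table named sales_orders; the property names and the OPTIMIZE command are standard Delta Lake on Databricks features, but the values shown are illustrative only.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Tune Based on Workload: mark the table as rewrite-heavy so Delta Lake
# chooses smaller file sizes, making MERGE/UPDATE/DELETE rewrites cheaper.
spark.sql("""
    ALTER TABLE sales_orders
    SET TBLPROPERTIES ('delta.tuneFileSizesForRewrites' = 'true')
""")

# Tune Based on Table Sizes: alternatively, pin an explicit target file
# size (e.g. 256 MB for a ~1 TB table, per the guideline above).
spark.sql("""
    ALTER TABLE sales_orders
    SET TBLPROPERTIES ('delta.targetFileSize' = '256mb')
""")

# Compaction (bin-packing): coalesce small files into larger ones to
# speed up read queries against the table.
spark.sql("OPTIMIZE sales_orders")
```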
Orchestration
This section describes data splits and recommends the maximum number of queries that can be executed in parallel for orchestration optimization.
Orchestration makes it easy to manage the workflows that accomplish a specific task. During orchestration, the independent queries in each file are divided into separate splits that run in parallel to increase efficiency.
As shown in the figure above, the Zero_code.sql file contains 57 queries divided into 10 splits based on query dependency. Running the splits in parallel improves performance and reduces overall execution time.
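To make the split idea concrete, the sketch below runs mutually independent splits on separate threads while preserving query order inside each split. This is an illustration only, not LeapLogic’s implementation; the splits and queries are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Each split is a list of queries that depend on one another;
# the splits themselves are independent of each other.
splits = [
    ["INSERT INTO stage_a SELECT * FROM src_a"],
    ["INSERT INTO stage_b SELECT * FROM src_b"],
    # ... further splits derived from the dependency analysis
]

def run_split(queries):
    # Queries within one split are dependent, so run them in order.
    for query in queries:
        spark.sql(query)

# Independent splits can execute concurrently.
with ThreadPoolExecutor(max_workers=len(splits)) as pool:
    list(pool.map(run_split, splits))
```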
Navigate through the files in the table to get more insight into the number of splits. Here, you can view the queries and a graphical distribution of the statement types.
The GRAPH view in Orchestration displays the dependencies between files, splits, and output tables. To view dependency information for splits and queries, select the preferred node.