Optimization
This topic provides a detailed examination of the optimization process. LeapLogic’s intelligent grammar engine identifies optimization opportunities at schema, and orchestration levels.
In this Topic:
Schema
Schema refers to how data is organized within a database. Optimizing schema is critical based on the complexity of the queries in the schema. By interpreting the input files, DDL files help in retrieving schema-level information and generating corresponding recommendations for the target. If the target is Databricks, the schema optimization strategies are Bloom Filter index, ZOrder index, partition by, tune based on workload, tune based on table size, and Compaction (bin-packing).
Schema optimization opportunities include:
- Increasing parallelism
- Joins and Update/Delete/Merge operations can be optimized
- Entity Name: Name of the entity.
- Bloom Filter Index: A Bloom filter index is a space-efficient data structure that enables data skipping on chosen columns, particularly for fields containing arbitrary text.
- ZOrder Index: ZORDER BY co-locates the column information in the same set of files. Co-locality is used by Delta Lake data-skipping algorithms to dramatically reduce the amount of data that needs to be read.
- Partition By: Partition based on the selected value.
- Tune Based on Workload: If a Delta table experiences frequent MERGE, UPDATE, or DELETE operation, then this helps to tune file sizes for rewrites by setting the table property delta.tuneFileSizesForRewrites.
- Tune Based on Table Sizes: Helps to configure the target file size based on source table size. For example, for 1 TB table size, the size of the target file should be 256 MB; for 10 TB, it should be 1 GB and likewise.
- Compaction (bin-packing): Delta Lake on Databricks improves the speed of read queries from a table by coalescing small files into larger ones.
- Data Source: Provides data source details.
- Entity Type: Displays the type of entities.
- Frequency of Use: Displays the frequency of table used.
- Transactional: Displays the number of data flow components that are used to perform sorting, merging, data cleansing and so on.
- Data Volume: Displays the quantity of data.
- Primary Key: To identify unique row in the table. It will not accept NULL values.
- Unique Key: To identify records in a table.
- Details: Provides additional details about the entities.
Orchestration
This section details data splits and recommends the maximum possible number of queries that can be executed in parallel for orchestration optimization.
Orchestration helps to easily manage workflows to accomplish a specific task. In orchestration, the independent queries of each file are divided into different splits that run in parallel to increase efficiency.
As per the above figure, the file contains 5 queries, with 1 split based on query dependency. By running the splits in parallel, we can enhance the performance and execution time.
To get more insight, navigate to Files‘ row > Number of splits row, you can view the query and graphical distribution of the statement type.
Orchestration’s GRAPH displays the dependencies between files, splits, and output tables. To get the dependency information regarding splits and queries, choose the preferred node.