Redshift Artifact Extraction Prerequisites

This topic describes the prerequisites for extracting Redshift artifacts for assessment.

In This Topic:

  • Introduction
  • Artifact Extraction
  • Getting Help



Introduction

LeapLogic’s Assessment profiles existing inventory, identifies complexity, performs dependency analysis, and provides recommendations for migration to a modern data platform.


Artifact Extraction

LeapLogic requires certain artifacts to perform an assessment. As a prerequisite, you need superuser privileges to fetch the required data. You can copy the artifacts from your Git instance or from the Redshift repository where DDL scripts, stored procedures, functions, query execution logs, DML scripts, and other database objects are stored. LeapLogic needs all of these artifacts as .sql files.
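If it helps to keep the hand-off organized, the sketch below shows one possible way to group the extracted .sql files before sharing them. The directory layout and file names are illustrative assumptions, not a LeapLogic requirement; adjust them to whatever you actually export.

#!/bin/bash
# Illustrative packaging sketch -- directory and file names are assumptions.
# Group the extracted .sql files by artifact type and bundle them for hand-off.
mkdir -p redshift_artifacts/{table_ddl,view_ddl,procedures,functions,query_logs,dml}
cp table_ddl_output.sql redshift_artifacts/table_ddl/
cp view_ddl_output.sql  redshift_artifacts/view_ddl/
cp sp_ddl_output.sql    redshift_artifacts/procedures/
zip -r redshift_artifacts.zip redshift_artifacts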

DDL Scripts

Table DDLs

There are two options for extracting table DDL scripts: a Python script, or a shell script that uses psql.

Note:

To use the Python script, you need the redshift-connector module. For the Shell script, you need a psql client.

Table DDLs using Python

Install the redshift_connector module using the pip command.

pip install redshift_connector

Use the pip3 command if you have multiple Python versions installed.

pip3 install redshift_connector

Use the attached Python script and put it on your local machine or server. Run it from there to get the table DDLs.

# Connect to the cluster
import redshift_connector

conn = redshift_connector.connect(
    host='redshift_host',
    database='database',
    port=5439,
    user='username',
    password='password'
)

output_file = r'/path/table_ddl_output.sql'

# Create a Cursor object
cursor = conn.cursor()

# Query the catalog for all user tables
res = cursor.execute("""SELECT DISTINCT schemaname || '.' || tablename FROM pg_tables
    WHERE schemaname NOT IN ('pg_catalog', 'information_schema') limit 10""").fetchall()

for value in res:
    print(value[0])
    res2 = cursor.execute("""show table """ + value[0]).fetchone()
    # Write the result to the file
    with open(output_file, 'a') as f:
        f.write('-- Table: ' + value[0] + '\n')
        f.write(str(res2[0]) + '\n\n')

Replace the placeholder connection details (host, database, user, password) and the output path with your own values. This script exports all the table DDLs. Run it for every database in the Redshift cluster.

Note:

Please remember to remove the limit 10 clause from the above script before executing it.

The Python script is also attached as a text file.

Download: table_ddl
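Since the script must be run once per database, a small wrapper can help. The sketch below is a hypothetical example that assumes you save the Python script as table_ddl.py and modify it to read the database name from the command line (for example via sys.argv); adjust it to your environment.

#!/bin/bash
# Hypothetical wrapper -- assumes table_ddl.py accepts the database name as an argument.
for DB in db1 db2 db3    # replace with the databases in your Redshift cluster
do
    echo "Exporting table DDLs for database: $DB..."
    python3 table_ddl.py "$DB"
done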

Table DDLs using Shell Script

Use the attached Shell script and put it on your local machine or server. Run it from there to get the table DDLs.

Note:

Run the script on a machine that has the psql client installed.

#!/bin/bash

# Database connection parameters
export PGHOST="redshift-host"
export PGDATABASE="database"
export PGPORT=5439
export PGUSER="username"
export PGPASSWORD="password"

# Output file
OUTPUT_FILE="/path/output/table_ddl_output.sql"

# Query to get table names
QUERY="SELECT DISTINCT schemaname || '.' || tablename FROM pg_tables WHERE schemaname NOT IN ('pg_catalog','information_schema') limit 10;"

# Execute the query and save the result in a variable
TABLES=$(psql -AXqtc "$QUERY")

# Loop over the tables
for TABLE in $TABLES
do
    # Query to get the table definition
    echo "Exporting Table DDL: $TABLE..."
    QUERY="SHOW TABLE $TABLE;"

    # Execute the query and save the result in a variable
    TABLE_DEFINITION=$(psql -AXqtc "$QUERY")

    # Write the table name and its definition to the output file
    echo "-- Table: $TABLE" >> $OUTPUT_FILE
    echo "$TABLE_DEFINITION" >> $OUTPUT_FILE
    echo "" >> $OUTPUT_FILE
done

Replace the placeholder connection details and output path with your own values, and remove the limit 10 clause before running. This script exports all the table DDLs. Run it for every database in the Redshift cluster.

The shell script is also attached as a text file.

Download: table_ddl
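As a usage example, assuming the attached shell script is saved as table_ddl.sh:

chmod +x table_ddl.sh
./table_ddl.sh

The exported DDLs are appended to the file set in OUTPUT_FILE.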

View DDLs

There are two options for extracting view DDL scripts: a Python script, or a shell script that uses psql.

Note:

To use the Python script, you need the redshift-connector module. For the shell script, you need a psql client.

View DDLs using Python

Use the attached Python script and put it on your local machine or server. Run it from there to get the view DDLs.

# Connect to the cluster
import redshift_connector

conn = redshift_connector.connect(
    host='redshift_host',
    database='database',
    port=5439,
    user='username',
    password='password'
)

output_file = r'/path/view_ddl_output.sql'

# Create a Cursor object
cursor = conn.cursor()

# Query the catalog for all user views
res = cursor.execute("""SELECT DISTINCT schemaname || '.' || viewname FROM pg_views
    WHERE schemaname NOT IN ('pg_catalog', 'information_schema') limit 10""").fetchall()

for value in res:
    print(value[0])
    res2 = cursor.execute("""show view """ + value[0]).fetchone()
    # Write the result to the file
    with open(output_file, 'a') as f:
        f.write('-- View: ' + value[0] + '\n')
        f.write(str(res2[0]) + '\n\n')

Replace the placeholder connection details and output path with your own values, and remove the limit 10 clause before running. This script exports all the view DDLs. Run it for every database in the Redshift cluster.

The Python script is also attached as a text file.

Download: view_ddl

View DDLs using Shell Script

Copy the attached shell script to your local machine or server and run it from there to get the view DDLs.

Note:

Run the script on a machine that has the psql client installed.

#!/bin/bash

# Database connection parameters
export PGHOST="redshift-host"
export PGDATABASE="database"
export PGPORT=5439
export PGUSER="username"
export PGPASSWORD="password"

# Output file
OUTPUT_FILE="/path/output/view_ddl_output.sql"

# Query to get view names
QUERY="SELECT DISTINCT schemaname || '.' || viewname FROM pg_views WHERE schemaname NOT IN ('pg_catalog','information_schema') LIMIT 10;"

# Execute the query and save the result in a variable
VIEWS=$(psql -AXqtc "$QUERY")

# Loop over the views
for VIEW in $VIEWS
do
    # Query to get the view definition
    echo "Exporting view DDL: $VIEW..."
    QUERY="SHOW VIEW $VIEW;"

    # Execute the query and save the result in a variable
    VIEW_DEFINITION=$(psql -AXqtc "$QUERY")

    # Write the view name and its definition to the output file
    echo "-- View: $VIEW" >> $OUTPUT_FILE
    echo "$VIEW_DEFINITION" >> $OUTPUT_FILE
    echo "" >> $OUTPUT_FILE
done

Replace the placeholder connection details and output path with your own values, and remove the LIMIT 10 clause before running. This script exports all the view DDLs. Run it for every database in the Redshift cluster.

The shell script is also attached as a text file.

Download: view_ddl

Query Execution Logs

Similarly, there are two options for extracting the query execution logs: the UNLOAD command, or the AWS Redshift Query Editor.

Note:

We recommend using the UNLOAD command, especially when the file size is substantial.

Using UNLOAD Command

The UNLOAD command generates the required data files in an S3 bucket. See the prerequisites below.

  • 's3://bucket_name/path' – the S3 path where the files are generated.
  • iam_role 'arn:aws:iam::<aws acct num>:role/<redshift role>' – the IAM role of the Redshift cluster.

To export the required query execution logs, use the following UNLOAD command as a reference. Change the date range (cast(ss.starttime AS date) between '2023-12-01' and '2024-04-30') as required, and replace the S3 path and IAM role with your own values.

unload
(
$$
SELECT
    'IDWWM' || '~~' ||
    coalesce(sui.usename,'') || '~~' ||
    0 || '~~' ||
    'client_u' || '~~' ||
    NULLIF(sqlog.starttime,'9999-12-31') ::varchar || '~~' ||
    coalesce(sqm.query_cpu_time,0) ::varchar || '~~' ||
    0 || '~~' ||
    coalesce(x.byt,0) ::varchar || '~~' ||
    coalesce(datediff(millisecond, sc.endtime, sc.starttime),0) ::varchar || '~~' ||
    NULLIF(sqlog.starttime,'9999-12-31') ::varchar || '~~' ||
    NULLIF(sqlog.starttime,'9999-12-31') ::varchar || '~~' ||
    coalesce(sqlog.pid,999) ::varchar || '~~' ||
    coalesce(sqlog.query,999) ::varchar || '~~' ||
    'NA' ::VARCHAR || '~~' ||
    'NA' ::VARCHAR || '~~' ||
    coalesce(max(sqm.query_cpu_time) OVER (PARTITION BY sqm.query ORDER BY NULL ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW),0)::varchar || '~~' ||
    coalesce(x.max_bytes,0)::varchar || '~~' ||
    coalesce(sqm.query_execution_time,0) ::varchar || '~~' ||
    'Not Present' || '~~' ||
    ss.sequence::varchar || '~~' ||
    coalesce(ss.type,'') || '~~' ||
    '' || '~~' ||
    coalesce(sqlog.label,'') || '~~' ||
    ss.text ::VARCHAR(MAX)
FROM SVL_STATEMENTTEXT ss
LEFT JOIN SVL_QLOG sqlog
    ON ss.xid = sqlog.xid
    AND ss.pid = sqlog.pid
    AND ss.userid = sqlog.userid
LEFT JOIN SVL_QUERY_METRICS sqm
    ON sqm.query = sqlog.query
LEFT JOIN svl_compile sc
    ON sqm.query = sc.query
    AND sqlog.pid = sc.pid
    AND sqlog.xid = sc.xid
LEFT JOIN SVL_USER_INFO sui
    ON sui.usesysid = sqm.userid
LEFT JOIN (
    SELECT query, sum(bytes) AS byt, coalesce(max(bytes),0) AS max_bytes
    FROM SVL_QUERY_SUMMARY
    WHERE userid > 1
    GROUP BY query
) x ON sqlog.query = x.query
WHERE ss.userid > 1
    AND sqlog.userid > 1
    AND cast(ss.starttime AS date) between '2023-12-01' and '2024-04-30'
$$)
to 's3://bucket_name/path'
iam_role 'arn:aws:iam::<aws acct num>:role/<redshift role>'
DELIMITER '|'
GZIP
ALLOWOVERWRITE;

Note:

The UNLOAD command runs in parallel, so it generates multiple files in the S3 bucket.
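To collect the generated parts locally, you can use the AWS CLI. The sketch below is an example only, assuming the same bucket and path used in the UNLOAD command above and a local query_logs directory.

# Download all generated parts and combine them into a single file (example only).
aws s3 cp s3://bucket_name/path/ ./query_logs/ --recursive
gunzip ./query_logs/*.gz          # the UNLOAD above uses GZIP
cat ./query_logs/* > query_execution_logs.txt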

Using AWS Redshift Query Editor

To export the required query execution logs, refer to the script below. Remember to change the date range (cast(ss.starttime AS date) between '2023-12-01' and '2024-04-30') as required.

The recommended timeframe is three months of end-to-end query logs. Execute the script below in the AWS Redshift Query Editor UI and download the output after execution.

SELECT
    'IDWWM' || '~~' ||
    coalesce(sui.usename,'') || '~~' ||
    0 || '~~' ||
    'client_u' || '~~' ||
    NULLIF(sqlog.starttime,'9999-12-31') ::varchar || '~~' ||
    coalesce(sqm.query_cpu_time,0) ::varchar || '~~' ||
    0 || '~~' ||
    coalesce(x.byt,0) ::varchar || '~~' ||
    coalesce(datediff(millisecond, sc.endtime, sc.starttime),0) ::varchar || '~~' ||
    NULLIF(sqlog.starttime,'9999-12-31') ::varchar || '~~' ||
    NULLIF(sqlog.starttime,'9999-12-31') ::varchar || '~~' ||
    coalesce(sqlog.pid,999) ::varchar || '~~' ||
    coalesce(sqlog.query,999) ::varchar || '~~' ||
    'NA' ::VARCHAR || '~~' ||
    'NA' ::VARCHAR || '~~' ||
    coalesce(max(sqm.query_cpu_time) OVER (PARTITION BY sqm.query ORDER BY NULL ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW),0)::varchar || '~~' ||
    coalesce(x.max_bytes,0)::varchar || '~~' ||
    coalesce(sqm.query_execution_time,0) ::varchar || '~~' ||
    'Not Present' || '~~' ||
    ss.sequence::varchar || '~~' ||
    coalesce(ss.type,'') || '~~' ||
    '' || '~~' ||
    coalesce(sqlog.label,'') || '~~' ||
    ss.text ::VARCHAR(MAX)
FROM SVL_STATEMENTTEXT ss
LEFT JOIN SVL_QLOG sqlog
    ON ss.xid = sqlog.xid
    AND ss.pid = sqlog.pid
    AND ss.userid = sqlog.userid
LEFT JOIN SVL_QUERY_METRICS sqm
    ON sqm.query = sqlog.query
LEFT JOIN svl_compile sc
    ON sqm.query = sc.query
    AND sqlog.pid = sc.pid
    AND sqlog.xid = sc.xid
LEFT JOIN SVL_USER_INFO sui
    ON sui.usesysid = sqm.userid
LEFT JOIN (
    SELECT query, sum(bytes) AS byt, coalesce(max(bytes),0) AS max_bytes
    FROM SVL_QUERY_SUMMARY
    WHERE userid > 1
    GROUP BY query
) x ON sqlog.query = x.query
WHERE ss.userid > 1
    AND sqlog.userid > 1
    AND cast(ss.starttime AS date) between '2023-12-01' and '2024-04-30';

Other Database Objects

For a better assessment of your environment and workloads, we also recommend exporting additional database objects. Refer to the scripts below to export the data as separate delimited files.
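One possible way to capture each query's output as a delimited file is with psql. The sketch below is an example only, assuming the query has been saved to database_objects.sql and the PG* connection variables are exported as in the shell scripts above.

# Example only -- run a saved query and write pipe-delimited output to a file.
psql -AXqt -F '|' -f database_objects.sql -o database_objects.csv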

-- Database objects: run the below query for every database using superuser credentials
SELECT n.nspname AS schema_name
     , CASE WHEN c.relkind = 'v' THEN 'view' WHEN c.relkind = 'i' THEN 'index' ELSE 'table' END AS table_type
     , count(c.relname)
FROM pg_class AS c
LEFT JOIN pg_namespace n ON n.oid = c.relnamespace
LEFT JOIN pg_tablespace t ON t.oid = c.reltablespace
LEFT JOIN pg_description AS d ON (d.objoid = c.oid AND d.objsubid = 0)
WHERE n.nspname NOT IN ('information_schema', 'pg_catalog')
GROUP BY n.nspname, CASE WHEN c.relkind = 'v' THEN 'view' WHEN c.relkind = 'i' THEN 'index' ELSE 'table' END
UNION
SELECT n.nspname,
       'Stored_procedure' AS table_type,
       count(p.prosrc)
FROM pg_catalog.pg_namespace n
JOIN pg_catalog.pg_proc p ON pronamespace = n.oid
JOIN pg_catalog.pg_user b ON b.usesysid = p.proowner
WHERE nspname NOT IN ('information_schema', 'pg_catalog')
GROUP BY n.nspname, table_type

-- Databases and schemas
SELECT 'Databases', count(*) FROM pg_database
UNION
SELECT 'Schemas', count(*) FROM pg_namespace WHERE nspname NOT IN ('information_schema', 'pg_catalog')

-- High data volume tables
WITH tbl_ids AS
(
  SELECT DISTINCT oid
  FROM pg_class c
  WHERE --relowner > 1
     relkind = 'r'
),
pcon AS
(
  SELECT conrelid,
         CASE
           WHEN SUM(CASE WHEN contype = 'p' THEN 1 ELSE 0 END) > 0 THEN 'Y'
           ELSE 'N'
         END pk,
         CASE
           WHEN SUM(CASE WHEN contype = 'f' THEN 1 ELSE 0 END) > 0 THEN 'Y'
           ELSE 'N'
         END fk,
         conname
  FROM pg_constraint
  WHERE conrelid > 0
  AND   conrelid IN (SELECT oid FROM tbl_ids)
  GROUP BY conrelid, conname
)
SELECT
  database
  ,SCHEMA AS schemaname
  ,"table" AS tablename
  ,tbl_rows AS num_rows
  ,size AS size_mb
  ,pcon.pk
  ,pcon.conname
FROM svv_table_info ti
LEFT JOIN pcon ON pcon.conrelid = ti.table_id
WHERE ti.SCHEMA !~ '^information_schema|catalog_history|pg_' AND size_mb = 10000

-- Data for partitioning / bucketing
SELECT
  database
  ,SCHEMA AS schemaname
  ,"table" AS table_name
  ,size AS size_mb
  ,tbl_rows AS num_rows
  ,pg.attname AS column_name
  ,'' AS num_unique_values
FROM svv_table_info ti
INNER JOIN pg_attribute pg ON pg.attrelid = ti.table_id
WHERE ti.SCHEMA !~ '^information_schema|catalog_history|pg_'

-- Count of stored procedures and functions
SELECT database_name, schema_name, function_type, count(*)
FROM SVV_REDSHIFT_FUNCTIONS
WHERE schema_name !~ '^information_schema|catalog_history|pg_'
GROUP BY database_name, schema_name, function_type

-- List of stored procedures
SELECT
  database_name
  ,schema_name
  ,function_type
  ,function_name
FROM SVV_REDSHIFT_FUNCTIONS
WHERE schema_name NOT IN ('information_schema', 'pg_catalog')

-- Count of external tables/views in Redshift
SELECT
  redshift_database_name
  ,schemaname
  ,tabletype
  ,count(tablename)
FROM SVV_EXTERNAL_TABLES
GROUP BY
  redshift_database_name
  ,schemaname
  ,tabletype

-- List of external tables
SELECT
  redshift_database_name
  ,schemaname
  ,tabletype
  ,tablename
FROM SVV_EXTERNAL_TABLES

-- Total I/O usage by day
SELECT trunc(start_time) AS rundate, sum(local_read_IO + remote_read_IO) AS totalioreads
FROM SYS_QUERY_DETAIL
WHERE trunc(start_time) between '2024-03-28' and '2024-04-03'
GROUP BY trunc(start_time)

Note: Change the date range ('2024-03-28' to '2024-04-03') to a period of high usage spanning at least 30 days.

-- Database volume

Download: v_space_used_per_tbl

Create the above view in any schema and then execute the below query to pull the details.

SELECT
  dbase_name
  ,schemaname
  ,SUM(megabytes) AS total_mb
FROM public.v_space_used_per_tbl
GROUP BY
  dbase_name
  ,schemaname

-- Distinct application names
SELECT DISTINCT application_name
FROM pg_catalog.stl_connection_log
WHERE (recordtime between '2024-05-01' and '2024-05-30')
AND application_name IS NOT NULL

Note: Change the date range ('2024-05-01' to '2024-05-30') to a period of high usage spanning at least 30 days.

-- Distinct client IDs
SELECT DISTINCT client_id
FROM pg_catalog.stl_network_throttle
WHERE (log_time between '2024-05-01' and '2024-05-30')

Note: Change the date range ('2024-05-01' to '2024-05-30') to a period of high usage spanning at least 30 days.

-- The date range is subject to change as needed for the query log assessment (15 days / 1 month / 6 months, whatever is possible for extraction).

Stored Procedure and Function DDL Extraction

Stored Procedure DDLs

There are two options for extracting Redshift stored procedure DDL scripts: a Python script, or a shell script that uses psql.

Note:

To use the Python script, you need the redshift-connector module. For the shell script, you need a psql client.

Stored Procedure DDLs using Python

Copy the attached Python script to your local machine or server and run it from there to get the stored procedure DDLs.

# Connect to the cluster
import redshift_connector

conn = redshift_connector.connect(
    host='redshift_host',
    database='database',
    port=5439,
    user='username',
    password='password'
)

output_file = r'/path/sp_ddl_output.sql'

# Create a Cursor object
cursor = conn.cursor()

# Query the catalog for all stored procedures
res = cursor.execute("""select
    database_name || '.' ||
    schema_name || '.' ||
    function_name || '(' ||
    argument_type || ')'
from SVV_REDSHIFT_FUNCTIONS
where schema_name not in ('information_schema', 'pg_catalog')
    AND function_type = 'STORED PROCEDURE'""").fetchall()

for value in res:
    print(value[0])
    res2 = cursor.execute("""show procedure """ + value[0]).fetchone()
    # Write the result to the file
    with open(output_file, 'a') as f:
        f.write('-- Procedure: ' + value[0] + '\n')
        f.write(str(res2[0]) + '\n\n')

Replace the placeholder connection details and output path with your own values. This script exports all the stored procedure DDLs. Run it for every database in the Redshift cluster.

The Python script is also attached as a text file.

Download: proc_ddl_export

Stored Procedure DDLs using Shell Script

Copy the attached shell script to your local machine or server and run it from there to get the stored procedure DDLs.

Note:

Run the script on a machine that has the psql client installed.

#!/bin/bash

# Database connection parameters
export PGHOST="redshift-host"
export PGDATABASE="database"
export PGPORT=5439
export PGUSER="username"
export PGPASSWORD="password"

# Output file
OUTPUT_FILE="/path/output/sp_ddl_output.sql"

# Query to get stored procedure names (with normalized argument types)
QUERY="select database_name ||'.'|| schema_name ||'.'|| function_name ||'('|| replace(replace(replace(replace(replace(argument_type, ' ', ''), 'charactervarying','varchar'),'timestampwithouttimezone','timestamp'),'binaryvarying','varbyte'),'timestampwithtimezone','timestamptz') ||')' from SVV_REDSHIFT_FUNCTIONS where schema_name not in ('information_schema','pg_catalog') AND function_type = 'STORED PROCEDURE'"

# Execute the query and save the result in a variable
SPS=$(psql -AXqtc "$QUERY")

# Loop over the stored procedures
for SP in $SPS
do
    # Query to get the procedure definition
    echo "Exporting Procedure DDL: $SP..."
    QUERY="SHOW PROCEDURE $SP;"

    # Execute the query and save the result in a variable
    SP_DEFINITION=$(psql -AXqtc "$QUERY")

    # Write the procedure name and its definition to the output file
    echo "-- Stored Procedure: $SP" >> $OUTPUT_FILE
    echo "$SP_DEFINITION" >> $OUTPUT_FILE
    echo "" >> $OUTPUT_FILE
done

Replace the placeholder connection details and output path with your own values. This script exports all the stored procedure DDLs. Run it for every database in the Redshift cluster.

The shell script is also attached as a text file.

Download: sp_ddl

User Defined Function DDLs

There are two options for extracting user-defined function DDL scripts: the AWS Redshift console query editor, or any Redshift client that can execute the query and export the results.

Follow the steps below to export the DDL scripts using the AWS Query Editor.

  1. Click Amazon Redshift from the AWS Console.
  2. Select the Redshift cluster from the Cluster Overview tab.
  3. Click Query Data (highlighted in orange) on the right side.
  4. Click Query in query editor.
  5. The Query Editor opens, where the attached procedure can be compiled along with the other steps mentioned below.
  6. Execute the SELECT statement. When the query execution is complete, export the result as CSV.

To export the required DDLs, refer to the attached script.

Note:

Execute the query using superuser credentials; it can be run against any schema.

Download: generate_udf_ddls
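For the second option (any Redshift client), one possible approach is to run the attached query with psql and export the result. This is an example only, assuming the attached script is saved as generate_udf_ddls.sql and the PG* connection variables are exported as in the shell scripts above.

# Example only -- execute the attached UDF DDL query and export the result.
psql -AXqt -F '|' -f generate_udf_ddls.sql -o udf_ddls.csv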

Other Scripts and Artifacts

Copy any other scripts, such as DML scripts, from your Git instance and share them with the LeapLogic team to produce more extensive insights.


Getting Help

Contact LeapLogic technical support at info@leaplogic.io.

