SQL TABLE PI DATA CLEANING SOLUTION (PART 2)

In Part One I set up the solution that cleans our data. In this part I will expand the solution to catalog tables and identify columns that may contain PI data and activate ones I want cleaned when a database is RESTORED to QA/DEV.

I installed the following FUNCTION on all my production databases, so the cleaning solution would already have access to it once RESTORED. Alternatively, I could just install it on the databases AFTER RESTORE each week, but that would mean another weekly task to perform.


-- ================  CREATE ON ALL DATRABASES ========================

SET ANSI_NULLS ON
GO
SET QUOTED_IDENTIFIER ON
GO
-- =============================================
-- Author:		David Speight
-- Create date: 20230510
-- Description:	Returns the Primary Key COLUMN NAME of a table in the current database.
-- =============================================
CREATE FUNCTION dbo.TablePrimaryKeyColumnNameGet
(
@TableName varchar(250),
@SchemaName varchar(100)
)
RETURNS varchar(250)
AS
BEGIN
DECLARE @ColumnIDName varchar(250);

SELECT @ColumnIDName = c.name
FROM sys.tables t 
INNER JOIN sys.schemas s on t.schema_id = s.schema_id
INNER JOIN sys.indexes ix on t.object_id = ix.object_id
INNER JOIN sys.index_columns ixc ON ix.object_id = ixc.object_id and ix.index_id = ixc.index_id
INNER JOIN sys.columns c ON ixc.object_id = c.object_id  and ixc.column_id = c.column_id
WHERE ix.type > 0 and ix.is_primary_key = 1
	and t.name = @TableName
	and s.name = @SchemaName;

RETURN @ColumnIDName;
END

GO

I added this UNIQUE INDEX to the CleaningColumns table, just to make sure I don’t get duplicates.


-- Ensure only ONE entry for each COLUMN is allowed by setting up this UNIQUE index.

CREATE UNIQUE NONCLUSTERED INDEX [IX_Unique_ColumnCombo] ON [dbo].[CleaningColumns]
(
	[DatabaseName] ASC,
	[SchemaName] ASC,
	[TableName] ASC,
	[ColumnName] ASC
)WITH (PAD_INDEX = OFF, STATISTICS_NORECOMPUTE = OFF, SORT_IN_TEMPDB = OFF, IGNORE_DUP_KEY = OFF, DROP_EXISTING = OFF, ONLINE = OFF, ALLOW_ROW_LOCKS = ON, ALLOW_PAGE_LOCKS = ON, OPTIMIZE_FOR_SEQUENTIAL_KEY = OFF) ON [PRIMARY]
GO

TEMPORAL TABLES

IN Part 1, I mentioned how I have to handle TEMPORAL tables differently. I want to make sure I’ve captured a list of all temporal tables in each database. I’ve created this StoredProc to do that.


/****** Object:  StoredProcedure [dbo].[TemporalListAddNew]  ******/
SET ANSI_NULLS ON
GO
SET QUOTED_IDENTIFIER ON
GO
ALTER     PROCEDURE [dbo].[TemporalListAddNew]
AS
BEGIN

	-- ADD NEW
	DECLARE @iCount INT = 0;
	DECLARE @tSql varchar(4000);
	DECLARE @DatabaseName varchar(250);
	SELECT @DatabaseName = ''

	DECLARE @tDatabases AS TABLE (	[ID] [int] IDENTITY(1,1) NOT NULL,
		[DatabaseName] [varchar](200) NULL);

	INSERT INTO @tDatabases (DatabaseName)
	SELECT DISTINCT DatabaseName
	FROM dbo.CleaningColumns
	WHERE  IsActive=1;

	SELECT @iCount = MAX(ID) FROM @tDatabases;

	WHILE @iCount > 0
	BEGIN
		SELECT @DatabaseName = DatabaseName
		FROM @tDatabases
		WHERE ID = @iCount;

		SELECT @tSql = CONCAT(
		'INSERT INTO dbo.CleaningTemporal (DatabaseName,TableSchema,TableName,HistorySchema,HistoryTableName,IsActive)
		SELECT ''',@DatabaseName,''', schema_name(t.schema_id), t.name, schema_name(h.schema_id), h.name,1
		FROM ',@DatabaseName,'.sys.tables t inner join ',@DatabaseName,'.sys.tables h on t.history_table_id = h.object_id
		WHERE t.temporal_type = 2
		and not exists (SELECT * 
				FROM dbo.CleaningTemporal
				WHERE	DatabaseName = ''',@DatabaseName,'''
				AND	TableSchema = schema_name(t.schema_id)
				AND TableName = t.name
				AND HistorySchema = schema_name(h.schema_id)
				AND HistoryTableName = h.name)')
		EXEC(@tSql)

		SELECT @iCount = @iCount - 1, @DatabaseName = '', @tSql = '';
	END;

	SELECT * FROM dbo.CleaningTemporal;
END

Now I will create a table where I can store a list of objects I may wish to validate. This will basically be a catalog of all tables and columns in each database on my instance.


SET ANSI_NULLS ON
GO

SET QUOTED_IDENTIFIER ON
GO

CREATE TABLE [dbo].[Validator](
	[ID] [int] IDENTITY(1,1) NOT NULL,
	[DatabaseName] [varchar](250) NULL,
	[SchemaName] [varchar](150) NULL,
	[TableName] [varchar](250) NULL,
	[ColumnName] [varchar](250) NULL,
	[ObjectTypeName] [varchar](50) NULL,
	[Status] [varchar](50) NULL,
	[DataType] [varchar](50) NULL,
	[GeneratorDataType] [varchar](50) NULL,
	[IsActive] [bit] NULL,
	[dtCreated] datetime NULL,
	[dtLastModified]  datetime NULL,
 CONSTRAINT [PK_Validator] PRIMARY KEY CLUSTERED 
(
	[ID] ASC
)WITH (PAD_INDEX = OFF, STATISTICS_NORECOMPUTE = OFF, IGNORE_DUP_KEY = OFF, ALLOW_ROW_LOCKS = ON, ALLOW_PAGE_LOCKS = ON, OPTIMIZE_FOR_SEQUENTIAL_KEY = OFF) ON [PRIMARY]
) ON [PRIMARY]
GO

ALWAYS EXCLUDE

UPDATE: The code in GIT will include my most recent addition of a VALIDATOR ALWAYS IGNORE table. This will allow me to use the PREVIEW and GENERATOR to REFRESH the column list and ALWAYS ignore objects in this table, so they won’t have to be reevaluated each time.


/****** Object:  Table [dbo].[ValidatorAlwaysIgnore] ******/
SET ANSI_NULLS ON
GO

SET QUOTED_IDENTIFIER ON
GO

CREATE TABLE [dbo].[ValidatorAlwaysIgnore](
	[ID] [int] IDENTITY(1,1) NOT NULL,
	[DatabaseName] [varchar](250) NULL,
	[SchemaName] [varchar](150) NULL,
	[TableName] [varchar](250) NULL,
	[ColumnName] [varchar](250) NULL,
	[dtCreated] [datetime] NULL,
 CONSTRAINT [PK_ValidatorAlwaysIgnore] PRIMARY KEY CLUSTERED 
(
	[ID] ASC
)WITH (PAD_INDEX = OFF, STATISTICS_NORECOMPUTE = OFF, IGNORE_DUP_KEY = OFF, ALLOW_ROW_LOCKS = ON, ALLOW_PAGE_LOCKS = ON, OPTIMIZE_FOR_SEQUENTIAL_KEY = OFF) ON [PRIMARY]
) ON [PRIMARY]
GO

Next, I will populate the VALIDATOR table with a list of databases and mark EXCLUDE for the ones that do not have PI data, or I don’t want cleaned. MY EXCLUDE LIST identifies ALL Third Party application databases as well, such as Jira, Hangfire, etc. In this sample, I’ve also marked to exclude any database used by AWS RDS, SSRS, and GDPR as well.


INSERT INTO dbo.Validator
([DatabaseName], [SchemaName], [TableName], [ColumnName], [ObjectTypeName], [Status], [DataType], [IsActive], [dtCreated], [dtLastModified])
select name, null, null, null, 'database', 'parent', null, 1, getutcdate(), getutcdate()
from sys.databases;

UPDATE dbo.Validator
SET status = 'exclude'
WHERE DatabaseName in ('master', 'tempdb', 'model','msdb', 'DB_Clean_DataGeneration','Audit','AuditDW','Hangfire','Jiradb','Maintenance','ServerAnalysis','ServerAnalysisDW','Test','ServerAnalysisDWTestData','UPIC_Mobile','SSISDB','ReportServer','ReportServerTempDB')
or DatabaseName  like 'GDPR%' or DatabaseName  like 'rds%';

Here I will populate the VALIDATOR with a list of TABLES from the databases I just cataloged.


DECLARE @tblLoop INT = 0;
DECLARE @tblSql nvarchar(3000);
DECLARE @dbTable as TABLE (ID INT IDENTITY(1,1), DatabaseName VARCHAR(250))

INSERT INTO @dbTable
SELECT DatabaseName
FROM dbo.Validator
WHERE ObjectTypeName = 'database' and IsActive=1 and Status != 'exclude'

SELECT @tblLoop =  MAX(ID) FROM @dbTable;

WHILE @tblLoop > 0
BEGIN
	SELECT @tblSql = CONCAT('INSERT INTO dbo.Validator
		([DatabaseName], [SchemaName], [TableName], [ColumnName], [ObjectTypeName], [Status], [DataType], [IsActive], [dtCreated], [dtLastModified])
	SELECT ''',DatabaseName,''', sc.name, tbl.name, null, ''table'',''parent'',null,1,getutcdate(),getutcdate()  
	FROM  ',DatabaseName,'.sys.tables tbl 
	inner join ',DatabaseName,'.sys.schemas sc on tbl.schema_id = sc.schema_id
	WHERE NOT EXISTS (
		SELECT ''',DatabaseName,''', sc.name, tbl.name, null, ''table'' 
		FROM  DB_Clean_DataGeneration.dbo.Validator v
		INNER JOIN ',DatabaseName,'.sys.tables tbl on  v.TableName = tbl.name
		INNER JOIN ',DatabaseName,'.sys.schemas sc on v.SchemaName = sc.name
		WHERE v.DatabaseName = ''',DatabaseName,'''														
												)								
FROM @dbTable
WHERE ID = @tblLoop;

BEGIN TRY
	EXEC(@tblSql)
END TRY
BEGIN CATCH
	PRINT @tblSql;
END CATCH;

-- reset variables
SELECT	@tblLoop = @tblLoop - 1, @tblSql ='';
END;

To populate the VALIDATOR with a list of COLUMNS from the tables I wish to consider, I’ve written a Stored PROC that will handle this. My proc may also be reused in the future to identify and catalog new columns I’ve added to the databases.


/****** Object:  StoredProcedure [dbo].[GeneratorUpdateBaseColumns]    Script Date: 5/19/2023 8:48:37 AM ******/
SET ANSI_NULLS ON
GO
SET QUOTED_IDENTIFIER ON
GO

ALTER   PROCEDURE [dbo].[GeneratorUpdateBaseColumns] 
AS
BEGIN 

/*
Remove columns where the database is exclude
*/

DELETE FROM dbo.Validator
WHERE ObjectTypeName in ('table','column')
and DatabaseName in (
	SELECT DatabaseName 
	FROM dbo.Validator 
	WHERE ObjectTypeName = 'database' and status = 'exclude');

/*
Check for new TABLES from our ACTIVE databases we may wish to consider
*/

DECLARE @tblLoop INT = 0;
DECLARE @tblSql nvarchar(3000);
DECLARE @dbTable as TABLE (ID INT IDENTITY(1,1), DatabaseName VARCHAR(250))

INSERT INTO @dbTable
SELECT DatabaseName
FROM dbo.Validator
WHERE ObjectTypeName = 'database' and IsActive=1 and Status != 'exclude'

SELECT @tblLoop =  MAX(ID) FROM @dbTable;

WHILE @tblLoop > 0
BEGIN
SELECT @tblSql = CONCAT('INSERT INTO dbo.Validator
([DatabaseName], [SchemaName], [TableName], [ColumnName], [ObjectTypeName], [Status], [DataType], [IsActive], [dtCreated], [dtLastModified])
	SELECT ''',DatabaseName,''', sc.name, tbl.name, null, ''table'',''parent'',null,1,getutcdate(),getutcdate()  
	FROM  ',DatabaseName,'.sys.tables tbl 
	inner join ',DatabaseName,'.sys.schemas sc on tbl.schema_id = sc.schema_id
	left join DB_Clean_DataGeneration.dbo.Validator v on tbl.name = v.TableName and sc.name = v.SchemaName
	WHERE v.TableName is null;')
FROM @dbTable
WHERE ID = @tblLoop;

BEGIN TRY
	EXEC(@tblSql)
END TRY
BEGIN CATCH
	PRINT @tblSql;
END CATCH;

-- reset variables
SELECT	@tblLoop = @tblLoop - 1, @tblSql ='';
END;

/*
Populate the VALIDATOR with a list of COLUMNS from the tables we wish to consider
*/

DECLARE @cLoop INT = 0;
DECLARE @cSql nvarchar(3000);
DECLARE @cdbTable as TABLE (ID INT IDENTITY(1,1), DatabaseName VARCHAR(250))

INSERT INTO @cdbTable
SELECT DatabaseName
FROM dbo.Validator
WHERE ObjectTypeName = 'database' and IsActive=1 and Status != 'exclude'

SELECT @cLoop =  MAX(ID) FROM @cdbTable;

WHILE @cLoop > 0
BEGIN
	SELECT @cSql =CONCAT('INSERT INTO dbo.Validator
		([DatabaseName], [SchemaName], [TableName], [ColumnName], [ObjectTypeName], [Status], [DataType], [IsActive], [dtCreated], [dtLastModified])
		SELECT ''',DatabaseName,''', sc.name,  tbl.name, c.name,  ''column'',''standard'',
			(CASE	WHEN t.Name in (''decimal'',''numeric'') THEN CONCAT(t.Name,''('',c.max_length,'','', c.precision,'')'')
			WHEN t.Name in (''sql_variant'',''hierachyid'',''varbinary'',''varchar'',''binary'',''char'',''nvarchar'',''nchar'') THEN CONCAT(t.Name,''('',REPLACE(c.max_length,''-1'',''max''),'')'')
			ELSE t.Name END),1,getutcdate(),getutcdate()
		FROM ',DatabaseName,'.sys.tables tbl
		INNER JOIN ',DatabaseName,'.sys.schemas sc ON Tbl.schema_id = Sc.schema_id
		INNER JOIN ',DatabaseName,'.sys.columns c ON tbl.object_id = c.object_id
		INNER JOIN ',DatabaseName,'.sys.types t ON c.user_type_id = t.user_type_id
		LEFT JOIN DB_Clean_DataGeneration.dbo.tfnValidatorTableExcluded(''',DatabaseName,''') fn ON fn.SchemaName = sc.name and fn.TableName = tbl.name
		LEFT JOIN DB_Clean_DataGeneration.dbo.Validator v  ON v.SchemaName = sc.name and v.TableName = tbl.name and v.ColumnName = c.name
		WHERE fn.TableName IS NULL
		and v.ColumnName is null;')
	FROM @cdbTable
	WHERE ID = @cLoop;

	BEGIN TRY
		EXEC(@cSql)
	END TRY
	BEGIN CATCH
		PRINT @cSql;
	END CATCH;

	-- reset variables
	SELECT	@cLoop = @cLoop - 1, @cSql ='';
END;

END
GO

(OPTIONAL) In my case I began this process as described in PART ONE and made manually entries for cleaning. So when this is a “build upon step”.)” I will need to identify the manual entries I already have, in the new catalog.


SELECT *
FROM dbo.CleaningColumns c
inner join dbo.Validator v on C.DatabaseName = v.DatabaseName 
	AND c.SchemaName = v.SchemaName
	AND c.TableName = v.TableName
	AND c.ColumnName = v.ColumnName

UPDATE v
SET Status = (Case when v.IsActive = 1 then 'cleaning' else 'clean-inactive' end)
FROM dbo.Validator  AS v
INNER JOIN dbo.CleaningColumns AS c
ON v.DatabaseName  = c.DatabaseName
AND v.SchemaName = C.SchemaName
AND v.TableName = c.TableName
AND v.ColumnName = c.ColumnName

Now I want to IDENTIFY possible COLUMN NAMES that could contain PI data. I have written a stored proc to help PREVIEW the items that might need to be added to the cleaning soilution.

CREATE OR ALTER PROCEDURE dbo.GeneratePreview
AS
BEGIN

	DECLARE @LookupType as Table (ID INT NOT NULL IDENTITY(1,1), DataTypeName VARCHAR(50), Keyword VARCHAR(50), IsInt bit);

	INSERT INTO @LookupType
	select c.DataTypeName, s.value, c.IsInt
	FROM dbo.CleaningDataType c
	CROSS APPLY string_split(c.keywordList,',') s 
	WHERE c.IsActive=1 and LEN(S.value) > 1;

	SELECT *
	FROM 
		(
		SELECT 
			v.ID, v.DatabaseName,v.SchemaName,v.TableName,v.ColumnName,v.ObjectTypeName,v.Status,
			v.DataType, v.GeneratorDataType, l.DataTypeName, l.Keyword, l.Isint,
		ROW_NUMBER() OVER(PARTITION BY v.DatabaseName,v.SchemaName,v.TableName,v.ColumnName ORDER BY v.DatabaseName,v.SchemaName,v.TableName,v.ColumnName) As ROWID
		FROM dbo.Validator v
		INNER JOIN @LookupType l 
			ON (v.ColumnName LIKE l.Keyword+'%')
			and (l.isint = 1 or LEFT(v.datatype,7) NOT IN (
'bigint','binary','bit','decimal','float','geograp','geometr','hierarc','image','int','money','numeric','real','smallin','smallmo','sql_var','tinyint'
		)) 
		WHERE v.IsActive=1 and v.ObjectTypeName='column' and v.Status = 'standard'
		) q
	WHERE q.Rowid = 1
	ORDER BY DatabaseName,SchemaName,TableName,ColumnName;

END
GO

I then REVIEW and EXCLUDE ANY COLUMNS NOT NEEDED.

EXEC dbo.GeneratePreview;

From the results, I identify the items I DO NOT want cleaned and then I add these to my catalog and mark them, using a stored proc, where I can enter a string of ID numbers found from the results.



CREATE OR ALTER PROCEDURE dbo.ColumnsToIgnore
@IdList varchar(8000)
AS
BEGIN

	UPDATE dbo.Validator
	SET Status = 'ignore'
	WHERE ID IN (
				SELECT value
				FROM string_split(@IdList,','));

END
GO


EXEC dbo.ColumnsToIgnore (a,b,c,d,....);  -- Where a,b,c are ID numbers.

ACTIVATE PREVIEW

I rerun EXEC dbo.GeneratePreview and re-verify the list is correct. Then, I run this stored procedure that executes similar code, but actually transfers the object to the REAL cleaning table.


CREATE OR ALTER PROCEDURE dbo.GenerateSuggestions
AS
BEGIN


DECLARE @LookupType as Table (ID INT NOT NULL IDENTITY(1,1), DataTypeName VARCHAR(50), Keyword VARCHAR(50), IsInt bit);
DECLARE @StartID INT;
SELECT @StartID = MAX(ID)
FROM  dbo.CleaningColumns;

	INSERT INTO @LookupType
	select c.DataTypeName, s.value, c.IsInt
	FROM dbo.CleaningDataType c
	CROSS APPLY string_split(c.keywordList,',') s 
	WHERE c.IsActive=1 and LEN(S.value) > 1;

	INSERT INTO dbo.CleaningColumns
	([DatabaseName], [SchemaName], [TableName], [ColumnName], [GenderRefColumnName], [IDColumnName], [DataType], [IsActive], [UsesGenerationDb])
	SELECT q.DatabaseName,q.SchemaName,q.TableName,q.ColumnName,NULL,NULL,DataTypeName,1,1
	FROM 
		(
		SELECT 
			v.ID, v.DatabaseName,v.SchemaName,v.TableName,v.ColumnName,v.ObjectTypeName,v.Status,
			v.DataType, v.GeneratorDataType, l.DataTypeName, l.Keyword, l.Isint,
			ROW_NUMBER() OVER(PARTITION BY v.DatabaseName,v.SchemaName,v.TableName,v.ColumnName ORDER BY v.DatabaseName,v.SchemaName,v.TableName,v.ColumnName) As ROWID
		FROM dbo.Validator v
			INNER JOIN @LookupType l 
				ON (v.ColumnName LIKE l.Keyword+'%')
				and (l.isint = 1 or LEFT(v.datatype,7) NOT IN (
				'bigint','binary','bit','decimal','float','geograp','geometr','hierarc','image','int','money','numeric','real','smallin','smallmo','sql_var','tinyint'
				)) 
		WHERE v.IsActive=1 and v.ObjectTypeName='column' and v.Status = 'standard'
		) q
	-- filter out rows already in our table
	LEFT JOIN dbo.CleaningColumns c ON 
			q.DatabaseName = c.DatabaseName and
			q.SchemaName = c.SchemaName and
			q.TableName = c.TableName and
			q.ColumnName = c.ColumnName
	WHERE c.ColumnName IS NULL and q.Rowid = 1	
	ORDER BY DatabaseName,SchemaName,TableName,ColumnName;





		-- Lookup ID Column  (Mark missing/multiple CREATED WITHOUT ID COLUMN REVIEW)

		DECLARE @PkIdUpdateSql varchar(3000);
		DECLARE @iTableLoop int;

		SELECT @iTableLoop = 1 + COUNT(*) 
		FROM
			( SELECT COUNT(*) As TableCount from dbo.CleaningColumns where IDColumnName is null GROUP BY DatabaseName, SchemaName, TableName )q
		
		While @iTableLoop > 0
		BEGIN
			SELECT @PkIdUpdateSql = CONCAT(
					'UPDATE c
					SET IDColumnname = q.IDColumnname		
					FROM dbo.CleaningColumns c
					INNER JOIN 
					(SELECT ''',DatabaseName,''' as DatabaseName, ''',TableName,''' as TableName, ''',SchemaName,''' as SchemaName,
							ISNULL(',DatabaseName,'.dbo.TablePrimaryKeyColumnNameGet(''',TableName,''',''',SchemaName,'''),''ID'') as IDColumnname) q
					ON c.DatabaseName = q.DatabaseName and c.TableName = q.TableName and c.SchemaName = q.SchemaName
					')
			FROM dbo.CleaningColumns 
			WHERE IDColumnName is null;

			EXEC(@PkIdUpdateSql);
			SELECT @iTableLoop = @iTableLoop -1, @PkIdUpdateSql='';
		END;

		

	-- Lookup Gender Column  (Mark multiple CREATED BUT MULTI GENDER COLUMNS FOUND - REVIEW)
	UPDATE c
	SET GenderRefColumnName = z.GenderRefColumnName
	FROM dbo.CleaningColumns c
	INNER JOIN 
				(SELECT v.DatabaseName, v.SchemaName, v.TableName, v.ColumnName, 
					( CASE COUNT(*) WHEN 1 THEN v.ColumnName ELSE 'MULTI-FOUND' END) As GenderRefColumnName
				FROM dbo.Validator v
				INNER JOIN (
							SELECT DatabaseName, SchemaName, TableName 
							FROM dbo.CleaningColumns 
							WHERE (ColumnName like '%name%') and (GenderRefColumnName is null or LEN(GenderRefColumnName) < 2)
							GROUP BY DatabaseName, SchemaName, TableName) q
						ON v.DatabaseName = q.DatabaseName and  v.SchemaName = q.SchemaName and v.TableName = q.TableName
				WHERE V.ColumnName LIKE '%GENDER%' or V.ColumnName LIKE '%SEX%'
				GROUP BY v.DatabaseName, v.SchemaName, v.TableName, v.ColumnName) z
		ON c.DatabaseName = z.DatabaseName and  c.SchemaName = z.SchemaName and c.TableName = z.TableName
 WHERE (c.GenderRefColumnName is null or LEN(c.GenderRefColumnName) < 2)
	and (c.ColumnName like '%name%');
				
select * FROM dbo.CleaningColumns  WHERE  LEN(GenderRefColumnName) > 1;

	UPDATE v
	SET Status = (Case when v.IsActive = 1 then 'cleaning' else 'clean-inactive' end)
	FROM dbo.Validator  AS v
		INNER JOIN dbo.CleaningColumns AS c
			ON v.DatabaseName  = c.DatabaseName
			AND v.SchemaName = C.SchemaName
			AND v.TableName = c.TableName
			AND v.ColumnName = c.ColumnName;
	SELECT *
	FROM  dbo.CleaningColumns
	WHERE ID > @StartID
	ORDER BY DatabaseName, SchemaName, TableName, ColumnName;

END
GO

POST CLEANING STEPS

I have certain tables that need to be UPDATED to conform to our QA environment. Therefore, I created a POST table that can store Tsql code to run AFTER the data cleaning. One thing I do, is EXCLUDE all AUDIT tables and then use this feature to TRUNCATE those tables.

TABLE

PROCEDURE



USE [DB_Clean_DataGeneration]
GO
/****** Object:  StoredProcedure [dbo].[PostCleaning]    Script Date: 5/19/2023 11:18:35 AM ******/
SET ANSI_NULLS ON
GO
SET QUOTED_IDENTIFIER ON
GO

ALTER   PROCEDURE [dbo].[PostCleaning]
AS
BEGIN
	DECLARE @tSql varchar(max);
	DECLARE @iCount int = 0;
	DECLARE @isActive bit = 0;
	DECLARE @ParentAuditKey int;
	DECLARE @PkgGUID uniqueidentifier;

	SELECT @PkgGUID = NEWID();
	SELECT @iCount = MAX(ID) FROM dbo.PostCleaningCustomCode;

	INSERT INTO dbo.DimAudit ( [ParentAuditKey], [TableName], [PkgName], [PkgGUID], [ExecStartDT], [SuccessfulProcessingInd])
	VALUES (1, 'DATA CLEANING: PostClean-Start','POST CLEANING: PostClean-Start', @PkgGUID, getutcdate(), 'P');
	SELECT @ParentAuditKey = SCOPE_IDENTITY();
	UPDATE dbo.DimAudit  SET ParentAuditKey = @ParentAuditKey WHERE AuditKey = @ParentAuditKey;

	WHILE @iCount > 0
		BEGIN
			SELECT  @isActive = isActive, 
					@tSql = Tsql
			FROM	dbo.PostCleaningCustomCode
			WHERE	ID = @iCount;

			BEGIN TRY
				EXEC(@tSql);
			END TRY
			BEGIN CATCH
				PRINT @tSql;
				INSERT INTO dbo.DimAudit ( [ParentAuditKey], [TableName], [PkgName], [PkgGUID], [ExecStartDT], [SuccessfulProcessingInd])
				VALUES (@ParentAuditKey, 'DATA CLEANING: PostClean-Step',CONVERT(varchar(12),@iCount), @PkgGUID, getutcdate(), 'N');
			END CATCH;

			SELECT @iCount= @iCount-1, @isActive=0, @tSql=''
		END

		-- Success / Failure Actions
		IF ((SELECT COUNT(*) FROM dbo.DimAudit WHERE ParentAuditKey=@ParentAuditKey AND SuccessfulProcessingInd='N')=0)
		BEGIN
			UPDATE dbo.DimAudit  
			SET ExecStopDT = getutcdate(),SuccessfulProcessingInd = 'Y' 
			WHERE	(AuditKey = @ParentAuditKey );
			-- ELSE send email with Failed Steps table
		END
		ELSE
		BEGIN
			UPDATE dbo.DimAudit  
			SET ExecStopDT = getutcdate(),SuccessfulProcessingInd = 'N' 
			WHERE	(AuditKey = @ParentAuditKey );
		END

END
GO

SET UP AGNET JOB and TEST

Finally, I set up my SQL Agent job that will run AFTER the database RESTORE job completes.

I include the following steps in my Agent job.

  1. Restore Databases.
  2. EXEC dbo.DataCleaning;
  3. EXEC dbo.PostCleaning;
  4. Delete the Restore files.

GIT PROJECT

The above project can be found in my GIT repository as a Visual Studio (2019) project. You may still use the scripted objects (tables, views, sprocs) in the dbo folder even if you don’t use VS.

https://github.com/LoveTheSql/DB_Clean_DataGeneration/tree/main

End of post.