TSQL Minimum Result Set

Creating a Stored Procedure that Only Returns Rows When a Minimum Row Count is Met

When working with a medical/clinical research de-identified dataset, we often want to allow researchers to freely explore the de-identified data across many tables. However, when a de-identified query returns too few results, there is a concern of possible re-identification.

For example, consider the query:

select * from healthcare_data where diagnosis = 'something rare' and zipcode = 'some place small'

If such a query returns only 1 or 2 rows, then it might be possible to re-identify the patient. To avoid this issue systems will typically just show the text "less than 11 results" when the underlying query has 10 or fewer results.

The challenge is how to address this issue when the researcher is just using SQL Server Management Studio (SSMS) to run queries, and not a third-party research/query tool.

Solution

The stored procedure shown below will allow a user to pass any query text to the stored procedure. If the result set contains 10 or fewer rows the text "RESULT SET CONTAINS LESS THAN 11 ROWS", otherwise the full query results are returned.

Note the naming conventions and account configuation below may vary based on your orginization's policies.

All syntax is TSQL, specific to SQL Server.

Security Configuration

Consider a database named "ResearchDB", containg many tables of de-identifed research data.

Create a SQL User named db_read_only that has db_datareader role on ResearchDB. This could be an Active Directory account too. Make sure the password is sufficiently complex, as it will never be entered by a user and only used by the stored procedure.

The researcher will presumably have an Active Directory SQL account. Assign this user to the public role on ResearchDB, and then only grant them execute permission to the EXEC_SQL procedure after it is created in the next step. Note that they do not have read access to ResearchDB.

Stored Procedure

Create the stored procedure below named EXEC_SQL in ResearchDB.

Note that the procedure has the "WITH EXECUTE AS" clause, which allows it read access to ResearchDB, even though the caller only has execute access to the stored procedure.

The stored procedure defines a global temp table (global for scope), modifies the incoming SQL statement to insert into the temp table, counts the rows in the temp table, and only ouputs the rows if the count is sufficient.

SQL statements other than "select" are prohibited by the fact user db_read_only can only perform database reads.

CREATE PROCEDURE [dbo].[EXEC_SQL](@sql varchar(max)) WITH EXECUTE AS 'db_read_only'
AS
BEGIN  
 BEGIN TRY    
 SET NOCOUNT ON;  
 declare @sqlinternal as nvarchar(max);  
 declare @sqlcount as nvarchar(max);            
 --define global temp table, with arbitrary name  
 declare @temp as varchar(32) = '##TEMP' + replace((select NEWID()), '-', '')  
 --insert to global temp table  
 select @sqlinternal = 'with tmp as (' + @sql + ') select * into ' + @temp + ' from tmp'  
 exec(@sqlinternal);       
 --get row count from temp table  
 select @sqlcount = N'select @countout=count(*) from ' + @temp;  
 declare @count as int;  
 declare @Params nvarchar(100) = N'@countout int output'  
 exec sp_executesql @sqlcount, @Params, @countout=@count output  
 --provide data only if rc > 10  
 if (@count > 10)    
 begin  
  exec(@sql);  
 end  
 else  
 begin  
  print 'RESULT SET CONTAINS LESS THAN 11 ROWS'  
 end  
 --drop global temp table  
 exec('drop table ' + @temp);  
 END TRY   
 BEGIN CATCH   
  DECLARE @ErrorMsg AS VARCHAR(MAX) = ERROR_MESSAGE();  
  PRINT @ErrorMsg;  
 END CATCH  
END

Limitations

It's possible the user could generate an input SQL statement that is not compatible with the CTE wrapper used in this logic.

Notably, writing SQL as plain text without an editor is not a preferred method.

Conclusion

There are use cases where we only want to return the results of a SQL select if a sufficient number of rows are present, in order to prevent possible re-identification of data. This can be achieved by implementing a stored procedure to process the input SQL and modify the output as needed.