When working with a medical/clinical research de-identified dataset, we often want to allow researchers to freely explore the de-identified data across many tables. However, when a de-identified query returns too few results, there is a concern of possible re-identification.
For example, consider the query:
select * from healthcare_data where diagnosis = 'something rare' and zipcode = 'some place small'
If such a query returns only 1 or 2 rows, then it might be possible to re-identify the patient. To avoid this issue systems will typically just show the text "less than 11 results" when the underlying query has 10 or fewer results.
The challenge is how to address this issue when the researcher is just using SQL Server Management Studio (SSMS) to run queries, and not a third-party research/query tool.
The stored procedure shown below will allow a user to pass any query text to the stored procedure. If the result set contains 10 or fewer rows the text "RESULT SET CONTAINS LESS THAN 11 ROWS", otherwise the full query results are returned.
Note the naming conventions and account configuation below may vary based on your orginization's policies.
All syntax is TSQL, specific to SQL Server.
Consider a database named "ResearchDB", containg many tables of de-identifed research data.
Create a SQL User named db_read_only that has db_datareader role on ResearchDB. This could be an Active Directory account too. Make sure the password is sufficiently complex, as it will never be entered by a user and only used by the stored procedure.
The researcher will presumably have an Active Directory SQL account. Assign this user to the public role on ResearchDB, and then only grant them execute permission to the EXEC_SQL procedure after it is created in the next step. Note that they do not have read access to ResearchDB.
Create the stored procedure below named EXEC_SQL in ResearchDB.
Note that the procedure has the "WITH EXECUTE AS" clause, which allows it read access to ResearchDB, even though the caller only has execute access to the stored procedure.
The stored procedure defines a global temp table (global for scope), modifies the incoming SQL statement to insert into the temp table, counts the rows in the temp table, and only ouputs the rows if the count is sufficient.
SQL statements other than "select" are prohibited by the fact user db_read_only can only perform database reads.
CREATE PROCEDURE [dbo].[EXEC_SQL](@sql varchar(max)) WITH EXECUTE AS 'db_read_only' AS BEGIN BEGIN TRY SET NOCOUNT ON; declare @sqlinternal as nvarchar(max); declare @sqlcount as nvarchar(max); --define global temp table, with arbitrary name declare @temp as varchar(32) = '##TEMP' + replace((select NEWID()), '-', '') --insert to global temp table select @sqlinternal = 'with tmp as (' + @sql + ') select * into ' + @temp + ' from tmp' exec(@sqlinternal); --get row count from temp table select @sqlcount = N'select @countout=count(*) from ' + @temp; declare @count as int; declare @Params nvarchar(100) = N'@countout int output' exec sp_executesql @sqlcount, @Params, @countout=@count output --provide data only if rc > 10 if (@count > 10) begin exec(@sql); end else begin print 'RESULT SET CONTAINS LESS THAN 11 ROWS' end --drop global temp table exec('drop table ' + @temp); END TRY BEGIN CATCH DECLARE @ErrorMsg AS VARCHAR(MAX) = ERROR_MESSAGE(); PRINT @ErrorMsg; END CATCH END
It's possible the user could generate an input SQL statement that is not compatible with the CTE wrapper used in this logic.
Notably, writing SQL as plain text without an editor is not a preferred method.
There are use cases where we only want to return the results of a SQL select if a sufficient number of rows are present, in order to prevent possible re-identification of data. This can be achieved by implementing a stored procedure to process the input SQL and modify the output as needed.