
SOP for Privacy Scan Tool Operations

Version History

Purpose

This Standard Operating Protocol (SOP) outlines the procedures for using the privacy scan tool to identify and manage potential privacy risks in datasets used within the CHORUS project. It is intended for project team members who are responsible for data privacy assessments, risk mitigation, and compliance with privacy standards.

Procedures

STEP 1: INSTALL PRIVACY SCAN TOOL

  • Ensure the necessary dependencies are installed:
    • Python 3.6 or higher.
    • Required Python libraries from requirements.txt in the repository.
    • Git for repository cloning.
  • Clone the Privacy Scan Tool repository from GitHub:
  • Install the dependencies by running:
    • pip install -r requirements.txt

STEP 2: PREPARE DATASETS FOR SCANNING

  • Ensure datasets are in a suitable format (CSV, JSON, or database connection).
  • Anonymize or pseudonymize sensitive fields if necessary before scanning.
  • Load the dataset into the Privacy Scan Tool’s input directory or configure a connection string for direct database access.

STEP 3: CONFIGURE THE SCAN TOOL

  • Adjust the tool’s configuration to match the dataset and privacy rules:
    • Modify the config.yaml file to set parameters such as:
      • Dataset path or database connection details.
      • Privacy thresholds (e.g., k-anonymity, l-diversity levels).
      • Fields to scan (specify sensitive fields to be evaluated).
  • Define custom privacy rules if needed by extending the rule set in rules.py.
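
To make the k-anonymity threshold mentioned above concrete, the following is a minimal sketch of how k could be measured for a set of quasi-identifier fields. It is not the tool's implementation: the function name `k_anonymity`, the field names, and the sample records are all hypothetical and chosen only for illustration.

```python
from collections import Counter


def k_anonymity(rows, quasi_identifiers):
    """Return the size of the smallest group of rows that share the same
    values for the given quasi-identifier fields (the dataset's k)."""
    if not rows:
        return 0
    groups = Counter(tuple(row[q] for q in quasi_identifiers) for row in rows)
    return min(groups.values())


# Hypothetical sample records, not taken from any CHORUS dataset.
rows = [
    {"zipcode": "770", "age_range": "30-39", "diagnosis": "A"},
    {"zipcode": "770", "age_range": "30-39", "diagnosis": "B"},
    {"zipcode": "772", "age_range": "40-49", "diagnosis": "A"},
    {"zipcode": "772", "age_range": "40-49", "diagnosis": "C"},
    {"zipcode": "772", "age_range": "40-49", "diagnosis": "A"},
]

k = k_anonymity(rows, ["zipcode", "age_range"])
print(k)  # 2: the smallest group (zipcode 770, ages 30-39) has two rows
```

A dataset would meet a configured threshold of k when this value is at least k; here a threshold of 3 would fail because one group contains only two records.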

STEP 4: EXECUTE THE PRIVACY SCAN

  • Run the tool with the following command:
    • python privacy_scan.py --config config.yaml
  • Monitor the output for real-time feedback on privacy vulnerabilities. The tool generates a detailed report highlighting any potential risks, categorized by severity (low, medium, high).

STEP 5: REVIEW AND INTERPRET RESULTS

  • Examine the generated privacy report, which includes:
    • Field Name: Identifies the dataset field evaluated.
    • Risk Level: Severity of the privacy risk (low, medium, high).
    • Privacy Violation Type: Indicates the type of privacy violation (e.g., re-identification risk, insufficient anonymization).
    • Recommendation: Suggested actions to mitigate the risk.

| Field Name | Risk Level | Violation Type | Recommendation |
|------------|------------|----------------|----------------|
| patient_id | High | Re-identification Risk | Apply pseudonymization or remove the field |
| zipcode | Medium | Granularity of Location Data | Aggregate data to 3-digit zip code |
| birthdate | High | Direct identifier | Use age range instead of exact birthdate |

STEP 6: MITIGATE PRIVACY RISKS

  • Apply the recommended mitigations to reduce the identified privacy risks:
    • Aggregate, pseudonymize, or anonymize sensitive fields.
    • Re-run the Privacy Scan Tool after applying mitigations to ensure risks have been addressed.
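
The three mitigations named in the example report (pseudonymizing an identifier, aggregating a zip code to three digits, and replacing a birthdate with an age range) could be sketched as below. This is an illustration under stated assumptions, not part of the Privacy Scan Tool: the function names, the 10-year band width, and the use of a keyed HMAC-SHA-256 digest as the pseudonym are choices made here for the example, and the secret key shown is a placeholder.

```python
import hashlib
import hmac
from datetime import date


def pseudonymize(value, secret_key):
    """Replace an identifier with a keyed SHA-256 digest (HMAC).
    The same input and key always produce the same pseudonym."""
    return hmac.new(secret_key, value.encode("utf-8"), hashlib.sha256).hexdigest()


def truncate_zip(zipcode):
    """Keep only the first three digits of a zip code."""
    return zipcode.strip()[:3]


def age_range(birthdate, today, width=10):
    """Convert a birthdate to an age band such as '30-39'."""
    # Subtract one year if this year's birthday has not happened yet.
    had_birthday = (today.month, today.day) >= (birthdate.month, birthdate.day)
    age = today.year - birthdate.year - (0 if had_birthday else 1)
    low = (age // width) * width
    return f"{low}-{low + width - 1}"


# Placeholder key for illustration only; a real key must be stored securely.
key = b"replace-with-a-site-specific-secret"
print(pseudonymize("MR001", key)[:12])
print(truncate_zip("77030"))                              # 770
print(age_range(date(1990, 6, 15), date(2024, 6, 14)))    # 30-39
```

Which of these transformations is appropriate depends on the dataset and on local policy; after applying them, the scan should be re-run as described above.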

STEP 7: DOCUMENT THE PROCESS AND RESULTS

  • Document each scan and the mitigations applied for future reference. Include:
    • Date of the scan.
    • Dataset description.
    • Privacy violations detected.
    • Actions taken to resolve the violations.
  • Store the report and documentation securely in a version-controlled repository:
    • GitHub: Upload the report to the privacy-scan-reports folder, using a branch named scan-report-[dataset-name] for version control.
    • Naming convention for the report should be PrivacyScanReport_DatasetName_MMDDYY.

STEP 8: SHARE THE PRIVACY REPORT WITH THE TEAM

  • Once the scan is complete and privacy risks have been mitigated, distribute the final privacy report to the designated team members for review:
    • Email: Share the report with the Data Privacy Lead (e.g., Jane Doe at jane.doe@organization.org).
    • GitHub: Commit and push the report to the repository for broader team access.

STEP 9: CONTINUOUS MONITORING AND RE-ASSESSMENT

  • Set a regular schedule for privacy scans based on data updates (e.g., monthly or quarterly).
  • Periodically review the scan tool configuration and update the privacy rules to align with evolving privacy standards and regulatory requirements (e.g., GDPR, HIPAA).

Reference Materials

  1. Privacy Scan Tool GitHub Repository
  2. GDPR Overview
  3. HIPAA Compliance Guide
  4. K-Anonymity

SOP for Privacy Scan Tool Operations

Version History

Purpose

This Standard Operating Protocol (SOP) outlines the detailed procedures for utilizing the privacy scan tool to effectively identify, assess, and manage potential privacy risks in datasets used within the CHORUS project. It is intended for project team members responsible for conducting data privacy assessments, implementing risk mitigation strategies, and ensuring compliance with relevant privacy standards and regulations. By adhering to this SOP, team members will be able to systematically approach privacy risk management, thereby safeguarding sensitive information and maintaining the integrity of the project's data handling processes.

Procedures Option 1 - Locally installed program

Step 1: Install Privacy Scan Tool

  • Ensure the necessary dependencies are installed:
    • Python 3.8 or higher.
    • Required Python libraries from requirements.txt in the repository.
    • Git for repository cloning.
  • Running the program in a virtual environment is strongly recommended.
  • Clone the Privacy Scan Tool repository from GitHub:
    git clone https://github.com/chorus-ai/ChoRUS_Privacy_Scan.git
    cd ChoRUS_Privacy_Scan
  • Install the dependencies by running:
    pip install -r requirements.txt
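
Before cloning and installing, the prerequisites listed in this step can be checked with a short script. The sketch below is an optional convenience, not part of the repository; the names `check_prerequisites` and `in_virtualenv` are hypothetical. It checks only the Python version, the presence of Git on the PATH, and whether a virtual environment is active.

```python
import shutil
import sys

MIN_PYTHON = (3, 8)  # minimum version stated in this step


def in_virtualenv():
    """True when the interpreter is running inside a venv."""
    return sys.prefix != sys.base_prefix


def check_prerequisites(version_info=sys.version_info, which=shutil.which):
    """Return a list of human-readable problems; empty means all checks passed."""
    problems = []
    if tuple(version_info[:2]) < MIN_PYTHON:
        found = ".".join(str(part) for part in version_info[:2])
        problems.append(f"Python {MIN_PYTHON[0]}.{MIN_PYTHON[1]}+ required, found {found}")
    if which("git") is None:
        problems.append("git not found on PATH")
    return problems


for problem in check_prerequisites():
    print("Problem:", problem)
if not in_virtualenv():
    print("Note: no virtual environment is active (recommended, not required).")
```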

Step 2: Prepare Datasets for Scanning

  • Ensure datasets are in a suitable format (CSV or a database connection).
  • Anonymize or pseudonymize sensitive fields if necessary before scanning.
  • Load the dataset into the Privacy Scan Tool’s input directory or configure a connection string for direct database access.

Step 3: Configure the Scan Tool

  • Adjust the tool’s configuration to match the dataset and privacy rules:
    • Modify the config.py file to set parameters such as:
      • Dataset path or database connection details.
      • The model location, if there is a model update.
      • Data sample size.

Step 4: Execute the Privacy Scan

  • Run the tool with the following command:
    python main.py
  • Monitor the output for real-time feedback on privacy vulnerabilities. The tool generates a detailed report with a risk score highlighting any potential risk. As an additional recommendation, consider setting a privacy risk threshold, for example, 0.95. However, each site should determine the best threshold for their needs and circumstances.
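
As an illustration of applying a site-chosen threshold, the sketch below filters columns by a risk score. It assumes the scores have already been read from the report into a dictionary; the function name, the dictionary layout, and the sample scores are hypothetical, and the tool's own report format should be checked before adapting this.

```python
def columns_above_threshold(scores, threshold=0.95):
    """Return the names of columns whose risk score meets the threshold."""
    return sorted(col for col, score in scores.items() if score >= threshold)


# Hypothetical scores for illustration only.
scores = {"patient_id": 0.99, "zipcode": 0.95, "notes": 0.40}
print(columns_above_threshold(scores))  # ['patient_id', 'zipcode']
```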

Step 5: Review and Interpret Results

  • Examine the generated privacy report, which includes:
    • Column: Identifies the dataset field evaluated.
    • Predicted Result: Severity of the privacy risk (1 for High risk).
    • Unique Values: The number of unique values in the sampled data.
    • Unique Ratio: The ratio of unique values to total values.
    • Value Counts: The top unique values and their counts, which is a preview of the detailed data.
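
| Column | Predicted Result | Unique Values | Unique Ratio | Value Counts |
|--------|------------------|---------------|--------------|--------------|
| patient_id | 1.0 | 500 | 1.0 | MR001:1,MR002:2 ... |
| zipcode | 1.0 | 100 | 0.2 | 233:20,772:15, |
| birthdate | 1.0 | 100 | 0.1 | Not Available |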
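
To clarify how the Unique Values, Unique Ratio, and Value Counts columns relate to the underlying data, here is a minimal sketch that computes the same three quantities for one column of sampled values. It is an interpretation of the column descriptions above, not the tool's code; `profile_column` and the sample values are hypothetical.

```python
from collections import Counter


def profile_column(values, top_n=2):
    """Summarize a column: distinct count, distinct/total ratio, top values."""
    counts = Counter(values)
    unique = len(counts)
    ratio = unique / len(values) if values else 0.0
    top = ",".join(f"{value}:{count}" for value, count in counts.most_common(top_n))
    return {"unique_values": unique, "unique_ratio": ratio, "value_counts": top}


# Hypothetical sample of 10 three-digit zip codes.
sample = ["233", "233", "233", "772", "772", "770", "771", "773", "774", "775"]
print(profile_column(sample))
# {'unique_values': 7, 'unique_ratio': 0.7, 'value_counts': '233:3,772:2'}
```

A ratio close to 1.0, as for patient_id in the table, means nearly every sampled value is distinct, which is typical of identifiers.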

Step 6: Mitigate Privacy Risks

  • Apply the recommended mitigations (decided locally by each site) to reduce the identified privacy risks:
    • Aggregate, pseudonymize, or anonymize sensitive fields.
    • Re-run the Privacy Scan Tool after applying mitigations to ensure risks have been addressed.

Step 7: Document the Process and Results

  • Document each scan and the mitigations applied for future reference. Include:
    • Date of the scan.
    • Dataset description.
    • Privacy violations detected.
    • Actions taken to resolve the violations.
  • Store the report and documentation securely in a version-controlled repository:
    • GitHub: Upload the report to the privacy-scan-reports folder, using a branch named scan-report-[dataset-name] for version control.
    • Naming convention for the report should be PrivacyScanReport_DatasetName_MMDDYY.
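
The naming conventions above can be generated rather than typed by hand, which avoids date-format mistakes. A minimal sketch follows; the helper names `report_name` and `branch_name` are hypothetical, and the dataset name is an example.

```python
from datetime import date


def report_name(dataset_name, scan_date):
    """Build PrivacyScanReport_DatasetName_MMDDYY for the given scan date."""
    return f"PrivacyScanReport_{dataset_name}_{scan_date.strftime('%m%d%y')}"


def branch_name(dataset_name):
    """Build the scan-report-[dataset-name] branch name."""
    return f"scan-report-{dataset_name}"


print(report_name("MIMIC", date(2024, 5, 14)))  # PrivacyScanReport_MIMIC_051424
print(branch_name("MIMIC"))                     # scan-report-MIMIC
```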

Step 8: Share the Privacy Report with the Team

  • Once the scan is complete and privacy risks have been mitigated, distribute the final privacy report to the designated team members for review:
    • Email: Share the report with the Data Privacy Lead (e.g., Luyao Chen at luyao.chen@uth.tmc.edu).
    • GitHub: Commit and push the report to the repository for broader team access.

Step 9: Continuous Monitoring and Re-Assessment

  • Set a regular schedule for privacy scans based on data updates (e.g., monthly or quarterly).
  • Periodically review the scan tool configuration and update the privacy rules to align with evolving privacy standards and regulatory requirements (e.g., GDPR, HIPAA).

Procedures Option 2 - Scan with a container

Alternatively, the tool can be run in a Docker container.

This section explains how to run the Privacy Scan Tool in a Docker container, including how to set up the config.json configuration file and which directories to mount as volumes.

Step 1: Prepare the data

  • Create a subfolder to hold the input CSV files (data_folder in the samples below).
  • Create a subfolder to hold the output (output in the samples below).
  • Configure config.json as described below.

The config.json file contains various settings for the Privacy Scan Tool. Below is an example configuration file:

{
  "available_dbs": {
    "PSQL_MIMIC": ["postgresql://userid:password@192.168.0.100:5432/mimic", "mimiciii"],
    "LOCAL_TEXT_FILES": "LOCAL_TEXT_FILES"
  },
  "text_file_location": "./data_folder",
  "output_folder": "./output",
  "result_file": "phi_scan_results.xls",

  "selected_db": "PSQL_MIMIC",
  "tables_to_scan": ["admissions", "patients"],

  "data_profile_sample_size": 1000,
  "PHI_SCAN_MODEL": "./phi_scan/XGBClassifier(V220240514).json"
}

Configuration Details

  • available_dbs: Lists the available databases.
    • PSQL_MIMIC: PostgreSQL database connection details and the database name.
    • LOCAL_TEXT_FILES: Placeholder for using local text files.
  • text_file_location: Directory location for local text files.
  • output_folder: Directory for the output results.
  • result_file: Name of the result file.
  • selected_db: Specifies the database to use (PSQL_MIMIC or LOCAL_TEXT_FILES).
  • tables_to_scan: Lists the tables or files to scan.
  • data_profile_sample_size: Specifies the sample size for data profiling.
  • PHI_SCAN_MODEL: Path to the PHI scan model.
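
A quick sanity check of config.json before starting a scan can catch typos early. The sketch below validates a configuration dictionary against the keys described above. Treating every listed key as required, and the specific checks performed, are assumptions made for this example; the tool itself may be more or less strict. The function name `validate_config` is hypothetical.

```python
import json

# Assumption: all keys shown in the example configuration are required.
REQUIRED_KEYS = {
    "available_dbs", "text_file_location", "output_folder", "result_file",
    "selected_db", "tables_to_scan", "data_profile_sample_size", "PHI_SCAN_MODEL",
}


def validate_config(config):
    """Return a list of problems found in the configuration (empty if none)."""
    errors = [f"missing key: {key}" for key in sorted(REQUIRED_KEYS - set(config))]
    selected = config.get("selected_db")
    if selected is not None and selected not in config.get("available_dbs", {}):
        errors.append(f"selected_db '{selected}' is not listed in available_dbs")
    if "tables_to_scan" in config and not config["tables_to_scan"]:
        errors.append("tables_to_scan is empty")
    size = config.get("data_profile_sample_size")
    if size is not None and (not isinstance(size, int) or size <= 0):
        errors.append("data_profile_sample_size must be a positive integer")
    return errors


example = json.loads("""
{
  "available_dbs": {"LOCAL_TEXT_FILES": "LOCAL_TEXT_FILES"},
  "text_file_location": "./data_folder",
  "output_folder": "./output",
  "result_file": "phi_scan_results.xls",
  "selected_db": "LOCAL_TEXT_FILES",
  "tables_to_scan": ["noshow.csv"],
  "data_profile_sample_size": 1000,
  "PHI_SCAN_MODEL": "./phi_scan/XGBClassifier(V220240514).json"
}
""")
print(validate_config(example))  # []
```

To check a file on disk, load it with `json.load` and pass the result to the same function.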

Configurations for Different Scenarios

Using Local Text Files

If you want to use local text files, modify the config.json as follows:

{
  "available_dbs": {
    "PSQL_MIMIC": ["postgresql://userid:password@192.168.0.100:5432/mimic", "mimiciii"],
    "LOCAL_TEXT_FILES": "LOCAL_TEXT_FILES"
  },
  "text_file_location": "./data_folder",
  "output_folder": "./output",
  "result_file": "phi_scan_results.xls",

  "selected_db": "LOCAL_TEXT_FILES",
  "tables_to_scan": ["noshow.csv"],

  "data_profile_sample_size": 1000,
  "PHI_SCAN_MODEL": "./phi_scan/XGBClassifier(V220240514).json"
}

Using PostgreSQL Database

If you want to use a PostgreSQL database, ensure the config.json is set as follows:

{
  "available_dbs": {
    "PSQL_MIMIC": ["postgresql://userid:password@192.168.0.100:5432/mimic", "mimiciii"],
    "LOCAL_TEXT_FILES": "LOCAL_TEXT_FILES"
  },
  "text_file_location": "./data_folder",
  "output_folder": "./output",
  "result_file": "phi_scan_results.xls",

  "selected_db": "PSQL_MIMIC",
  "tables_to_scan": ["person", "observation"],

  "data_profile_sample_size": 1000,
  "PHI_SCAN_MODEL": "./phi_scan/XGBClassifier(V220240514).json"
}

Ensure you replace postgresql://userid:password@192.168.0.100:5432/mimic with your actual PostgreSQL connection details and the database name.
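
The two scenarios differ only in selected_db and tables_to_scan, so switching between them can be scripted. The following sketch shows one possible way to do that; `select_source` is a hypothetical helper, and the base configuration simply repeats the example values from this section.

```python
import json


def select_source(config, selected_db, tables):
    """Return a copy of the configuration that scans the given source and tables."""
    if selected_db not in config["available_dbs"]:
        raise ValueError(f"{selected_db!r} is not listed in available_dbs")
    updated = dict(config)  # shallow copy; the original dict is left unchanged
    updated["selected_db"] = selected_db
    updated["tables_to_scan"] = list(tables)
    return updated


base = {
    "available_dbs": {
        "PSQL_MIMIC": ["postgresql://userid:password@192.168.0.100:5432/mimic", "mimiciii"],
        "LOCAL_TEXT_FILES": "LOCAL_TEXT_FILES",
    },
    "text_file_location": "./data_folder",
    "output_folder": "./output",
    "result_file": "phi_scan_results.xls",
    "selected_db": "PSQL_MIMIC",
    "tables_to_scan": ["person", "observation"],
    "data_profile_sample_size": 1000,
    "PHI_SCAN_MODEL": "./phi_scan/XGBClassifier(V220240514).json",
}

local = select_source(base, "LOCAL_TEXT_FILES", ["noshow.csv"])
print(json.dumps(local, indent=2))
```

The printed JSON can be saved as config.json in the working directory. Because the connection string contains a password, avoid committing a populated config.json to a shared repository.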

Step 2: Command to Run the Docker Container

To run the Privacy Scan Tool Docker container, use the following command:

docker run --rm -v $(pwd)/output:/privacy_scan_tool/output -v $(pwd)/config.json:/privacy_scan_tool/config.json -v $(pwd)/data_folder:/privacy_scan_tool/data_folder ghcr.io/chorus-ai/chorus-privacy:main

Explanation of the Docker Command

  • docker run: Runs the Docker container.
  • -v $(pwd)/output:/privacy_scan_tool/output: Maps the local output directory to the container's /privacy_scan_tool/output directory.
  • -v $(pwd)/config.json:/privacy_scan_tool/config.json: Maps the local config.json file to the container's /privacy_scan_tool/config.json file.
  • --rm: Removes the container after it exits.
  • -v $(pwd)/data_folder:/privacy_scan_tool/data_folder: Maps the local data_folder directory to the container's /privacy_scan_tool/data_folder directory.
  • ghcr.io/chorus-ai/chorus-privacy:main: Specifies the Docker image to use.
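
For sites that launch scans from a script, the same command can be assembled as an argument list, which avoids shell quoting issues when the working directory contains spaces. This is a sketch only: `docker_command` and `run_scan` are hypothetical helper names, while the image name, flags, and container paths are copied from the command above.

```python
import subprocess
from pathlib import Path

IMAGE = "ghcr.io/chorus-ai/chorus-privacy:main"
CONTAINER_ROOT = "/privacy_scan_tool"


def docker_command(workdir):
    """Build the docker run argument list with the three volume mappings."""
    workdir = Path(workdir).resolve()
    command = ["docker", "run", "--rm"]
    for name in ("output", "config.json", "data_folder"):
        # Map workdir/<name> on the host to /privacy_scan_tool/<name> in the container.
        command += ["-v", f"{workdir / name}:{CONTAINER_ROOT}/{name}"]
    command.append(IMAGE)
    return command


def run_scan(workdir):
    """Run the container and raise CalledProcessError if it exits non-zero."""
    subprocess.run(docker_command(workdir), check=True)


print(" ".join(docker_command(Path.cwd())))
```

Calling `run_scan(Path.cwd())` would start the scan; it requires Docker to be installed and running.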

Step 3: Running the Tool

  1. Ensure Docker is installed and running on your machine.
  2. Place the config.json file in the current working directory.
  3. Place any required data files in the data_folder directory.
  4. Execute the Docker run command provided above.
  5. The tool will process the data according to the configuration and output the results to the specified output directory.
  6. If a table has no records or fewer than 1000 records, a warning message is also written to the output folder.
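
To anticipate the low-record warning before starting the container, the CSV files in the data folder can be counted in advance. The sketch below is an optional pre-check, not part of the tool. It assumes each CSV file has a single header row, and `small_csv_files` is a hypothetical helper name.

```python
import csv
from pathlib import Path


def small_csv_files(data_folder, min_rows=1000):
    """Return {file name: data row count} for CSV files with fewer than min_rows rows."""
    small = {}
    for path in sorted(Path(data_folder).glob("*.csv")):
        with path.open(newline="") as handle:
            reader = csv.reader(handle)
            next(reader, None)  # skip the header row (assumed to be present)
            rows = sum(1 for _ in reader)
        if rows < min_rows:
            small[path.name] = rows
    return small


for name, rows in small_csv_files("./data_folder").items():
    print(f"{name}: only {rows} data rows")
```

If the folder does not exist or holds no CSV files, the function returns an empty dictionary and nothing is printed.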

Step 4: Review the Results and Follow Up on the Risks Found (same as Step 6 onward of Option 1)

The results are written to the output folder, in the same format described in Step 5 of Option 1.

Reference Materials

  1. Privacy Scan Tool GitHub Repository
  2. GDPR Overview
  3. HIPAA Compliance Guide
  4. K-Anonymity