
Exposome Geocoder – Input Preparation and Usage Guide

Note: This toolkit does not share any Protected Health Information (PHI).

This repository provides a reproducible workflow to geocode patient location data (Phase 1) and link the resulting coordinates with exposome datasets (Phase 2). The workflow ensures that sensitive address data remains local while generating standardized exposure metrics that can be shared with the central server without identifiers.

This SOP describes the workflow for running the code to geocode patient location data and link latitude/longitude coordinates with exposome datasets. All code is executed locally at each site. Only the exposure tables containing exposome data are shared with the central server; no address-level data is transmitted or stored centrally. Sites should use the most granular address information available to them, or latitude/longitude coordinates.

Demo video: Watch here




Overview

This workflow uses two separate Docker containers to support end-to-end geocoding and data linkage:

  1. Exposome Geocoder Container (prismaplab/exposome-geocoder:1.0.3)
    Performs address or coordinate-based geocoding to generate Census Tract (FIPS 11-digit) codes using DeGAUSS backend tools.

  2. Exposome Linkage Container (ghcr.io/chorus-ai/chorus-postgis-exposure:main)
    Integrates the geocoded outputs with relevant environmental and social determinant datasets to produce analysis-ready files.

Together, these containers enable:

  • Address and latitude/longitude-based geocoding
  • OMOP CDM geocoding extraction and processing
  • GIS linkage with PostGIS-SDoH indices (ADI, SVI, AHRQ)

Input Options

Phase 1 (Geocoding) Input: To generate coordinates, you need to prepare only ONE of the following data elements per encounter (Option 1: Address, Option 2: Coordinates, or Option 3: OMOP CDM tables).

Phase 2 (Linkage) Input: Regardless of the input option chosen for Phase 1, the final output MUST be transformed into two specific CSV files to run Phase 2.

  • LOCATION.csv: Contains the physical coordinates (latitude, longitude) and identifiers (location_id).
  • LOCATION_HISTORY.csv: Contains the temporal mapping of a person (entity_id, which is the same as person_id) to a location (location_id) over a specific time range (start_date, end_date).

See Appendix A for the Data Dictionary and population logic.
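As a concrete illustration, the two Phase 2 files can be assembled with a few lines of pandas. This is a minimal sketch using only the required columns from Appendix A; the identifiers and coordinates are made-up sample values.

```python
import pandas as pd

# Minimal LOCATION / LOCATION_HISTORY tables with the required columns
# from Appendix A. Values here are illustrative, not real patient data.
location = pd.DataFrame([
    {"location_id": 1, "latitude": 36.75891146, "longitude": -119.7902719},
])
location_history = pd.DataFrame([
    {"location_id": 1, "entity_id": 101, "domain_id": "PERSON",
     "start_date": "1998-01-01", "end_date": "2020-01-01"},
])

# Phase 2 expects the two files side by side in the mounted folder.
location.to_csv("LOCATION.csv", index=False)
location_history.to_csv("LOCATION_HISTORY.csv", index=False)

# Sanity check: every location referenced in the history must exist.
assert set(location_history["location_id"]).issubset(set(location["location_id"]))
```

Keeping `location_id` consistent across both files is what lets Phase 2 join a person's time range to a coordinate.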

Option 1: Address

Sample input files here

  • Format A: Multi-Column Address
| street | city | state | zip | year | entity_id |
|---|---|---|---|---|---|
| 1250 W 16th St | Jacksonville | FL | 32209 | 2019 | 1 |
| 2001 SW 16th St | Gainesville | FL | 32608 | 2019 | 2 |

Tip: Street and ZIP are required. Missing these fields may lead to imprecise geocoding.

  • Format B: Single Column Address
| address | year | entity_id |
|---|---|---|
| 1250 W 16th St Jacksonville FL 32209 | 2019 | 1 |
| 2001 SW 16th St Gainesville FL 32608 | 2019 | 2 |

Optional Supporting Files

Including the following optional files will help streamline the end-to-end workflow between geocoding and exposome linkage:

  • Important: Do not date-shift your LOCATION/LOCATION_HISTORY files before linkage. Date shifting (if used) should occur post linkage in Step 4.

  • LOCATION.csv

  • LOCATION_HISTORY.csv

If these files are provided during geocoding, the output will automatically include the updated latitude and longitude information required for the PostGIS linkage container.

If they are not provided, users will need to manually update their LOCATION files with the geocoded latitude/longitude before executing the commands for linkage.
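A minimal sketch of that manual update, assuming the geocoder output shares a `location_id` key with your LOCATION.csv (adjust the join key if your site links records differently):

```python
import pandas as pd

# Sketch: fold geocoded lat/lon back into an existing LOCATION table.
# Stand-in frames below take the place of pd.read_csv("LOCATION.csv")
# and the <filename>_with_coordinates.csv output from Step 2.
location = pd.DataFrame(
    {"location_id": [1, 2], "latitude": [None, None], "longitude": [None, None]}
)
geocoded = pd.DataFrame(
    {"location_id": [1, 2], "latitude": [36.7589, 29.6342],
     "longitude": [-119.7903, -82.3433]}
)

# Drop the empty coordinate columns, then attach the geocoded ones.
updated = (
    location.drop(columns=["latitude", "longitude"])
            .merge(geocoded[["location_id", "latitude", "longitude"]],
                   on="location_id", how="left")
)
updated.to_csv("LOCATION.csv", index=False)
```

A left join keeps every LOCATION row even if a few records failed to geocode; those rows simply carry empty coordinates.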

LOCATION.csv (Follows CDM format)
| location_id | address_1 | address_2 | city | state | zip | county | location_source_value | country_concept_id | country_source_value | latitude | longitude |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 1248 N Blackstone Ave | | FRESNO | CA | 93703 | | | UNITED STATES OF AMERICA | UNITED STATES OF AMERICA | 36.75891146 | -119.7902719 |
LOCATION_HISTORY.csv (Follows CDM format)
| location_id | relationship_type_concept_id | domain_id | entity_id | start_date | end_date |
|---|---|---|---|---|---|
| 1 | 328481 | 147 | 3143763 | 1998-01-01 | 2020-01-01 |

Option 2: Coordinates

Sample input files here

| latitude | longitude | entity_id |
|---|---|---|
| 30.353463 | -81.6749 | 1 |
| 29.634219 | -82.3433 | 2 |

As with address-based input, including LOCATION.csv and LOCATION_HISTORY.csv enables seamless downstream processing with the linkage container.


Option 3: OMOP CDM

| Table | Required Columns |
|---|---|
| person | person_id |
| visit_occurrence | visit_occurrence_id, visit_start_date, visit_end_date, person_id |
| location | location_id, address_1, address_2, city, state, zip, location_source_value, country_concept_id, country_source_value, latitude, longitude |
| location_history | location_id, relationship_type_concept_id, domain_id, entity_id, start_date, end_date |

If your site already maintains an OMOP CDM with the required elements, it can be used to prepare the LOCATION and LOCATION_HISTORY CSV tables required by Phase 2.
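For illustration, that extraction can be sketched as below. SQLite stands in for your site's actual database (the `OMOP_to_FIPS.py` container performs this step for you when given connection credentials); table and column names follow the OMOP CDM.

```python
import sqlite3
import pandas as pd

# In-memory SQLite stands in for the site RDBMS; swap the connection
# string for your real server. Sample rows are illustrative only.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE location (location_id INTEGER, address_1 TEXT, "
             "city TEXT, state TEXT, zip TEXT, latitude REAL, longitude REAL)")
conn.execute("CREATE TABLE location_history (location_id INTEGER, "
             "relationship_type_concept_id INTEGER, domain_id TEXT, "
             "entity_id INTEGER, start_date TEXT, end_date TEXT)")
conn.execute("INSERT INTO location VALUES (1, '1248 N Blackstone Ave', "
             "'FRESNO', 'CA', '93703', 36.7589, -119.7903)")
conn.execute("INSERT INTO location_history VALUES "
             "(1, 0, 'PERSON', 101, '1998-01-01', '2020-01-01')")

# Export the two tables as the Phase 2 input CSVs.
loc = pd.read_sql("SELECT * FROM location", conn)
hist = pd.read_sql("SELECT * FROM location_history", conn)
loc.to_csv("LOCATION.csv", index=False)
hist.to_csv("LOCATION_HISTORY.csv", index=False)
```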


Usage Guide

Step 1: Prepare Input Data

Prepare only ONE of the data elements as indicated under the Input Options per encounter.
For Option 1 (Address) or Option 2 (Coordinates), your data must be in a CSV file format.

Folder Structure

  • Place the CSV file(s) in a dedicated folder
    • 📂 input_address/ (for address-based data)
    • 📂 input_coordinates/ (for coordinate-based data)
  • Optionally, include:
    • LOCATION.csv
    • LOCATION_HISTORY.csv

⚠️ Only .csv files are supported. Convert .xlsx or other formats before running the tool.

Guidance on Populating LOCATION_HISTORY.csv:

This table links a person to a specific location for a specific time range.

  • If you have full residential history: Use the actual move-in (start_date) and move-out (end_date) dates.
  • If you only have the location for the index ICU encounter and do not have access to previous residential addresses with date stamps, use the following logic to populate LOCATION_HISTORY.csv: if linking to specific encounters, set start_date to the relevant encounter's admission date, i.e., visit_start_date in the visit_occurrence table. Set end_date to NULL.
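The fallback logic above can be sketched as follows; the person-to-location lookup table is hypothetical, and column names follow the OMOP CDM:

```python
import pandas as pd

# When only the index encounter's address is known, start_date is
# copied from visit_start_date and end_date is left empty (NULL).
visits = pd.DataFrame([
    {"person_id": 101, "visit_occurrence_id": 9001,
     "visit_start_date": "2020-03-01"},
])
person_location = pd.DataFrame([  # hypothetical person -> location lookup
    {"person_id": 101, "location_id": 1},
])

joined = visits.merge(person_location, on="person_id")
history = pd.DataFrame({
    "location_id": joined["location_id"],
    "entity_id": joined["person_id"],
    "domain_id": "PERSON",
    "start_date": joined["visit_start_date"],
    "end_date": pd.NA,   # NULL: no known move-out date
})
history.to_csv("LOCATION_HISTORY.csv", index=False)
```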

Step 2: Generate FIPS Codes

Container: prismaplab/exposome-geocoder:1.0.3
Ensure Docker Desktop is running.

This step uses the Exposome Geocoder container to:

  • Convert addresses or coordinates into latitude/longitude
  • Assign 11-digit Census Tract (FIPS) codes

For CSV Input (Option 1 & 2)

For macOS / Linux / Ubuntu
docker run -it --rm \
-v "$(pwd)":/workspace \
-v /var/run/docker.sock:/var/run/docker.sock \
-e HOST_PWD="$(pwd)" \
-w /workspace \
prismaplab/exposome-geocoder:1.0.3 \
/app/code/Address_to_FIPS.py -i <input_folder_path>
For Windows
  • Open Command Prompt or PowerShell
  • Run the command wsl
  • Execute the same command as above inside your WSL terminal.

Example:

If your file is named patients_address.csv inside 📂input_address/, run:

docker run -it --rm   -v "$(pwd)":/workspace   -v /var/run/docker.sock:/var/run/docker.sock   -e HOST_PWD="$(pwd)"   -w /workspace   prismaplab/exposome-geocoder:1.0.3   /app/code/Address_to_FIPS.py -i input_address

For OMOP Input (Option 3)

To extract and geocode directly from an OMOP database:

docker run -it --rm \
-v /var/run/docker.sock:/var/run/docker.sock \
-v "$(pwd)":/workspace \
-e HOST_PWD="$(pwd)" \
-w /workspace \
prismaplab/exposome-geocoder:1.0.3 \
/app/code/OMOP_to_FIPS.py \
--user <your_username> \
--password <your_password> \
--server <server_address> \
--port <port_number> \
--database <database_name>

Note on Dependencies (Firewall Warning):

The Address_to_FIPS.py script attempts to pull Docker images automatically. If you have a strict firewall, you may need to pull these images manually before running the script:

docker pull ghcr.io/degauss-org/geocoder:3.3.0
docker pull ghcr.io/degauss-org/census_block_group:0.6.0

Step 3: Output Structure

After running the geocoder container (for Option 1, 2, or 3), the tool generates output files in the output/ folder.

CSV Input (Option 1 & 2)

Sample outputs demo/address_files/output

Files Generated

Each input file produces:

  • <filename>_with_coordinates.csv — input + latitude/longitude
  • <filename>_with_fips.csv — input + FIPS codes

Output Folder Example

output/
├── coordinates_from_address_<timestamp>.zip
├── geocoded_fips_codes_<timestamp>.zip

<timestamp> indicates when the script was executed (e.g., 20250624_150230).

If LOCATION.csv and LOCATION_HISTORY.csv were included, they are copied to output/ but not zipped.

IMPORTANT TO NOTE

Phase 2 input preparation note: If you used Option 1 (Address) and did not provide a pre-built LOCATION.csv, you can use the CSV inside coordinates_from_address_<timestamp>.zip (generated in Phase 1) as the source of geocoded latitude/longitude values to populate LOCATION.csv. Ensure your location_id values are consistent between LOCATION.csv and LOCATION_HISTORY.csv before running Phase 2.
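A sketch of that preparation, assuming the geocoder output columns are named `lat` and `lon` (verify against your actual `_with_coordinates.csv`); the stand-in archive is built inline only so the example runs end to end:

```python
import io
import zipfile
import pandas as pd

zip_path = "coordinates_from_address_20250624_150230.zip"  # example timestamp

# Build a stand-in archive so the sketch is self-contained; at a real
# site this file already exists in output/ after Phase 1.
demo = pd.DataFrame({"entity_id": [1], "lat": [30.353463], "lon": [-81.6749]})
with zipfile.ZipFile(zip_path, "w") as zf:
    zf.writestr("patients_address_with_coordinates.csv",
                demo.to_csv(index=False))

# Pull the geocoded CSV out of the archive and seed LOCATION.csv.
with zipfile.ZipFile(zip_path) as zf:
    name = next(n for n in zf.namelist() if n.endswith("_with_coordinates.csv"))
    coords = pd.read_csv(io.BytesIO(zf.read(name)))

location = pd.DataFrame({
    "location_id": range(1, len(coords) + 1),
    "latitude": coords["lat"],
    "longitude": coords["lon"],
})
location.to_csv("LOCATION.csv", index=False)
```

Remember the note above: the generated `location_id` values must match those used in LOCATION_HISTORY.csv before Phase 2 runs.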

Zipped Output Columns Description

| Column | Description |
|---|---|
| Latitude | Latitude returned from the geocoder |
| Longitude | Longitude returned from the geocoder |
| geocode_result | Outcome of geocoding: geocoded for successful matches, Imprecise Geocode if not precise |
| reason | Failure reason if applicable (see Reason Column Values) |
Reason Column Values

Used when geocoding fails or is imprecise. Possible values include:

  • Hospital address given – Detected from known hardcoded hospital addresses.
  • Street missing – No street info provided.
  • Blank/Incomplete address – Address is empty or has missing components.
  • Zip missing – ZIP code not provided.

💡 Tip: You can expand hospital detection by adding known addresses to HOSPITAL_ADDRESSES in Address_to_FIPS.py.

Formatting Note for HOSPITAL_ADDRESSES:

  • Single-line string
  • Lowercase letters and numbers only
  • No commas or special characters
  • Fields separated by single spaces
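A small helper applying those formatting rules might look like this (a sketch, not part of the shipped script):

```python
import re

def normalize_hospital_address(raw: str) -> str:
    """Normalize an address to the HOSPITAL_ADDRESSES format described
    above: one line, lowercase letters/digits only, single spaces."""
    s = raw.lower()
    s = re.sub(r"[^a-z0-9 ]+", " ", s)     # drop commas/special characters
    return re.sub(r"\s+", " ", s).strip()  # collapse runs of whitespace

print(normalize_hospital_address("1250 W. 16th St,\nJacksonville, FL 32209"))
# -> 1250 w 16th st jacksonville fl 32209
```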

OMOP Input (Option 3)

Sample outputs: demo/OMOP/output

Folder Structure

OMOP_data/
├── valid_address/ # Records with address, no lat/lon
├── invalid_lat_lon_address/ # Records missing both address and lat/lon
├── valid_lat_long/ # Records with lat/lon

OMOP_FIPS_result/
├── address/
│ ├── address_with_coordinates.zip # CSVs with lat/lon from address
│ └── address_with_fips.zip # CSVs with FIPS codes
├── latlong/
│ └── latlong_with_fips.zip # CSVs with FIPS from coordinates
├── invalid/ # Usually empty; no usable location data

LOCATION.csv
LOCATION_HISTORY.csv

Step 4: GIS Linkage with PostGIS-Exposure Tool

Purpose:
Spatially joins the lat/lon (and FIPS) from geocoding with geospatial indices (ADI, SVI, AHRQ) and produces EXTERNAL_EXPOSURE.csv.


Prerequisites for GIS Linkage

  • Docker installed.
  • Clone postgis-exposure repository
  • Update the LOCATION and LOCATION_HISTORY files to include the geocoded lat/lon from Step 2 (not needed if you included these files during the geocoding step).
  • Ensure DATA_SRC_SIMPLE.csv and VRBL_SRC_SIMPLE.csv files are available (centrally managed; no edits required).
  • Important: Do not date-shift your LOCATION/LOCATION_HISTORY files before linkage. Date shifting (if used) should occur following this step.

Sample DATA_SRC_SIMPLE.csv and VRBL_SRC_SIMPLE.csv: here


Expected Outputs

  • EXTERNAL_EXPOSURE.csv containing linked indices (ADI, SVI, AHRQ metrics).

GIS Linkage Workflow

  1. Start Postgres/PostGIS container following the instructions in the postgis-exposure repository. Container sequence: start/load database → ingest location tables → run the produce script. First Docker command (prepares the database):

    docker run --rm --name postgis-chorus \
    --env POSTGRES_PASSWORD=dummy \
    --env VARIABLES=134,135,136 \
    --env DATA_SOURCES=1234,5150,9999 \
    -v $(pwd)/test/source:/source \
    -d ghcr.io/chorus-ai/chorus-postgis-exposure:main
    • Replace VARIABLES with the comma-separated list of variable IDs you need from VRBL_SRC_SIMPLE.csv.
    • Replace DATA_SOURCES with the relevant data source IDs (from DATA_SRC_SIMPLE.csv).
  2. **Generate the external exposure file:**

    docker exec postgis-chorus /app/produce_external_exposure.sh
  3. Output: EXTERNAL_EXPOSURE.csv will appear in your mounted directory (e.g., ./test/source).
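If it helps to look up the IDs programmatically, a sketch like the following can build the VARIABLES value. The column names `variable_id` and `variable_name` are assumptions; check the headers in your copy of VRBL_SRC_SIMPLE.csv.

```python
import pandas as pd

# Stand-in for pd.read_csv("VRBL_SRC_SIMPLE.csv"); IDs and names here
# are illustrative, not the real centrally managed values.
vrbl = pd.DataFrame({
    "variable_id": [134, 135, 136, 200],
    "variable_name": ["adi_natrank", "svi_total", "ahrq_index", "pm25"],
})

wanted = ["adi_natrank", "svi_total", "ahrq_index"]
ids = vrbl.loc[vrbl["variable_name"].isin(wanted), "variable_id"]
variables_env = ",".join(str(i) for i in ids)
print(variables_env)  # pass this as --env VARIABLES=... to docker run
```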

Notes & Tips

  • Run these commands in Terminal (Mac) or WSL/PowerShell/Command Prompt on Windows; WSL is more robust for Docker on Windows.
  • If your site needs more variables, expand VARIABLES accordingly.
  • Important: The container may only run successfully once. To rerun, you may need to delete the container and image, then pull the image again.

Step 5: Validate & Inspect Outputs

  • Open EXTERNAL_EXPOSURE.csv. Confirm:
    • Patient ID, lat, lon, FIPS
    • ADI, SVI, AHRQ, and VRBL-coded fields
  • Spot-check a few records for accuracy.
  • If errors:
    • Ensure LOCATION has valid lat/lon/FIPS
    • Confirm VARIABLES and DATA_SOURCES are correct
    • Check mount paths
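These spot checks can be partially automated. A sketch, using a stand-in for `pd.read_csv("EXTERNAL_EXPOSURE.csv")` and the Appendix A column names:

```python
import pandas as pd

# Stand-in row; at a real site, load the actual file instead.
exposure = pd.DataFrame([
    {"external_exposure_id": 1, "location_id": 1, "person_id": 101,
     "exposure_source_value": "adi_natrank", "value_as_number": 72.0},
])

problems = []
if exposure["person_id"].isna().any():
    problems.append("missing person_id")
if exposure["value_as_number"].isna().all():
    problems.append("no numeric exposure values at all")
if exposure.duplicated(subset=["external_exposure_id"]).any():
    problems.append("duplicate external_exposure_id")

print(problems or "basic checks passed")
```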

Step 6: Optional - Site-level Date Shifting

Purpose: Anonymize temporal data while preserving relative timelines.

Guidelines:

  • Apply date shifts locally before upload — do not date-shift prior to GIS linkage.
  • Input: EXTERNAL_EXPOSURE.csv (from Step 4)
  • Output: EXTERNAL_EXPOSURE_date_shifted.csv

See Date Shifting SOP for More Details.
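As a rough illustration only (the Date Shifting SOP is authoritative), a fixed per-person offset preserves relative timelines while shifting absolute dates; the offsets below are hypothetical:

```python
import pandas as pd

# Stand-in for pd.read_csv("EXTERNAL_EXPOSURE.csv") from Step 4.
exposure = pd.DataFrame([
    {"person_id": 101, "exposure_start_date": "2020-03-01",
     "exposure_end_date": "2020-04-01"},
])
offsets = {101: -37}  # hypothetical per-person shift, in days

# Apply the same offset to both dates so durations are unchanged.
shift = pd.to_timedelta(exposure["person_id"].map(offsets), unit="D")
for col in ("exposure_start_date", "exposure_end_date"):
    exposure[col] = (pd.to_datetime(exposure[col]) + shift).dt.date

exposure.to_csv("EXTERNAL_EXPOSURE_date_shifted.csv", index=False)
```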


Step 7: Upload & Centralized De-identification

  1. Upload the (optionally date-shifted) EXTERNAL_EXPOSURE.csv to the central repository.
  2. The central team will apply further de-identification.

References & sample files

Geocoding

GIS Linkage

  • Sample files: PostGIS Exposure CSVs
    • Site-specific: LOCATION, LOCATION_HISTORY
    • Centrally managed: DATA_SRC_SIMPLE, VRBL_SRC_SIMPLE

The following office hour sessions provide additional context and demonstrations related to this SOP:

  • [08-07-25] Integration of GIS and SDoH data with OMOP

  • [09-18-25] Processing OMOP location_history table into external_exposure table

  • [09-25-25] End-to-end demo for capturing GIS data with OMOP

  • [10-16-2025] End-to-end demo for capturing GIS data with OMOP or address/latlong

    • Video Recording | Transcript
    • Complete workflow demonstration for GIS data capture and processing based on updated documentation

Appendix

Appendix A: Data Dictionary and Logic

To successfully run Phase 2, your data must match the OMOP CDM definitions below.


1. LOCATION Table

Represents physical location or address information.

| Field | Description |
|---|---|
| location_id | The unique key assigned to a Location. Each instance of a Location in the source data should use this key. [REQUIRED] |
| address_1 | First line of the address. |
| address_2 | Second line of the address. |
| city | City name. |
| state | State name. |
| zip | ZIP codes are handled as strings (3-digit or 5-digit). |
| county | County name. |
| latitude | Geocoded latitude (Float). [REQUIRED] |
| longitude | Geocoded longitude (Float). [REQUIRED] |

2. LOCATION_HISTORY Table

Stores relationships between persons and geographic locations over time.

| Field | Description |
|---|---|
| location_id | References the location_id in the LOCATION table. [REQUIRED] |
| entity_id | Unique identifier for the entity (e.g., person_id). [REQUIRED] |
| domain_id | Domain of the entity. Must be PERSON for this pipeline. [REQUIRED] |
| start_date | Date the relationship started. [REQUIRED] |
| end_date | Date the relationship ended. |

3. EXTERNAL_EXPOSURE Table

After Phase 2 execution, the pipeline generates the external_exposure table with the columns below.

| Variable | Description |
|---|---|
| external_exposure_id | Unique row identifier for the exposure record. |
| location_id | Foreign key linking to the input LOCATION.csv file. |
| person_id | Foreign key linking to entity_id in the input LOCATION_HISTORY.csv file. |
| exposure_start_date | Start date of the exposure event (calculated overlap). |
| exposure_end_date | End date of the exposure event. |
| exposure_source_value | Name of the environmental variable linked. |
| value_as_number | Numerical value of the environmental variable. |
| unit_concept_id | OMOP Concept ID representing the unit of measure. |
| exposure_concept_id | OMOP Concept ID representing the environmental variable. |
| exposure_type_concept_id | OMOP Concept ID for the type of exposure. |
| value_as_concept_id | OMOP Concept ID for categorical results. |

Note: This table reflects the exposure data generated as output of Phase 2.


Appendix B: Geocoding Workflow

This guide outlines the scripts, workflows, and Docker-based DeGAUSS toolkit used to generate latitude and longitude coordinates from patient data. The process follows a two-step geocoding workflow powered by DeGAUSS and executed locally via Docker containers.

Method: DeGAUSS Toolkit (Docker-based)

DeGAUSS consists of two Docker containers:

  1. Geocoder (3.3.0) — Converts address to latitude/longitude
  2. Census Block Group (0.6.0) — Converts latitude/longitude to Census Tract FIPS codes

| Step | Purpose | Docker Image |
|---|---|---|
| 1 | Address → Coordinates | ghcr.io/degauss-org/geocoder:3.3.0 |
| 2 | Coordinates → FIPS | ghcr.io/degauss-org/census_block_group:0.6.0 |

DeGAUSS Docker Commands (Executed Internally)

# Step 1: Get Coordinates from Address
docker run --rm -v "ABS_OUTPUT_FOLDER:/tmp" \
ghcr.io/degauss-org/geocoder:3.3.0 \
/tmp/<your_preprocessed_input.csv> <threshold>

# Step 2: Get FIPS from Coordinates
docker run --rm -v "ABS_OUTPUT_FOLDER:/tmp" \
ghcr.io/degauss-org/census_block_group:0.6.0 \
/tmp/<your_coordinate_output.csv> <year>

Replace values:

  • ABS_OUTPUT_FOLDER → absolute path to your output directory
  • <threshold> → numeric value (e.g., 0.7)
  • <year> → either 2010 or 2020
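The two internal DeGAUSS calls can be scripted as below. This is a sketch of what `Address_to_FIPS.py` drives for you, not a replacement for it; the file names are placeholders, and DeGAUSS writes its output next to the input with a version-stamped suffix.

```python
import subprocess
from pathlib import Path

def build_degauss_cmd(out_dir: Path, image: str, input_name: str, arg: str):
    """Assemble one docker invocation: mount out_dir at /tmp, then pass
    the input file and one positional argument (threshold or year)."""
    return ["docker", "run", "--rm", "-v", f"{out_dir.resolve()}:/tmp",
            image, f"/tmp/{input_name}", arg]

# Step 1: address -> coordinates (threshold 0.7)
step1 = build_degauss_cmd(Path("output"),
                          "ghcr.io/degauss-org/geocoder:3.3.0",
                          "preprocessed_input.csv", "0.7")
# Step 2: coordinates -> census tract FIPS (vintage 2020)
step2 = build_degauss_cmd(Path("output"),
                          "ghcr.io/degauss-org/census_block_group:0.6.0",
                          "coordinate_output.csv", "2020")

if __name__ == "__main__":  # requires Docker and access to ghcr.io
    subprocess.run(step1, check=True)
    subprocess.run(step2, check=True)
```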

Script Highlights

While our scripts can also generate FIPS codes in addition to latitude and longitude coordinates, as detailed below, Phase 2 requires only the latitude and longitude coordinates.

Address_to_FIPS.py Logic

This script handles CSV-based input:

  • Reads CSV files
  • Normalizes address or uses lat/lon
  • Runs DeGAUSS Docker container to generate:
    • Latitude/Longitude (via ghcr.io/degauss-org/geocoder)
    • FIPS codes (via ghcr.io/degauss-org/census_block_group)
  • Packages outputs into ZIP
OMOP_to_FIPS.py Logic

This script integrates directly with OMOP CDM:

  • Extracts OMOP CDM data
  • Categorizes into valid/invalid address or coordinates
  • Executes FIPS generation (same as CSV workflow)
  • Packages outputs into ZIP