Waveform File Format

1. Purpose

This document defines acceptable file formats for waveform data contributions in the CHoRUS project. The purpose is to ensure consistent formatting of clinical waveform data. For a high level summary of the process for contributing waveform data to CHoRUS, see the SOP on contributing waveforms. All waveform data will be centrally converted to WFDB. We encourage submissions in your preferred file format—ideally with explicit timestamps—to facilitate a consistent and robust transformation to WFDB.

2. File Structure

All waveform data must adhere to the following directory structure to facilitate ingestion and processing.

2.1. Example File Structure

.
├── OMOP
|  └── waveform_visit_links.csv             # File linking visit occurrence and waveform session
├── SUBMISSION.md  
└── <person_id>                             # Patient-specific subdirectory
   └── waveforms                            # "waveforms" folder
       ├── telemetry                        # Folder for telemetry waveforms
       |   └── <session_id>                 # Session ID folder
       |       ├── <session_id>.<ext>       # High-frequency waveform or annotation file(s) 
       |       └── <session_id>n.csv        # Low-frequency "numerics" file  
       └── eeg                              # Folder for EEG waveforms
           └── <session_id>                 # Session ID folder
               ├── <session_id>.<ext>       # High-frequency waveform or annotation file(s) 
               └── <session_id>n.csv        # Low-frequency "numerics" file 

2.2. Explanation of File Structure

OMOP
- Root directory for your OMOP tables.
waveform_visit_links.csv
- This file links each waveform recording / session to a visit occurrence. Submit this file under your /OMOP/ folder. See the section on Linking Waveform Sessions to Visits below for details about this file.
SUBMISSION.md
- File outlining the details of your submission, as described elsewhere in this SOP. Submit this file at the same level as your /OMOP/ and <person_id> folders.
<person_id>/:
- Root directory for all patient data. Each patient should have a unique identifier (e.g., 10001).
waveforms/:
- Patients may have multiple types of data. The waveforms folder is intended for waveform data.
telemetry/:
- All telemetry waveforms should be submitted under the telemetry folder. High-frequency waveform signal files and annotation files should be named <session_id> followed by the appropriate extension for a signal file or annotation file in the submitted format. Low-frequency "numerics" signals should be submitted in a CSV file with a "n" before the extension.
eeg/:
- All EEG waveforms should be submitted under the EEG folder. EEG waveform signal files and annotation files should be named <session_id> followed by the appropriate extension for a signal file or annotation file in the submitted format. Low-frequency "numerics" signals should be submitted in a CSV file with a "n" before the extension.
<session_id>/:
- Session ID indicates a unique recording for the patient. A single patient (i.e. a single <person_id>) may have multiple sessions.

3. Preparing Waveforms

Ultimately, all waveform files will be run through the CHoRUS to WFDB converter to ensure consistent handling of signals. Data should be prepared such that it is compatible with the CHoRUS to WFDB converter. If you are manipulating or converting the format of your waveform data prior to submission, please do so in a manner consistent with the guidelines below. We don't recommend manipulating your waveform data beyond deidentifying it per your site requirements for submission to CHoRUS.

3.1. Session Definition

A session (or record in WFDB terminology) represents a single "continous" recording from a patient. One or more sessions may be captured within a single hospital admission. Ideally, each session should contain data continously collected from a single device. Therefore, telemetry and EEG waveforms should be assigned unique session_ids. However, if detailed device identification is not available, it is acceptable to store all telemetry (or EEG) data for a single visit as one session. A session is sometimes identifiable by a “session ID” that is captured within the recording. Please supply session_id as an integer.

3.2. Time Shifting

If your EHR, waveform, and imaging data are being date-shifted, ensure that all data for a given subject is shifted by the same duration. See Multimodal-Linkage for more detail.

3.3. Gaps and Overlaps

When reviewing a waveform signal you may notice the following issues:

Intervals where data is missing, typically due to a network interruption or a disconnected device
Points where the timestamps of the data appear to jump forward or backward in time despite the data itself being continuous
Jitter in the timestamps in cases where they were not recorded by the original device

The goal in handling these issues is to maintain integrity of the waveform in order to enable machine analysis. CHoRUS central will attempt to correct for these issues if provided the necessary information as outlined in this SOP. If you have already manipulated your waveform data to handle gaps and overlaps (this is not recommended), please indicate the following in your SUBMISSION.md (template available here):

Whether you were able to differentiate between gaps due to missing data and other types of gaps (e.g. a gap due to a clock resync)
How you modified your data for each gap type

3.4. Standardized Channel Naming

We expect the signal labels for each channel to follow the ISO/IEEE 11073 standards. The signal labels for the requested waveforms can be found here:

As requested below you will need to map each signal from your submitted waveforms to the appropriate signal label provided in the links above. If you have multiple channels in your data which correspond to a single standardized signal label, simply map them as such. For example, if you have two arterial blood pressure signals ART1, and ART3, you can map both of them to MDC_PRESS_BLD_ART. The addition of numeric suffixes to the standardized labels will be handled centrally.

3.5. Waveform Quality

For the waveforms to be useful in research, they have to meet some basic quality standards. In particular, the samples need to be collected often enough and the resolution needs to be precise enough to capture relevant physiological events. The waveform conversion page, with the standardized signal labels mentioned above, also provides a minimum sample rate and resolution for each signal type. This indicates the required sampling rate and resolution when the signal was originally recorded. Do not upsample in an effort to meet the minimum requirements, as that will not produce a higher quality waveform.

3.6. Low Frequency Signals ("Numerics")

Low frequency signals (e.g. heart rate) should be saved as a CSV file in the same folder as the high frequency waveforms. Please do not resample the low frequency signals in an effort to create a uniform time interval. Please note that dates in numerics data must be shifted in manner consistent with the other electronic health record and waveform data being submitted, as described in the Time Shifting section above.

4. Submitting Waveforms

Please submit waveforms in a format that closely resembles the output from your monitor, where possible. When waveform data is submitted with explicit timestamps, we can take advantage of tools which help correct for gaps in an appropriate manner.

4.1. Submit in Preferred Format with a Mapping File

Submissions should either be formatted as pair-based or stream-based:

Pair-based: one timestamp for each sample
Stream-based: more than one sample per row with a corresponding "start" time and "end" time for the stream. Please indicate whether the start and end times correspond the beginning or end of the first and last samples.

In your SUBMISSION.md file, please describe your preferred format and include these details as appropriate:

A description of each variable (e.g. StartTime represents the time at the start of the first sample in the message stream)
The data type for each variable
Units of each variable, where useful. For the signal data, please provide the units of the amplitude. Also, please provide time units (milliseconds, microseconds, epoch time, etc.) for the variables representing the time of the samples.
Details on how the data is encoded, if applicable
Details on whether sample intervals are uniform across waveform types and channels. High-frequency waveforms should have uniform sample intervals, but if this is not the case please explain why. We expect that numerics files may have non-uniform sample intervals.
Pair-based submissions
- Instructions for how to determine the sample number (or "monotonic time") at the start of each segment / message stream.
- Specification of the timestamp variables that represent the wall-clock time at the start and end of each segment / message stream (including whether they correspond to the start or end of the first / last sample).

Importantly, please submit a mapping file as requested in the next subsection.

4.1.1. Mapping File

Please submit mapping file(s) which will allow us to convert the site specific information from your waveform data into WFDB format. Please submit the file as <site_name>_wf_<telemetry_or_eeg>_mapping.csv. If your site is submitting EEG waveforms in addition to telemetry waveforms, please provide a separate mapping file for the EEG waveforms.

The mapping file should have the following columns. For each signal in your data (e.g. ECG lead II, PPG, EEG lead FP1, etc.) add a row which supplies this information:

signal_id: the (integer, or string) variable that indicates signal name in your submitted format
standardized_label: the standardized signal label name (string) that your signal_id maps to. See standardized channel naming above for details on how to do this mapping
sampling_rate: the nominal sampling rate for the channel, in Hz (float). Please provide as much precision as is known to you (e.g. 249.987 Hz).
units: a string giving the units of the channel

Pair-based submissions should also have:

timestamp_id: the (integer, or string) variable that identifies the master timestamp in your data corresponding to the start of a given sample. If your timestamps are implicit (start_time + i*sampling_rate), this should point to the variable which has the start time for the waveform.

Stream-based submissions should also have:

start_time: the (integer, or string) variable that identifies the start of the data stream for each row. Please remember to indicate (in your SUBMISSION.md) whether the timestamp corresponds to the start or end of the first sample in the stream.
end_time: the (integer, or string) variable that identifies the end of the data stream for each row. Please remember to indicate (in your SUBMISSION.md) whether the timestamp corresponds to the start or end of the last sample in the stream.

Submit your mapping file(s) to CHoRUS to WFDB converter repo. Provide the mapping file(s) under mappings/<site>/

5. Linking Waveform Sessions to Visits

A table named waveform_visit_links.csv should be provided under the /OMOP/ folder. This table should have a column for visit_occurrence_id, visit_detail_id, and session_id. If the data collection system does not provide a direct linkage between session_id and visit_occurrence_id, it may be necessary to infer this linkage based on the start and end time of the session but this should only be done as a last resort. Keep in mind that the session start and end may not exactly match the admission, discharge, or transfer time in the EHR. See Multimodal-Linkage for more detail.

Please outline the process you used for linking session_id to visit_detail_id in your SUBMISSION.md file.

6. Revision History

Version	Date	Description
1.0	2024-11-21	Initial version
2.0	2025-02-10	Clarified that waveforms should be submitted in preferred format. Updated file structure.

7. Contact Information

CHoRUS Waveform Working Group
- Email: Brian Gow <briangow@mit.edu>

1. Purpose​

2. File Structure​

2.1. Example File Structure​

2.2. Explanation of File Structure​

3. Preparing Waveforms​

3.1. Session Definition​

3.2. Time Shifting​

3.3. Gaps and Overlaps​

3.4. Standardized Channel Naming​

3.5. Waveform Quality​

3.6. Low Frequency Signals ("Numerics")​

4. Submitting Waveforms​

4.1. Submit in Preferred Format with a Mapping File​

4.1.1. Mapping File​

5. Linking Waveform Sessions to Visits​

6. Revision History​

7. Contact Information​

8. References​