Waveform File Format
1. Purpose
This document defines acceptable file formats for waveform data contributions in the CHoRUS project. The purpose is to ensure consistent formatting of clinical waveform data. For a high level summary of the process for contributing waveform data to CHoRUS, see the SOP on contributing waveforms. All waveform data will be centrally converted to WFDB. We encourage submissions in your preferred file format—ideally with explicit timestamps—to facilitate a consistent and robust transformation to WFDB.
2. File Structure
All waveform data must adhere to the following directory structure to facilitate ingestion and processing.
2.1. Example File Structure
.
├── OMOP
| └── waveform_visit_links.csv # File linking visit occurrence and waveform session
├── SUBMISSION.md
└── <person_id> # Patient-specific subdirectory
└── waveforms # "waveforms" folder
├── telemetry # Folder for telemetry waveforms
| └── <session_id> # Session ID folder
| ├── <session_id>.<ext> # High-frequency waveform or annotation file(s)
| └── <session_id>n.csv # Low-frequency "numerics" file
└── eeg # Folder for EEG waveforms
└── <session_id> # Session ID folder
├── <session_id>.<ext> # High-frequency waveform or annotation file(s)
└── <session_id>n.csv # Low-frequency "numerics" file
2.2. Explanation of File Structure
OMOP
- Root directory for your OMOP tables.
waveform_visit_links.csv
- This file links each waveform recording / session to a visit occurrence. Submit this file under your
/OMOP/
folder. See the section on Linking Waveform Sessions to Visits below for details about this file.
- This file links each waveform recording / session to a visit occurrence. Submit this file under your
SUBMISSION.md
- File outlining the details of your submission, as described elsewhere in this SOP. Submit this file at the same level as your
/OMOP/
and<person_id>
folders.
- File outlining the details of your submission, as described elsewhere in this SOP. Submit this file at the same level as your
<person_id>/
:- Root directory for all patient data. Each patient should have a unique identifier (e.g.,
10001
).
- Root directory for all patient data. Each patient should have a unique identifier (e.g.,
waveforms/
:- Patients may have multiple types of data. The
waveforms
folder is intended for waveform data.
- Patients may have multiple types of data. The
telemetry/
:- All telemetry waveforms should be submitted under the telemetry folder. High-frequency waveform signal files and annotation files should be named
<session_id>
followed by the appropriate extension for a signal file or annotation file in the submitted format. Low-frequency "numerics" signals should be submitted in a CSV file with a "n" before the extension.
- All telemetry waveforms should be submitted under the telemetry folder. High-frequency waveform signal files and annotation files should be named
eeg/
:- All EEG waveforms should be submitted under the EEG folder. EEG waveform signal files and annotation files should be named
<session_id>
followed by the appropriate extension for a signal file or annotation file in the submitted format. Low-frequency "numerics" signals should be submitted in a CSV file with a "n" before the extension.
- All EEG waveforms should be submitted under the EEG folder. EEG waveform signal files and annotation files should be named
<session_id>/
:- Session ID indicates a unique recording for the patient. A single patient (i.e. a single
<person_id>
) may have multiple sessions.
- Session ID indicates a unique recording for the patient. A single patient (i.e. a single
3. Preparing Waveforms
Ultimately, all waveform files will be run through the CHoRUS to WFDB converter to ensure consistent handling of signals. Data should be prepared such that it is compatible with the CHoRUS to WFDB converter. If you are manipulating or converting the format of your waveform data prior to submission, please do so in a manner consistent with the guidelines below. We don't recommend manipulating your waveform data beyond deidentifying it per your site requirements for submission to CHoRUS.
3.1. Session Definition
A session (or record in WFDB terminology) represents a single "continous" recording from a patient. One or more sessions may be captured within a single hospital admission. Ideally, each session should contain data continously collected from a single device. Therefore, telemetry and EEG waveforms should be assigned unique session_id
s. However, if detailed device identification is not available, it is acceptable to store all telemetry (or EEG) data for a single visit as one session. A session is sometimes identifiable by a “session ID” that is captured within the recording. Please supply session_id
as an integer.
3.2. Time Shifting
If your EHR, waveform, and imaging data are being date-shifted, ensure that all data for a given subject is shifted by the same duration. See Multimodal-Linkage for more detail.
3.3. Gaps and Overlaps
When reviewing a waveform signal you may notice the following issues:
- Intervals where data is missing, typically due to a network interruption or a disconnected device
- Points where the timestamps of the data appear to jump forward or backward in time despite the data itself being continuous
- Jitter in the timestamps in cases where they were not recorded by the original device
The goal in handling these issues is to maintain integrity of the waveform in order to enable machine analysis. CHoRUS central will attempt to correct for these issues if provided the necessary information as outlined in this SOP. If you have already manipulated your waveform data to handle gaps and overlaps (this is not recommended), please indicate the following in your SUBMISSION.md
(template available here):
- Whether you were able to differentiate between gaps due to missing data and other types of gaps (e.g. a gap due to a clock resync)
- How you modified your data for each gap type
3.4. Standardized Channel Naming
We expect the signal labels for each channel to follow the ISO/IEEE 11073 standards. The signal labels for the requested waveforms can be found here:
As requested below you will need to map each signal from your submitted waveforms to the appropriate signal label provided in the links above. If you have multiple channels in your data which correspond to a single standardized signal label, simply map them as such. For example, if you have two arterial blood pressure signals ART1
, and ART3
, you can map both of them to MDC_PRESS_BLD_ART
. The addition of numeric suffixes to the standardized labels will be handled centrally.
3.5. Waveform Quality
For the waveforms to be useful in research, they have to meet some basic quality standards. In particular, the samples need to be collected often enough and the resolution needs to be precise enough to capture relevant physiological events. The waveform conversion page, with the standardized signal labels mentioned above, also provides a minimum sample rate and resolution for each signal type. This indicates the required sampling rate and resolution when the signal was originally recorded. Do not upsample in an effort to meet the minimum requirements, as that will not produce a higher quality waveform.
3.6. Low Frequency Signals ("Numerics")
Low frequency signals (e.g. heart rate) should be saved as a CSV file in the same folder as the high frequency waveforms. Please do not resample the low frequency signals in an effort to create a uniform time interval. Please note that dates in numerics data must be shifted in manner consistent with the other electronic health record and waveform data being submitted, as described in the Time Shifting section above.
4. Submitting Waveforms
Please submit waveforms in a format that closely resembles the output from your monitor, where possible. When waveform data is submitted with explicit timestamps, we can take advantage of tools which help correct for gaps in an appropriate manner.
4.1. Submit in Preferred Format with a Mapping File
Submissions should either be formatted as pair-based or stream-based:
- Pair-based: one timestamp for each sample
- Stream-based: more than one sample per row with a corresponding "start" time and "end" time for the stream. Please indicate whether the start and end times correspond the beginning or end of the first and last samples.
In your SUBMISSION.md
file, please describe your preferred format and include these details as appropriate:
- A description of each variable (e.g.
StartTime
represents the time at the start of the first sample in the message stream) - The data type for each variable
- Units of each variable, where useful. For the signal data, please provide the units of the amplitude. Also, please provide time units (milliseconds, microseconds, epoch time, etc.) for the variables representing the time of the samples.
- Details on how the data is encoded, if applicable
- Details on whether sample intervals are uniform across waveform types and channels. High-frequency waveforms should have uniform sample intervals, but if this is not the case please explain why. We expect that numerics files may have non-uniform sample intervals.
- Pair-based submissions
- Instructions for how to determine the sample number (or "monotonic time") at the start of each segment / message stream.
- Specification of the timestamp variables that represent the wall-clock time at the start and end of each segment / message stream (including whether they correspond to the start or end of the first / last sample).
Importantly, please submit a mapping file as requested in the next subsection.
4.1.1. Mapping File
Please submit mapping file(s) which will allow us to convert the site specific information from your waveform data into WFDB format. Please submit the file as <site_name>_wf_<telemetry_or_eeg>_mapping.csv
. If your site is submitting EEG waveforms in addition to telemetry waveforms, please provide a separate mapping file for the EEG waveforms.
The mapping file should have the following columns. For each signal in your data (e.g. ECG lead II, PPG, EEG lead FP1, etc.) add a row which supplies this information:
signal_id
: the (integer, or string) variable that indicates signal name in your submitted formatstandardized_label
: the standardized signal label name (string) that yoursignal_id
maps to. See standardized channel naming above for details on how to do this mappingsampling_rate
: the nominal sampling rate for the channel, in Hz (float). Please provide as much precision as is known to you (e.g. 249.987 Hz).units
: a string giving the units of the channel
Pair-based submissions should also have:
timestamp_id
: the (integer, or string) variable that identifies the master timestamp in your data corresponding to the start of a given sample. If your timestamps are implicit (start_time
+i
*sampling_rate
), this should point to the variable which has the start time for the waveform.
Stream-based submissions should also have:
start_time
: the (integer, or string) variable that identifies the start of the data stream for each row. Please remember to indicate (in yourSUBMISSION.md
) whether the timestamp corresponds to the start or end of the first sample in the stream.end_time
: the (integer, or string) variable that identifies the end of the data stream for each row. Please remember to indicate (in yourSUBMISSION.md
) whether the timestamp corresponds to the start or end of the last sample in the stream.
Submit your mapping file(s) to CHoRUS to WFDB converter repo. Provide the mapping file(s) under mappings/<site>/
5. Linking Waveform Sessions to Visits
A table named waveform_visit_links.csv
should be provided under the /OMOP/
folder. This table should have a column for visit_occurrence_id
, visit_detail_id
, and session_id
. If the data collection system does not provide a direct linkage between session_id
and visit_occurrence_id
, it may be necessary to infer this linkage based on the start and end time of the session but this should only be done as a last resort. Keep in mind that the session start and end may not exactly match the admission, discharge, or transfer time in the EHR. See Multimodal-Linkage for more detail.
Please outline the process you used for linking session_id
to visit_detail_id
in your SUBMISSION.md
file.
6. Revision History
Version | Date | Description |
---|---|---|
1.0 | 2024-11-21 | Initial version |
2.0 | 2025-02-10 | Clarified that waveforms should be submitted in preferred format. Updated file structure. |
7. Contact Information
- CHoRUS Waveform Working Group
- Email: Brian Gow
<briangow@mit.edu>
- Email: Brian Gow