Skip to main content

Waveform File Format

1. Purpose

This document defines the standard waveform file format for data contributions to the CHoRUS project. The purpose is to ensure consistent formatting of clinical waveform data submitted by participating hospitals. For a high level summary of the process for contributing waveform data to CHoRUS, see the SOP on contributing waveforms.


2. File Structure

All waveform data must adhere to the following directory structure to facilitate ingestion and processing.

2.1. Example File Structure

.
├── <person_id> # Patient-specific subdirectory
│   └── waveforms
│   └── <session_number>
│   ├── <session_number>.hea # Header file for waveform record <session_number>
│   └── <session_number>.dat # Signal file (second segment)
└── RECORDS # List of all record names (one per line)

2.2. Explanation of File Structure

  • <person_id>/:
    • Root directory for all patient data. Each patient should have a unique identifier (e.g., 10001).
  • waveforms/:
    • Patients may have multiple types of data. The waveforms folder is intended for grouping waveforms.
    • Inside each patient directory, files are organized by <session_number>.
  • <session_number>/:
    • Session number indicates a unique recording for the patient. A single patient (i.e. a single <person_id>) may contain multiple sessions.

Submissions in WFDB format:

  • Record files:
    • Each record consists of one or more:
      • .hea file/s: Metadata and configuration of the waveform record.
      • .dat file/s: Binary signal files, one or more per record.

Submissions not in WFDB format:

  • Should follow the naming structure outlined above but stored in the format being provided (e.g. hdf5)

3. Header File (.hea) Format

The WFDB .hea file is a text file containing metadata about the waveform record. It must follow the WFDB standard header format. Below is an example:

3.1. Example Header File: 12345.hea

12345 2 360 650000
12345.dat 16 200/μV 8 0 0 1234 0 ECG1
12345.dat 16 200/μV 8 0 0 1234 0 ECG2
# Age: 65
# Sex: Male
# Diagnosis: Myocardial Infarction

3.2. Explanation of Header Content

  1. First Line (General Record Info):

    • 12345: Record name (must match the filename prefix).
    • 2: Number of signals in the record.
    • 360: Sampling frequency in Hz.
    • 650000: Number of samples per signal.
  2. Subsequent Lines (Signal Descriptions):

    • 12345x.dat: Filename of the signal file.
    • 16: Storage format (16-bit integers).
    • 200/μV: ADC gain (e.g., 200 digital units per microvolt).
    • 8: ADC resolution (bits).
    • 0: ADC zero value.
    • 0: Initial value.
    • 1234: Checksum (sum of all samples modulo 2^16).
    • 0: Block size.
    • ECG1: Signal label (e.g., lead name).
  3. Comments (Optional):

    • Lines starting with # contain additional metadata, such as patient demographics and clinical information.
    • We discourage use of comments. Instead, metadata should be captured in the associated structured EHR data contribution.

4. Tools and Resources


5. Preparing Waveforms in the WFDB Format

All hospitals are strongly encouraged to submit waveform data in the WFDB (WaveForm DataBase) format. The WFDB Python toolbox automatically generates the .hea and .dat file structure when provided with the necessary information.

5.1. Segment Definition

In WFDB, a segment is a contiguous set of signals. Each segment has a .hea file and at least one .dat file. See below for more information on segments.

5.2. Time Shifting

If your EHR, waveform, and imaging data are being date-shifted, ensure that all data for a given subject is shifted by the same duration. See Multimodal-Linkage for more detail.

5.3 Gaps and Overlaps

When reviewing a waveform signal you may notice the following issues:

  • Intervals where data is missing, typically due to a network interruption or a disconnected device
  • Points where the timestamps of the data appear to jump forward or backward in time despite the data itself being continuous
  • Jitter in the timestamps in cases where they were not recorded by the original device

The goal in handling these issues is to maintain integrity of the waveform in order to enable machine analysis. Our suggested approach is to:

  • create a new segment whenever there is missing data. We do not recommend NaN filling except in unusual circumstances when starting new segments results in an unmanageable number of segments. In this case, you may use your discretion to NaN fill up to no more than 5 minutes.
  • concatenate the data if it is continuous
  • annotate each segment with the estimated start and end times, which might not exactly match the manufacturer's sampling rate

5.4. High Frequency Waveforms

High-frequency waveform files (e.g., ECG, EEG) should be saved as WFDB records. Use wfdb.io.wrsamp to write a single-segment record. For multisegment records, create a wfdb.io.MultiRecord object and use its wrsamp method to save the WFDB file.

A new segment should be created if either of the following conditions is met:

  • The length of the current segment exceeds 8 hours
  • The set of channels in the data changes. Specifically:
    • If a channel's data stream disappears or a new channel appears
    • If a channel's data is missing, as discussed in the section on Gaps and Overlaps

The exact sample rate (e.g. 249.97 Hz) needs to be passed when saving files in WFDB to ensure accurate timestamps. The device manufacturer should provide be able to provide the sample rate. While this may not be perfectly accurate, it is good enough initially.

5.4. Low Frequency Signals ("Numerics")

Low frequency signals (e.g. heart rate) should be saved as a CSV file in the same folder as the high frequency waveforms. The CSV file should be named as <session_number>n.csv. Please do not resample the low frequency signals in an effort to create a uniform time interval.

6. Preparing Waveforms in an Alternative Format

As discussed in the SOP on contributing waveforms, the WFDB format is strongly preferred. If providing an alternative format, detailed documentation about the format is required to facilitate conversion to WFDB. The documentation (in SUBMISSION.md) should include:

  • A description of each variable (e.g. StartTime represents the time at the start of the first sample in the message stream)
  • The data type for each variable
  • Units of each variable, if applicable
  • Details on how the data is encoded
  • Instructions for how to determine the sample number (or "monotonic time") at the start of each segment / message stream
  • Specification of the timestamp variables that represent the wall-clock time at the start and end of each segment / message stream
  • Details on whether sample intervals are uniform across waveform types and channels. High-frequency waveforms should have uniform sample intervals, but if this is not the case please explain why. We expect that numerics files may have non-uniform sample intervals.

7. Revision History

VersionDateDescription
1.02024-11-21Initial version

8. Contact Information

  • CHoRUS Waveform Working Group
    • Email: Brian Gow <briangow@mit.edu>

9. References