Skip to main content

Waveform File Format

1. Purpose

This document defines the standard waveform file format for data contributions to the CHoRUS project. The purpose is to ensure consistent formatting of clinical waveform data submitted by participating hospitals. For a high level summary of the process for contributing waveform data to CHoRUS, see the SOP on contributing waveforms.


2. File Structure

All waveform data must adhere to the following directory structure to facilitate ingestion and processing.

2.1. Example File Structure

.
├── <person_id> # Patient-specific subdirectory
│   └── waveforms
│   └── <session_number>
│   ├── <session_number>.hea # Header file for waveform record <session_number>
│   └── <session_number>.dat # Signal file (second segment)
└── RECORDS # List of all record names (one per line)

2.2. Explanation of File Structure

  • <person_id>/:
    • Root directory for all patient data. Each patient should have a unique identifier (e.g., 10001).
  • waveforms/:
    • Patients may have multiple types of data. The waveforms folder is intended for grouping waveforms.
    • Inside each patient directory, files are organized by <session_number>.
  • <session_number>/:
    • Session number indicates a unique recording for the patient. A single patient (i.e. a single <person_id>) may contain multiple sessions.

Submissions in WFDB format:

  • Record files:
    • Each record consists of one or more:
      • .hea file/s: Metadata and configuration of the waveform record.
      • .dat file/s: Binary signal files, one or more per record.

Submissions not in WFDB format:

  • Should follow the naming structure outlined above but stored in the format being provided (e.g. hdf5)

3. Header File (.hea) Format

The WFDB .hea file is a text file containing metadata about the waveform record. It must follow the WFDB standard header format. Below is an example:

3.1. Example Header File: 12345.hea

12345 2 360 650000
12345.dat 16 200/μV 8 0 0 1234 0 ECG1
12345.dat 16 200/μV 8 0 0 1234 0 ECG2
# Age: 65
# Sex: Male
# Diagnosis: Myocardial Infarction

3.2. Explanation of Header Content

  1. First Line (General Record Info):

    • 12345: Record name (must match the filename prefix).
    • 2: Number of signals in the record.
    • 360: Sampling frequency in Hz.
    • 650000: Number of samples per signal.
  2. Subsequent Lines (Signal Descriptions):

    • 12345x.dat: Filename of the signal file.
    • 16: Storage format (16-bit integers).
    • 200/μV: ADC gain (e.g., 200 digital units per microvolt).
    • 8: ADC resolution (bits).
    • 0: ADC zero value.
    • 0: Initial value.
    • 1234: Checksum (sum of all samples modulo 2^16).
    • 0: Block size.
    • ECG1: Signal label (e.g., lead name).
  3. Comments (Optional):

    • Lines starting with # contain additional metadata, such as patient demographics and clinical information.
    • We discourage use of comments. Instead, metadata should be captured in the associated structured EHR data contribution.

4. Tools and Resources

  • WFDB Conversion Tools:

5. Preparing Waveforms in the WFDB Format

All hospitals are strongly encouraged to submit waveform data in the WFDB (WaveForm DataBase) format. The WFDB Python toolbox automatically generates the .hea and .dat file structure when provided with the necessary information.

5.1. Segment Definition

In WFDB, a segment is a contiguous set of signals. Each segment has a .hea file and at least one .dat file.

5.2. Time Shifting

If your EHR, waveform, and imaging data are being date-shifted, ensure that all data for a given subject is shifted by the same duration. See Multimodal-Linkage for more detail.

5.3. High Frequency Waveforms

High-frequency waveform files (e.g., ECG, EEG) should be saved as WFDB records. Use wfdb.io.wrsamp to write a single-segment record. For multisegment records, create a wfdb.io.MultiRecord object and use its wrsamp method to save the WFDB file.

A new segment should be created if either of the following conditions is met:

  • The length of the current segment exceeds 8 hours
  • The set of channels in the data changes. Specifically:
    • If a channel's data stream disappears or a new channel appears
    • If a channel's data is missing for more than 5 minutes. For gaps of less than 5 minutes, fill the gap with NaNs

5.4. Low Frequency Signals ("Numerics")

Low frequency signals (e.g. heart rate) should be saved as a CSV file in the same folder as the high frequency waveforms. The CSV file should be named as <session_number>n.csv. Please do not resample the low frequency signals in an effort to create a uniform time interval.

6. Preparing Waveforms in an Alternative Format

As discussed in the SOP on contributing waveforms, the WFDB format is strongly preferred. If providing an alternative format, detailed documentation about the format is required to facilitate conversion to WFDB. The documentation should include:

  • A description of each variable (e.g. StartTime represents the time at the start of the first sample in the stream)
  • The data type for each variable
  • Units of each variable, if applicable
  • Details on how the data is encoded
  • Instructions for calculating the sample interval (e.g. EndTime - StartTime / # samples)
  • An indication of whether the sample intervals are uniform accross waveform types and channels.

7. Revision History

VersionDateDescription
1.02024-11-21Initial version

8. Contact Information

  • CHoRUS Waveform Working Group
    • Email: Brian Gow <briangow@mit.edu>

9. References