Here, you’ll find more information about the paper and models. Detailed information about the code and how to use it is in the GitHub repository. The requirements for the equipment and input data are explained in Requirements.

Paper’s Results


Meta-analysis comparison of PSG- and CSG-based sleep staging

The main takeaway from the paper is that our model performs equivalently to human-scored PSG. To make this comparison, we performed a meta-analysis using the data from 11 human-to-human studies to get the overall and stage-specific kappas that can be expected for humans (i.e., the random-effects estimates). Then, we compared our results against the human performance using non-inferiority testing.
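As a rough illustration of the two steps described above, the sketch below pools per-study kappas with a DerSimonian-Laird random-effects estimator and then runs a one-sided non-inferiority check. This is an assumption-laden sketch, not the paper's analysis code: the choice of the DerSimonian-Laird estimator, the normal approximation (the paper reports a t-test), and the `margin` parameter are all illustrative choices made here.

```python
import math

def random_effects_pool(estimates, std_errors):
    """Pool study estimates with the DerSimonian-Laird random-effects method.

    Returns (pooled_estimate, pooled_standard_error, tau_squared).
    """
    k = len(estimates)
    if k < 2 or k != len(std_errors):
        raise ValueError("need at least two studies with matching standard errors")

    # Fixed-effect (inverse-variance) weights and mean.
    w = [1.0 / se ** 2 for se in std_errors]
    sum_w = sum(w)
    fixed = sum(wi * yi for wi, yi in zip(w, estimates)) / sum_w

    # Cochran's Q and the between-study variance tau^2 (truncated at zero).
    q = sum(wi * (yi - fixed) ** 2 for wi, yi in zip(w, estimates))
    c = sum_w - sum(wi ** 2 for wi in w) / sum_w
    tau2 = max(0.0, (q - (k - 1)) / c)

    # Random-effects weights add tau^2 to each study's variance.
    w_star = [1.0 / (se ** 2 + tau2) for se in std_errors]
    sum_w_star = sum(w_star)
    pooled = sum(wi * yi for wi, yi in zip(w_star, estimates)) / sum_w_star
    return pooled, math.sqrt(1.0 / sum_w_star), tau2

def noninferiority_p_value(model_kappa, human_kappa, se_diff, margin):
    """One-sided p-value for H0: model_kappa - human_kappa <= -margin.

    Uses a normal approximation (an assumption of this sketch).
    `margin` is a hypothetical non-inferiority margin chosen by the analyst.
    """
    z = (model_kappa - human_kappa + margin) / se_diff
    return 0.5 * math.erfc(z / math.sqrt(2.0))
```

A small p-value (for example, below 0.025) would reject the hypothesis that the model is worse than the human estimate by more than the chosen margin.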

The paper provides significantly more detail, including breakdowns by source study, comparisons with EEG-less studies, robustness to noise and other perturbations, and real-time performance.


[Figure: Performance against meta-analysis of human scoring performance]
Forest plots for overall and stage-specific kappas and our model's performance. At the top are the forest plots for the random-effects estimates for the overall and each stage-specific Cohen's kappa (κ) for human-to-human inter-rater agreement on PSG. We list the source studies chronologically, with each mean kappa and CI to the right. Per convention, the gray square for each study represents its weight. The unfilled black diamonds represent the random-effects estimates (width representing each estimate's 95% CI), and the whiskers extending from the diamonds represent the 95% PIs. The black vertical dotted lines also indicate the random-effects estimates. At the bottom are the overall and five-stage kappas of our primary model based on single-lead ECG, evaluated on our testing set. All non-inferiority comparisons using a t-test are significant (i.e., non-inferior; *p < 0.025), except for N3 (not non-inferior).
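For readers unfamiliar with the metric in the figure, here is a minimal sketch of Cohen's kappa for two sets of epoch labels. The one-vs-rest definition of a stage-specific kappa is an assumption of this sketch (it is a common convention, but the paper should be consulted for the exact definition used), and the function names are hypothetical.

```python
from collections import Counter

def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa between two equal-length label sequences."""
    if len(labels_a) != len(labels_b) or len(labels_a) == 0:
        raise ValueError("label sequences must be non-empty and equal in length")
    n = len(labels_a)

    # Observed agreement: fraction of epochs where both scorers agree.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n

    # Chance agreement from each scorer's marginal label frequencies.
    count_a = Counter(labels_a)
    count_b = Counter(labels_b)
    expected = sum(count_a[s] * count_b[s] for s in count_a) / (n * n)

    if expected == 1.0:
        # Both scorers used one identical label throughout; kappa is
        # undefined (0/0), so report perfect agreement by convention.
        return 1.0
    return (observed - expected) / (1.0 - expected)

def stage_kappa(labels_a, labels_b, stage):
    """One-vs-rest kappa for a single stage (assumed convention)."""
    return cohens_kappa(
        [a == stage for a in labels_a],
        [b == stage for b in labels_b],
    )
```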


Data and Rigor

Below are two important details about the diversity of the data and the methodological rigor. However, the paper provides significantly more details.

Data diversity

In addition to achieving expert-level performance, another primary goal was to create the most generalizable model possible. This required training and testing on a diverse pool of subjects, with the current U.S. census as the target. To meet that requirement, we used recordings from five large sleep studies (CCSHS, CFS, CHAT, MESA, WSC; all from NSRR) to build the desired dataset.

[Figure: Statistics of sleep datasets]
(left) We aimed to select subjects (blue lines) to match the U.S. census statistics (gray lines) by age (solid lines) and sex (dashed lines). The lack of subjects in decades 3–5 is a limitation of the available datasets (decade 1 = ages 0–9 years), so subjects were added to other decades to achieve the same mean age as the census data. (right) The 4000 recordings came from five studies, with the distribution of the subjects’ ages in decades shown.

Methodological rigor

We took significant measures to address common methodological shortcomings in the literature. Here are a few points on the rigor:

  • We took all recordings as-is and did not trim wake periods before the subject fell asleep (mean sleep latency = 1.3 ± 1 h.) or after the subject woke up.
  • 4000 recordings were randomly selected from the larger pool of 5718 recordings that met our quality criteria. Those recordings were then randomly assigned to a set, with the target of all three sets having matching distributions of age, sex, and source study.
  • Testing set: 500 unique recordings from unique individuals. No individuals from the validation or training set.
  • Validation set: 500 recordings, but more than one recording is allowed from the same individual. No individuals from the testing or training set.
  • Training set: 3000 recordings, but more than one recording is allowed from the same individual. No individuals from the testing or validation set.
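The subject-level separation described in the bullets above can be sketched as follows. This is a simplified illustration, not the procedure from the paper: it only enforces that no individual appears in more than one set and that the testing set holds one recording per individual, and it does not balance age, sex, or source study. The function name and the `(recording_id, subject_id)` input format are hypothetical.

```python
import random
from collections import defaultdict

def split_by_subject(recordings, n_test, n_val, seed=0):
    """Split (recording_id, subject_id) pairs into train/val/test lists.

    Subjects, not recordings, are assigned to sets, so no individual
    can leak across sets.
    """
    by_subject = defaultdict(list)
    for recording_id, subject_id in recordings:
        by_subject[subject_id].append(recording_id)

    # Shuffle subjects reproducibly.
    subjects = sorted(by_subject)
    random.Random(seed).shuffle(subjects)

    train, val, test = [], [], []
    for subject_id in subjects:
        subject_recordings = by_subject[subject_id]
        if len(test) < n_test:
            # Testing set: exactly one recording per individual;
            # any further recordings from this subject are left unused.
            test.append(subject_recordings[0])
        elif len(val) < n_val:
            # Validation set: repeat recordings allowed, so the final
            # count can slightly exceed n_val.
            val.extend(subject_recordings)
        else:
            train.extend(subject_recordings)
    return train, val, test
```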

Available Models

There are currently three models in the repository that anyone can use to score their own ECG data:

  1. Primary model
    • This model is designed to take an entire night’s worth of sleep and score it all at once.
  2. Real-time model
    • This model is designed to take only the data recorded up until “now” and score it all at once. As each new 30-second epoch of data is recorded, the model can be run again.
  3. NEW Primary model without demographics
    • This model was not in the paper, having been trained just recently.
    • The model works the same way as the primary model, except you do not need to provide the demographics (age and sex) of the subject.
    • There is a slight (<1%) performance impact, as measured with Cohen’s kappa, compared to the original primary model.
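To make the real-time usage pattern concrete, the sketch below re-scores the full recording so far each time another complete 30-second epoch has arrived. The `score_fn` callable is a hypothetical stand-in for the real-time model; the repository's actual loading and inference interface is not shown here and may look quite different.

```python
EPOCH_SECONDS = 30

def stream_scores(samples, sampling_rate_hz, score_fn):
    """Yield score_fn(all data so far) after each completed 30-second epoch.

    `samples` is any iterable of ECG samples arriving in order.
    `score_fn` is a placeholder for the model call (an assumption).
    """
    samples_per_epoch = int(EPOCH_SECONDS * sampling_rate_hz)
    if samples_per_epoch <= 0:
        raise ValueError("sampling rate too low for a 30-second epoch")

    buffer = []
    for sample in samples:
        buffer.append(sample)
        # A new full epoch is available: score everything recorded so far.
        if len(buffer) % samples_per_epoch == 0:
            yield score_fn(buffer)
```

Because the generator hands the growing buffer itself to `score_fn`, a caller that wants to keep the data from each step should copy it inside `score_fn`.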

History and Plans

You can read more about the history and future plans here.

References

Our paper

If you find this repository helpful, please cite the paper:

Additional works cited

Readers and Mentions

Expert-level sleep staging using an electrocardiography-only feed-forward neural network