Early identification of stroke through deep learning with... : Neural Regeneration Research (2024)

Introduction

Stroke is a serious condition that damages brain cells and can cause permanent disability or death, and it is a leading cause of death and disability worldwide (Johnston et al., 2009; World Health Organization, 2014). There are two primary categories of stroke: ischemic and hemorrhagic (Grysiewicz et al., 2008). Ischemic stroke treatment is time-sensitive: intravenous thrombolysis is currently limited to 4.5 hours after symptom onset (Hacke et al., 2008), and endovascular thrombectomy to up to 24 hours (Nogueira et al., 2018). The benefits of treatment decline over time, so prompt medical attention is critical to minimizing the effects of a stroke (Marler et al., 2000; Fonarow et al., 2011). Unfortunately, pre-hospital delays in stroke diagnosis often occur because emergency medical services and general practitioners fail to identify a stroke during the first patient assessment (Bouckaert et al., 2009), constituting a missed opportunity for timely medical treatment.

Early stroke diagnosis usually relies on a combination of image analysis and clinical symptom detection. The gold standard for diagnosing acute ischemic stroke is diffusion-weighted imaging, but this is not always available in emergency rooms. Non-contrast computed tomography is a more accessible alternative, but its sensitivity to the signs of stroke is relatively low (Kim-Tenser et al., 2021; Cai et al., 2022). During clinical symptom assessments, emergency medical services personnel are advised to use simple instruments such as the Cincinnati Pre-hospital Stroke Scale (CPSS; Kothari et al., 1999) and the Face Arm Speech Test (FAST; Harbison et al., 2003). The CPSS assesses facial palsy, motor weakness, and speech disturbance. The FAST builds upon the CPSS by retaining three essential components (facial weakness, arm weakness, and speech disturbance) and simplifying the speech evaluation, thereby streamlining the assessment process. Although these scales are widely used for early stroke screening in emergency settings, they may fail to detect subtle motor deficits. Hence, more precise and sensitive methods for diagnosing stroke are needed, especially methods that can be conveniently used by emergency medical services personnel in emergency situations.

In recent years, many researchers have used deep learning to construct systems that assist clinicians in diagnosing patients, with the goal of reducing the workload of medical staff (Wen et al., 2020; Yao et al., 2020; Uemura et al., 2021; Aberathne et al., 2023; Khaliq et al., 2023; Daidone et al., 2024). Given the importance of face and arm movements in the FAST and CPSS scales, recent developments in motion analysis are particularly relevant to improving stroke diagnosis (Feichtenhofer et al., 2018; Tran et al., 2019; Feichtenhofer, 2020). However, these methods are limited in that their models use supervised learning to obtain classification-specific knowledge, which is difficult to transfer to the stroke diagnosis domain. Moreover, current video classification frameworks are unable to process audio data from patient speech, and most multi-modal methods treat information from different modalities as equally important (Liang et al., 2015; Zhou et al., 2020a, b), even when the modalities differ in accuracy or relevance. For stroke diagnosis, recent studies have proposed multi-modal deep learning frameworks based on the CPSS and FAST scales that assess the presence of stroke according to facial motion weakness and speech inability (Yu et al., 2020; Cai et al., 2022). However, these systems focus solely on facial motion as the video input and do not consider limb motion. Additionally, they treat information from different modalities as equally important. Because patient audio data contain a large amount of irrelevant information, naively combining modalities may introduce noise, making it difficult to effectively use the valuable information contained in patient motion videos. Consequently, it is essential to develop multi-modal techniques that can effectively process diverse data modalities with varying levels of importance.

In this study, we tested an end-to-end multi-modal model that integrates information across modalities of differing importance. The information was obtained through stroke classification tasks using video contrastive learning (Dave et al., 2022) and an asymmetric information fusion module. The model had three components: a video module for extracting patient motion features, an audio module for extracting patient speech features, and a multi-modal data fusion module. In the video module, we leveraged video contrastive learning to pre-train an encoder and obtain a motion-specific representation. For the audio module, we used VGGish (Hershey et al., 2017), a pre-trained convolutional neural network, as an acoustic feature extractor for patient speech audio. In the multi-modal information fusion module, we adopted the asymmetrical multi-modal attention (AMMA) mechanism (Wang et al., 2021) to fuse the information from the two modalities in an asymmetric way. The models were assessed via five-fold cross-validation. Through this experimental investigation, our goal was to accurately evaluate the effectiveness and constraints of this multi-modal diagnostic system with respect to stroke diagnosis, providing evidence for its potential in future clinical applications.

Methods

Participants

This was a cross-sectional observational study of data collected from stroke patients who were admitted within 7 days of symptom onset. Our study sample included all patients admitted to Guangdong Provincial People’s Hospital between June 2021 and May 2023. The inclusion criteria were the presence of slight or severe stroke symptoms and availability for assessment. The exclusion criterion was inability to cooperate with a neurological examination. In line with previous studies (Yu et al., 2020; Cai et al., 2022), we recruited 132 stroke patients and 121 healthy controls who reported no history of stroke. The healthy controls comprised hospital staff members as well as patients who had not been diagnosed with a stroke. All participants provided informed consent. The research protocol was approved by the Institutional Review Board of Guangdong Provincial People’s Hospital (approval No. KY-Z-2021-431-03 on May 20, 2021 and KY-Z-2021-431-05 on July 4, 2023). This study was performed in accordance with the principles of the Declaration of Helsinki and is reported according to the STrengthening the Reporting of OBservational Studies in Epidemiology (STROBE) guidelines (von Elm et al., 2007; Additional file 1).

Experimental design

This experiment was divided into two phases: the development phase and the validation phase. The participants were randomly assigned to nonoverlapping training (80%) or validation (20%) sets. In the development phase, we used the training set to train the deep learning model. In the validation phase, we employed five-fold cross-validation, wherein the dataset was randomly divided into five subsets. Four subsets were used for training, and one was used for validation. The results were averaged for evaluation, ensuring a comprehensive assessment of the model’s accuracy and reliability.
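
For illustration, a minimal sketch of this five-fold protocol is given below; train_model and evaluate_model are hypothetical placeholders for the actual training and evaluation pipeline, and the stratified splitter is an assumption about how the folds could be drawn.

```python
# Illustrative sketch of the five-fold cross-validation protocol described above.
# train_model / evaluate_model are hypothetical stand-ins for the real pipeline.
import numpy as np
from sklearn.model_selection import StratifiedKFold

def cross_validate(samples, labels, n_splits=5, seed=0):
    """Average a validation metric over five stratified folds."""
    skf = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=seed)
    scores = []
    for train_idx, val_idx in skf.split(samples, labels):
        model = train_model([samples[i] for i in train_idx],
                            [labels[i] for i in train_idx])          # hypothetical
        scores.append(evaluate_model(model,
                                     [samples[i] for i in val_idx],
                                     [labels[i] for i in val_idx]))  # hypothetical
    return float(np.mean(scores))
```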

Our experiment falls into the category of experimental research, with the objective of developing and validating a new diagnostic tool or method. We therefore adopted a randomized controlled design, which allowed us to evaluate the effectiveness and accuracy of the deep learning models in comparison with other diagnostic methods. Specifically, we established a randomized control group in which diagnoses were conducted using the deep learning techniques.

Dataset construction

The clinical data for this study were collected at the Guangdong Provincial People’s Hospital by physicians and caregivers as part of this stroke diagnosis study. The study subjects were patients who visited the Department of Emergency Medicine with slight or severe symptoms of a stroke, as well as a group of healthy controls. To protect the patients’ personal information, only necessary information was collected for this study.

During the dataset collection, each subject was instructed to perform four different types of motions (hand lifting, leg lifting, pointing at the nose, and facial movements). Audio recordings were also made of patients repeating everyday phrases, with the purpose of evaluating the presence of language or speech impairments. These activities were recorded in the form of four videos and one audio file, as illustrated in Figure 1. These tasks were designed to enable the model to evaluate patient speech and limb ability in terms of both cognition and motion. In this study, the function of the model was to perform binary classification to distinguish between stroke and non-stroke subjects, which is sufficient for early stroke diagnosis.
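
For clarity, a minimal sketch of how one subject’s recordings and label could be organized for this binary task is shown below; the field and file names are illustrative assumptions, not the actual data layout used by the authors.

```python
# Illustrative organization of one subject's recordings for the binary
# stroke / non-stroke task. Field and file names are assumptions.
from dataclasses import dataclass
from typing import List

@dataclass
class SubjectRecord:
    subject_id: str
    video_paths: List[str]   # hand lifting, leg lifting, nose pointing, facial movement
    audio_path: str          # repetition of everyday phrases
    label: int               # 1 = stroke, 0 = non-stroke

record = SubjectRecord(
    subject_id="example_001",                                  # hypothetical ID
    video_paths=["hand.mp4", "leg.mp4", "nose.mp4", "face.mp4"],
    audio_path="speech.wav",
    label=1,
)
```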

Development of the multi-modal transformer algorithm

Video features extraction network

We used a contrastive learning method (He et al., 2020) to pre-train an encoder that extracted action features from the patient videos. Specifically, we followed a contrastive representation learning approach for video feature extraction (Yao et al., 2021), pre-training our visual encoder via three pretext tasks: an inter-frame instance discrimination task, an intra-frame instance discrimination task, and a temporal order validation task. We used ResNet50 (He et al., 2016), a deep residual neural network, as the backbone of our video contrastive learning approach. Similar to previous work (Chen et al., 2020), we added a multi-layer perceptron (MLP) head for each video contrastive learning task; these heads were not involved in the downstream classification task. The image patch size (H × W × C) was set to 224 pixels × 224 pixels × 3 channels, and the MLP head projected the global features into a 128-dimensional embedding. We set the momentum coefficient α to 0.999 and the temperature τ of the InfoNCE loss (He et al., 2020) to 0.1. InfoNCE is a contrastive loss for unsupervised learning that encourages agreement between positive pairs relative to negative samples, effectively maximizing a lower bound on their mutual information, and it served as the basis for model training. During contrastive learning, we optimized the parameters of the ResNet50 encoder using stochastic gradient descent with a learning rate of 0.2. The network was trained for 300 epochs starting from the ImageNet pre-trained model. After contrastive learning, we used the ResNet50 encoder without the MLP head to obtain 2048-dimensional patient action features. We then concatenated the action features of the four types of action videos into an action feature sequence, which was submitted to a transformer to obtain the final stroke-related action features.
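
As a concrete illustration, the sketch below shows the generic InfoNCE core of this pre-training stage with the hyper-parameters reported above (temperature τ = 0.1, momentum α = 0.999, 128-dimensional MLP head on a ResNet50 backbone). It is a minimal sketch under these assumptions rather than the authors’ full implementation, and it omits the three video-specific pretext tasks; all function and variable names are illustrative.

```python
# Minimal sketch of InfoNCE with a momentum (key) encoder, using the reported
# hyper-parameters (tau = 0.1, alpha = 0.999, 128-d head on ResNet50).
import torch
import torch.nn as nn
import torch.nn.functional as F
import torchvision

def build_encoder(embed_dim=128):
    """ResNet50 backbone followed by a 2-layer MLP projection head."""
    backbone = torchvision.models.resnet50(pretrained=True)   # ImageNet initialization
    feat_dim = backbone.fc.in_features                        # 2048
    backbone.fc = nn.Sequential(
        nn.Linear(feat_dim, feat_dim), nn.ReLU(inplace=True),
        nn.Linear(feat_dim, embed_dim),
    )
    return backbone

query_encoder = build_encoder()
key_encoder = build_encoder()
key_encoder.load_state_dict(query_encoder.state_dict())
for p in key_encoder.parameters():
    p.requires_grad = False                                    # keys come from the momentum branch

def momentum_update(q_enc, k_enc, alpha=0.999):
    """k = alpha * k + (1 - alpha) * q, as in momentum contrast."""
    for pq, pk in zip(q_enc.parameters(), k_enc.parameters()):
        pk.data.mul_(alpha).add_(pq.data, alpha=1.0 - alpha)

def info_nce(q, k, queue, tau=0.1):
    """InfoNCE loss: positives are matched (q, k) pairs, negatives come from a queue (K x C)."""
    q, k = F.normalize(q, dim=1), F.normalize(k, dim=1)
    l_pos = torch.einsum("nc,nc->n", q, k).unsqueeze(-1)       # N x 1 positive logits
    l_neg = torch.einsum("nc,kc->nk", q, queue)                # N x K negative logits
    logits = torch.cat([l_pos, l_neg], dim=1) / tau
    labels = torch.zeros(logits.size(0), dtype=torch.long)     # positive is always index 0
    return F.cross_entropy(logits, labels)
```

In this formulation the query encoder is trained by backpropagation while the key encoder is only updated through the momentum rule, which keeps the negative queue consistent across iterations.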

Audio feature extraction network

Because raw speech audio cannot be efficiently encoded directly, we used a fast Fourier transform to obtain an audio spectrogram, which provides a visual representation of the frequencies present in the audio signal. To facilitate the use of convolutional neural networks, we chose the Log-Mel spectrogram to represent the audio data. This representation is based on a logarithmic scale that closely mimics human hearing, enabling a natural representation of audio data (Furui, 1986), and it is robust to variations in audio features such as loudness and pitch. To create the Log-Mel spectrogram, we utilized the Mel scale (Furui, 1986), which allows an audio spectrogram to be transformed into a Log-Mel spectrogram. Because the Log-Mel spectrogram resembles an image, deep learning models used in image analysis can be adapted for audio analysis. One such model is the VGGish network, which can extract useful acoustic features from speech audio. This model leverages an existing VGG architecture pre-trained on the YouTube-100M dataset (https://netsg.cs.sfu.ca/youtubedata/) to capture temporal information from audio signals. The audio data were initially converted to a single channel at a 44,100 Hz sampling rate and subsequently divided into non-overlapping 960-ms clips. To extract relevant features from the audio data, we performed a fast Fourier transform with a 25-ms window and a 10-ms step. We then applied a Log-Mel transform to each audio clip, resulting in a sequence of spectrograms with 64 Mel-spaced frequency bins and a size of 100 × 64. Finally, these spectrogram sequences were fed into a pre-trained VGGish network, yielding a 128-dimensional embedding vector for each audio clip.
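
To make the audio front end concrete, the following minimal sketch computes 960-ms Log-Mel patches with the parameters described above (44,100 Hz, 25-ms window, 10-ms step, 64 Mel bins). The function and constant names are illustrative assumptions, and the final VGGish embedding step is only indicated in a comment because the exact VGGish port used is not specified in the text.

```python
# Minimal sketch of the Log-Mel front end described above (assumed names;
# the authors' preprocessing may differ in detail).
import numpy as np
import librosa

SR = 44100                      # mono, 44.1 kHz as described in the text
WIN = int(0.025 * SR)           # 25-ms analysis window
HOP = int(0.010 * SR)           # 10-ms step
CLIP = int(0.960 * SR)          # non-overlapping 960-ms clips
N_MELS = 64                     # 64 Mel-spaced frequency bins

def log_mel_clips(path):
    """Split a recording into 960-ms clips and return one Log-Mel patch per clip."""
    y, _ = librosa.load(path, sr=SR, mono=True)
    patches = []
    for start in range(0, len(y) - CLIP + 1, CLIP):
        clip = y[start:start + CLIP]
        mel = librosa.feature.melspectrogram(
            y=clip, sr=SR, n_fft=WIN, hop_length=HOP, n_mels=N_MELS)
        patches.append(np.log(mel + 1e-2).T)      # (frames, 64) Log-Mel patch
    return np.stack(patches)                       # (n_clips, frames, 64)

# Each patch is then fed to a pre-trained VGGish network (Hershey et al., 2017),
# which maps it to a 128-dimensional embedding; any public VGGish port
# (torch.hub or TensorFlow releases of the original model) could be used here.
```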

Multi-modal feature fusion

The core function of our model was to effectively fuse the above video and audio features. In our multi-modality fusion module, we adopted a transformer-based encoder (Vaswani et al., 2017) consisting of a linear projection layer, two layer-normalization layers (Ba et al., 2016), the AMMA module (Wang et al., 2021), and an MLP layer. The AMMA is based on the self-attention mechanism; through this mechanism, it automatically learns the weights for features from the two modalities, eliminating the need for tedious hyperparameter tuning. In contrast to the original attention mechanism, in which every element in the sequence is influenced by all other elements in a fully connected, graph-like structure, the AMMA fuses information from the two modalities in an asymmetric manner. This enables the model to learn correlations and interactions solely within the video representation while reducing the influence of noisy audio information. The multi-modal transformer in this study consisted of 4 layers, with the number of heads in the multi-headed attention mechanism set to 4. Both the audio and video features were aligned to a 512-dimensional embedding and concatenated as the input of the transformer. During the training process, we utilized the Adam optimizer (Kingma and Ba, 2014), an adaptive optimization algorithm well suited for large-scale training of deep neural networks. The learning rate was set to 0.0001, with β1 set to 0.9, β2 set to 0.999, and weight decay set to 0.00008.
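
The sketch below illustrates one possible reading of this asymmetric fusion, in which only the video tokens act as attention queries while the concatenated video and audio tokens serve as keys and values. It is a hedged approximation of the AMMA idea with illustrative layer and variable names, not the authors’ exact implementation.

```python
# Hedged sketch of one asymmetric fusion layer in the spirit of AMMA
# (Wang et al., 2021): video tokens query the concatenated video + audio
# sequence, so audio can inform the video representation, but the noisier
# audio tokens are never updated as queries themselves. Names are illustrative.
import torch
import torch.nn as nn

class AsymmetricFusionLayer(nn.Module):
    def __init__(self, dim=512, heads=4, mlp_ratio=4):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm2 = nn.LayerNorm(dim)
        self.mlp = nn.Sequential(
            nn.Linear(dim, dim * mlp_ratio), nn.GELU(),
            nn.Linear(dim * mlp_ratio, dim))

    def forward(self, video_tokens, audio_tokens):
        # Queries: video only. Keys/values: video and audio concatenated.
        context = torch.cat([video_tokens, audio_tokens], dim=1)
        q = self.norm1(video_tokens)
        kv = self.norm1(context)
        fused, _ = self.attn(q, kv, kv)
        x = video_tokens + fused
        x = x + self.mlp(self.norm2(x))
        return x                                   # updated video representation

# Example: 4 action tokens and a sequence of audio-clip tokens, both projected
# to 512-d before fusion (the paper reports 4 such layers with 4 heads).
video = torch.randn(1, 4, 512)
audio = torch.randn(1, 10, 512)
layer = AsymmetricFusionLayer()
out = layer(video, audio)                          # shape: (1, 4, 512)
```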

Experiment platform

The models were trained using PyTorch 1.4.0 (Facebook Artificial Intelligence Research, Menlo Park, CA, USA) on Ubuntu 18.04 (Canonical Ltd., London, UK) with an Intel Xeon Gold 6242 CPU (Intel Corporation, Santa Clara, CA, USA) and an RTX 3090 GPU (NVIDIA Corporation, Santa Clara, CA, USA). The video and audio data were obtained using mobile phones at Guangdong Provincial People’s Hospital.

Baseline methods

As baselines, we selected six prior models that have achieved good performance on action classification tasks: I3D (Carreira and Zisserman, 2017), SlowFast (Feichtenhofer et al., 2018), X3D (Feichtenhofer, 2020), TPN (Yang et al., 2020), TimeSformer (Bertasius et al., 2021), and MViT (Fan et al., 2021). I3D, SlowFast, and X3D apply 3D convolutional networks to video processing, while TPN exploits temporal pyramid structures over video sequences. In recent studies, TimeSformer and MViT, which employ transformer-based architectures (Vaswani et al., 2017), have shown remarkable efficacy; among these methods, MViT has achieved state-of-the-art accuracy on various video recognition tasks. These methods were used to compute baselines for a comparative evaluation against our proposed approach.

Results

Table 1 shows the evaluation metrics for our model and the baseline models, and Figure 2 shows a plot of the receiver operating characteristic curves for the baseline models after processing our stroke dataset. The results show that our proposed multi-modal stroke assessment model outperformed all state-of-the-art methods, achieving the highest area under the receiver operating characteristic curve (AUC) value for stroke diagnosis prediction. Moreover, the multi-modal model outperformed its single-module variants, highlighting the benefit of utilizing multiple types of patient action videos and speech audio. Both the video module and audio module achieved an AUC level of around 80%, indicating that these modules have clinical potential in terms of using patient action videos or speech audio to predict the probability of stroke. Additionally, Table 1 reveals that the video module alone produced higher metric values than the audio module alone, suggesting that the action feature was more valuable than the (potentially noisy) audio data.

Discussion

Analysis of results

Comparison with other state-of-the-art methods revealed that our video module outperformed several previous models. This result underscores the effectiveness of our approach, which leverages contrastive learning methods to generate high-quality features for downstream stroke prediction. Moreover, the collaborative use of our video and audio modules yielded significant improvements in performance compared with each modality on its own. This finding suggests that the two modalities contain discriminative information that is complementary when used together. Our model was able to learn and integrate these two types of information effectively, resulting in improved accuracy and reliability for stroke diagnosis.

Based on the results presented above, we discuss several key factors that contributed to the improvements of our model. First, our video module benefited from contrastive learning, which enabled it to obtain motion-specific representations that are highly suitable for stroke diagnosis. These representations take advantage of the inherent structures and correlations between video frames, resulting in more accurate and informative representations. Second, instead of relying solely on video data, we used the FAST and CPSS scales and incorporated the patient speech audio using a pre-trained VGGish network to create a multi-modal stroke diagnosis framework. This approach enabled us to capture a wider range of diagnostic information, improving the accuracy and reliability of the model. Finally, the use of the AMMA mechanism allowed our model to effectively address the acoustic noise in the audio feature. Specifically, this mechanism enabled our model to leverage the valuable information contained in both the action and acoustic features while preventing acoustic noise from interfering with the other features. Overall, our model was able to accurately diagnose stroke, and thus could provide valuable insights to clinicians.

Visualization analysis

We performed a visualization analysis using class activation maps (Zhou et al., 2016) to demonstrate the effectiveness of the action features extracted by our video modules via contrastive learning. We generated gradient-based class activation map visualizations using the output features of the video modules, as shown in Figure 3. In the first case, although both models produced the correct decision, the supervised learning model focused on the patient’s white clothes to determine whether they had experienced a stroke, whereas our contrastive learning model focused on the action-related limbs to make its decision. In the second case, our model accurately identified that the patient did not have a stroke based on their motion features, but the supervised model mistakenly classified the patient as having had a stroke because it overemphasized the hospital bed in the patient’s surroundings.
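
For readers who wish to reproduce this kind of visualization, the sketch below computes a standard class activation map (Zhou et al., 2016) for a ResNet50-style frame encoder with a linear classification head. The layer and attribute names follow the torchvision ResNet layout and are assumptions rather than the authors’ exact visualization code.

```python
# Hedged sketch of a class activation map for a ResNet50 frame encoder with a
# 2-way linear head; layer names assume the standard torchvision ResNet layout.
import torch
import torch.nn.functional as F
import torchvision

model = torchvision.models.resnet50(pretrained=False, num_classes=2).eval()

def class_activation_map(frame, target_class):
    """frame: (1, 3, 224, 224) tensor; returns a 224 x 224 heatmap for target_class."""
    feats = {}
    hook = model.layer4.register_forward_hook(
        lambda m, i, o: feats.__setitem__("maps", o))    # last conv feature maps
    with torch.no_grad():
        model(frame)
    hook.remove()
    maps = feats["maps"][0]                               # (2048, 7, 7)
    weights = model.fc.weight[target_class]               # (2048,) classifier weights
    cam = torch.einsum("c,chw->hw", weights, maps)        # weighted sum over channels
    cam = F.relu(cam)
    cam = (cam - cam.min()) / (cam.max() - cam.min() + 1e-8)
    return F.interpolate(cam[None, None], size=frame.shape[-2:],
                         mode="bilinear", align_corners=False)[0, 0]
```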

Our interpretation of these results is that supervised learning allowed the model to pick up on patterns during training so that each class was tightly grouped in the feature space, and the patients could be classified into their target classes. However, the model may overemphasize a spurious image pattern for out-of-domain samples, leading to supervision collapse. For instance, as shown in Figure 4, the model with supervised learning might learn that the subjects wearing white clothes are more likely to have had a stroke. However, this is irrelevant to stroke diagnosis and could result in misdiagnosis.

In summary, compared with previous approaches, the proposed model shows substantial superiority in extracting action-related features for stroke diagnosis. The proposed model learned the motion-specific features of the subjects through label-free contrastive learning tasks, which encouraged the framework to focus on the subjects’ specific actions. The proposed model could therefore learn action features effectively, enabling accurate prediction of stroke diagnosis.

Ablation analysis

We analyzed the effectiveness and importance of the four types of patient action videos in predicting stroke. The input of the single-modal action classification model was restricted to one type of action video feature at a time. Additionally, to further investigate the performance of the multi-modal stroke prediction model, we combined each type of action video feature with the acoustic audio feature using the multi-modal fusion module.

We summarize the results for each type of action in Figure 4. The results show that the features of the nose pointing task achieved the highest performance, both for the single-modal and multi-modal approaches, while the lower limb task had the worst performance. These results suggest that the nose pointing task may have had a larger influence on the stroke diagnosis prediction of our model, while the lower limb motion task had a low impact. We hypothesize that comprehensive motion like that involved in the nose pointing task played a stronger role in the stroke diagnosis performance of our model, while the stroke-related changes in lower limb motion might have been more difficult to detect in this task setting because these abnormalities could be relatively subtle. After comparing each type of action with the multi-modal data, we hypothesize that all types of action could leverage the information from the acoustic features of patient speech to obtain higher performance.

General discussion

The early symptoms of stroke are often not obvious, and patients frequently do not notice them, which makes it easy to confuse stroke with other diseases. By the time more prominent symptoms appear, the time left to treat the patient is often just a few hours. If stroke is not diagnosed and treated promptly, it can lead to severe complications and even permanent neurological damage. In most areas of China, public awareness of stroke prevention is poor, and many people are unable to identify the early symptoms of stroke (Tu et al., 2022). Additionally, some areas in China face a shortage of medical resources, such that medical professionals cannot meet patients’ needs for timely consultation or regular screening. An artificial intelligence-based model for early detection of stroke, using a multi-modal feature fusion transformer method, could be especially helpful in these settings. Such systems could provide an intuitive warning to patients based on the FAST scale.

The transformer-based multi-modal stroke detection model proposed in this study achieved an average AUC of 88.2% and an average accuracy of 82.6%. However, there were several limitations to our study. First, the study sample was small, which may mean that the variety of stroke patient types was low; this could lead to inconsistent early screening results for different types of stroke. The limited amount of video and audio data also affected the quality of the training and testing data, so there is significant room for improving the generalization ability of the artificial intelligence model (LeCun et al., 2015). Second, stroke patients can be divided into anterior cerebral ischemic syndrome and posterior cerebral ischemic syndrome types; for the latter, the commonly used stroke scales cannot fully reflect the signs and symptoms, even with medical imaging. In future work, these problems could be addressed by increasing the amount and types of data used to train the model (Radford et al., 2021).

Limitations

There were several limitations to the present study that require addressing in future work. Firstly, the limited number of training samples prevented us from training a more efficient neural network, as deep learning models have a large number of parameters, making it difficult to train them with a small dataset. Secondly, the computational complexity of the entire pipeline makes it impractical for mobile deployment of this system. In situations with limited network bandwidth, the transmission of collected data can become a significant bottleneck, hindering the efficiency of the screening process. Future research should continue collecting samples to expand the training dataset and also focus on making lightweight modifications to the model framework to achieve improved clinical validation results.

Considerations for clinical applications

One of the key features of our approach is the use of video and audio data as inputs for the artificial intelligence model. Our system is advantageous because the cost of recording audio and video is low, the procedure is simple, and the penetration rate of smartphones in China is high, making home use by patients feasible. Our early deep learning recognition model for stroke could serve as an early screening tool for stroke. Patients could perform self-checks, especially in underdeveloped regions and areas where medical resources are scarce. Our model could help provide convenient and low-cost early screening, thus promoting early detection, diagnosis, and treatment of disease. For doctors, this model could help to alleviate some of the pressure associated with the demands of screening many patients. Finally, the cost of thrombolytic therapy for stroke is high, and given China’s large population and high incidence, early detection, diagnosis, and prevention could reduce some of the burden on the national medical insurance system.

Conclusion

The main contributions of this work are summarized as follows: 1) we proposed a novel end-to-end deep learning model for multi-modal stroke diagnosis and constructed a novel multi-modal stroke diagnosis dataset to validate its performance; 2) we addressed the influence of noisy information from audio by using the AMMA to integrate video and audio information in an asymmetrical way; 3) we enhanced the efficiency of transferring representations to the stroke domain through video contrastive learning. Although there are certain limitations in terms of data volume and deployment difficulty, we can overcome these challenges by collecting more data and using more advanced models. In conclusion, we have developed an artificial intelligence model that significantly improves the accuracy of stroke diagnosis. This innovation is still effective even in settings with limited resources. By employing a multi-modal approach, we have achieved a substantial improvement in diagnostic precision when compared to conventional methods. This innovation represents a promising advancement in stroke diagnostics, providing medical professionals with more accurate tools for assessing strokes.

Author contributions: Study design: HW and WL; conception and supervision: WL and XL; investigation: ZO, HW, and HL; data collection: BZ, BH, LR, YL, YZ and CD; manuscript drafting: ZO and HW. All authors read and approved the final manuscript.

Conflicts of interest: The authors declare that they have no conflicts of interest.

Data availability statement: All relevant data are within the paper and its Additional files.

Additional file:

Additional file 1: STROBE checklist.

C-Editor: Zhao M; S-Editors: Yu J, Li CH; L-Editors: Yu J, Song LP; T-Editor: Jia Y

References

Aberathne I, Kulasiri D, Samarasinghe S (2023) Detection of Alzheimer’s disease onset using MRI and PET neuroimaging: longitudinal data analysis and machine learning. Neural Regen Res 18:2134–2140.

Ba JL, Kiros JR, Hinton GE (2016) Layer Normalization. arXiv:1607.06450.

Bertasius G, Wang H, Torresani L (2021) Is space-time attention all you need for video understanding? arXiv:2102.05095.

Cai T, Ni H, Yu M, Huang X, Wong K, Volpi J, Wang JZ, Wong STC (2022) DeepStroke: an efficient stroke screening framework for emergency rooms with multimodal adversarial deep learning. Med Image Anal 80:102522.

Carreira J, Zisserman A (2017) Quo vadis, action recognition? A new model and the kinetics dataset. arXiv:1705.07750.

Chen T, Kornblith S, Norouzi M, Hinton G (2020) A simple framework for contrastive learning of visual representations. In: Proceedings of the 37th International Conference on Machine Learning (Hal D III, Aarti S, eds), pp 1597–1607. Proceedings of Machine Learning Research: PMLR.

Daidone M, Ferrantelli S, Tuttolomondo A (2024) Machine learning applications in stroke medicine: advancements, challenges, and future prospectives. Neural Regen Res 19:769–773.

Dave I, Gupta R, Rizve MN, Shah M (2022) TCLR: Temporal contrastive learning for video representation. Comput Vis Image Underst 219:103406.

Fan H, Xiong B, Mangalam K, Li Y, Yan Z, Malik J, Feichtenhofer C (2021) Multiscale vision transformers. arXiv:2104.11227.

Feichtenhofer C (2020) X3D: expanding architectures for efficient video recognition. In: 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp 200–210.

Feichtenhofer C, Fan H, Malik J, He K (2018) SlowFast networks for video recognition. In: IEEE/CVF International Conference on Computer Vision, pp 6201–6210. Seoul, Korea.

Fonarow GC, Smith EE, Saver JL, Reeves MJ, Bhatt DL, Grau-Sepulveda MV, Olson DM, Hernandez AF, Peterson ED, Schwamm LH (2011) Timeliness of tissue-type plasminogen activator therapy in acute ischemic stroke: patient characteristics, hospital factors, and outcomes associated with door-to-needle times within 60 minutes. Circulation 123:750–758.

Furui S (1986) Speaker-independent isolated word recognition based on emphasized spectral dynamics. In: ICASSP ’86. IEEE International Conference on Acoustics, Speech, and Signal Processing, pp 1991–1994.

Grysiewicz RA, Thomas K, Pandey DK (2008) Epidemiology of ischemic and hemorrhagic stroke: incidence, prevalence, mortality, and risk factors. Neurol Clin 26:871–895, vii.

Hacke W, Kaste M, Bluhmki E, Brozman M, Dávalos A, Guidetti D, Larrue V, Lees KR, Medeghri Z, Machnig T, Schneider D, von Kummer R, Wahlgren N, Toni D (2008) Thrombolysis with alteplase 3 to 4.5 hours after acute ischemic stroke. N Engl J Med 359:1317–1329.

Harbison J, Hossain O, Jenkinson D, Davis J, Louw SJ, Ford GA (2003) Diagnostic accuracy of stroke referrals from primary care, emergency room physicians, and ambulance staff using the face arm speech test. Stroke 34:71–76.

He K, Zhang X, Ren S, Sun J (2016) Deep residual learning for image recognition. In: 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp 770–778.

He K, Fan H, Wu Y, Xie S, Girshick R (2020) Momentum contrast for unsupervised visual representation learning. In: 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp 9726–9735.

Hershey S, Chaudhuri S, Ellis DPW, Gemmeke JF, Jansen A, Moore RC, Plakal M, Platt D, Saurous RA, Seybold B, Slaney M, Weiss RJ, Wilson K (2017) CNN architectures for large-scale audio classification. In: 2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp 131–135.

Johnston SC, Mendis S, Mathers CD (2009) Global variation in stroke burden and mortality: estimates from monitoring, surveillance, and modelling. Lancet Neurol 8:345–354.

Khaliq F, Oberhauser J, Wakhloo D, Mahajani S (2023) Decoding degeneration: the implementation of machine learning for clinical detection of neurodegenerative disorders. Neural Regen Res 18:1235–1242.

Kim-Tenser M, Mlynash M, Lansberg MG, Tenser M, Bulic S, Jagadeesan B, Christensen S, Simpkins A, Albers GW, Marks MP, Heit JJ (2021) CT perfusion core and ASPECT score prediction of outcomes in DEFUSE 3. Int J Stroke 16:288–294.

Kingma DP, Ba J (2014) Adam: a method for stochastic optimization. arXiv:1412.6980.

Kothari RU, Pancioli A, Liu T, Brott T, Broderick J (1999) Cincinnati Prehospital Stroke Scale: reproducibility and validity. Ann Emerg Med 33:373–378.

LeCun Y, Bengio Y, Hinton G (2015) Deep learning. Nature 521:436–444.

Liang M, Li Z, Chen T, Zeng J (2015) Integrative data analysis of multi-platform cancer data with a multimodal deep learning approach. IEEE/ACM Trans Comput Biol Bioinform 12:928–937.

Marler JR, Tilley BC, Lu M, Brott TG, Lyden PC, Grotta JC, Broderick JP, Levine SR, Frankel MP, Horowitz SH, Haley EC Jr., Lewandowski CA, Kwiatkowski TP (2000) Early stroke treatment associated with better outcome: the NINDS rt-PA stroke study. Neurology 55:1649–1655.

Nogueira RG, et al. (2018) Thrombectomy 6 to 24 hours after stroke with a mismatch between deficit and infarct. N Engl J Med 378:11–21.

Radford A, Kim JW, Hallacy C, Ramesh A, Goh G, Agarwal S, Sastry G, Askell A, Mishkin P, Clark J, Krueger G, Sutskever I (2021) Learning transferable visual models from natural language supervision. arXiv:2103.00020.

Tran D, Wang H, Feiszli M, Torresani L (2019) Video classification with channel-separated convolutional networks. In: 2019 IEEE/CVF International Conference on Computer Vision (ICCV), pp 5551–5560.

Tu WJ, Yan F, Chao BH, Cao L, Wang L (2022) Treatment and 1-year prognosis of ischemic stroke in China in 2018: a hospital-based study from bigdata observatory platform for stroke of China. Stroke 53:e415-417.

Uemura T, Näppi JJ, Watari C, Hironaka T, Kamiya T, Yoshida H (2021) Weakly unsupervised conditional generative adversarial network for image-based prognostic prediction for COVID-19 patients based on chest CT. Med Image Anal 73:102159.

Vaswani A, Shazeer N, Parmar N, Uszkoreit J, Jones L, Gomez AN, Kaiser Ł, Polosukhin I (2017) Attention is all you need. In: Proceedings of the 31st International Conference on Neural Information Processing Systems, pp 6000–6010. Long Beach, California, USA: Curran Associates Inc.

von Elm E, Altman DG, Egger M, Pocock SJ, Gøtzsche PC, Vandenbroucke JP; STROBE Initiative (2007) The Strengthening the Reporting of Observational Studies in Epidemiology (STROBE) statement: guidelines for reporting observational studies. PLoS Med 4:e296.

Wang R, Huang Z, Wang H, Wu H (2021) AMMASurv: asymmetrical multi-modal attention for accurate survival analysis with whole slide images and gene expression data. In: 2021 IEEE International Conference on Bioinformatics and Biomedicine (BIBM), pp 757–760.

Wen J, Thibeau-Sutre E, Diaz-Melo M, Samper-González J, Routier A, Bottani S, Dormont D, Durrleman S, Burgos N, Colliot O (2020) Convolutional neural networks for classification of Alzheimer’s disease: Overview and reproducible evaluation. Med Image Anal 63:101694.

World Health Organization (2014) Global status report on noncommunicable diseases 2014. https://iris.who.int/handle/10665/148114. Accessed July 18, 2023.

Yang C, Xu Y, Shi J, Dai B, Zhou B (2020) Temporal pyramid network for action recognition. arXiv:2004.03548.

Yao J, Zhu X, Jonnagaddala J, Hawkins N, Huang J (2020) Whole slide images based cancer survival prediction using attention guided deep multiple instance learning networks. Med Image Anal 65:101789.

Yao T, Zhang Y, Qiu Z, Pan Y, Mei T (2021) SeCo: Exploring sequence supervision for unsupervised representation learning. Proc AAAI Conf Artif Intell 35:10656–10664.

Yu M, Cai T, Huang X, Wong K, Volpi J, Wang JZ, Wong STC (2020) Toward rapid stroke diagnosis with multimodal deep learning. In: Medical Image Computing and Computer Assisted Intervention – MICCAI 2020 (Martel AL, Abolmaesumi P, Stoyanov D, Mateus D, Zuluaga MA, Zhou SK, Racoceanu D, Joskowicz L, eds), pp 616–626. Cham: Springer International Publishing.

Zhou B, Khosla A, Lapedriza A, Oliva A, Torralba A (2016) Learning deep features for discriminative localization. In: 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp 2921–2929.

Zhou T, Fu H, Chen G, Shen J, Shao L (2020a) Hi-Net: hybrid-fusion network for multi-modal MR image synthesis. IEEE Trans Med Imaging 39:2772–2781.

Zhou T, Thung KH, Liu M, Shi F, Zhang C, Shen D (2020b) Multi-modal latent space inducing ensemble SVM classifier for early dementia diagnosis with neuroimaging data. Med Image Anal 60:101630.

Keywords: artificial intelligence; deep learning; diagnosis; early detection; FAST; screening; stroke

Copyright: © 2025 Neural Regeneration Research