
RNN-Aided Dead-Reckoning: IMU-Only Navigation for Cost-Efficient AUV Fleets

2025-09-09 17:59:53

:::info Authors:

(1) Ivar Bjørgo Saksvik, Department of Mechanical, Electronic and Chemical Engineering, Oslo Metropolitan University, Oslo, Norway ([email protected]);

(2) Alex Alcocer, Department of Mechanical, Electronic and Chemical Engineering, Oslo Metropolitan University, Oslo, Norway ([email protected]);

(3) Vahid Hassani, Department of Mechanical, Electronic and Chemical Engineering, Oslo Metropolitan University, Oslo, Norway ([email protected]).

:::

Table of Links

Abstract and I. Introduction

II. Dead-Reckoning Navigation

III. Neural Network Aided Dead-Reckoning Navigation

IV. AUV Platforms and Datasets

V. Experimental and Simulation Results

VI. Conclusions and Further Work

VII. Acknowledgments and References

\ Abstract—This paper presents a deep learning approach to aid dead-reckoning (DR) navigation using a limited sensor suite. A Recurrent Neural Network (RNN) was developed to predict the relative horizontal velocities of an Autonomous Underwater Vehicle (AUV) using data from an IMU, a pressure sensor, and control inputs. The RNN is trained using experimental data, where a Doppler velocity logger (DVL) provided ground truth velocities. The predicted relative velocities were implemented in a dead-reckoning algorithm to approximate north and east positions. The studies in this paper were twofold: I) Experimental data from a Long-Range AUV was investigated; datasets from a series of surveys in Monterey Bay, California (U.S.) were used to train and test the RNN. II) The second study explores datasets generated by a simulated autonomous underwater glider. Environmental variables, e.g., ocean currents, were implemented in the simulation to reflect real ocean conditions. The proposed neural network approach to DR navigation was compared to the on-board navigation system and ground truth simulated positions.

\

I. INTRODUCTION

AUTONOMOUS UNDERWATER VEHICLES (AUVs) have in the last decades become important tools in ocean research. Untethered from umbilical cables, these vehicles are suitable for a wide variety of applications including bathymetric mapping, water sampling and environmental monitoring. A persistent challenge for AUVs is to navigate and georeference acquired sensor data during operations, as GPS signals cannot propagate through water. Conventional solutions to this issue involve adding acoustic navigation and/or positioning instruments to the AUV payload. Due to the good propagation of sound in water, Doppler velocity loggers and acoustic baseline systems are considered the backbone of AUV navigation and underwater positioning [10], [25]. However, these traditional sensors are often expensive and consume large amounts of power. In AUV fleets, the cost of adding acoustic instruments is compounded by the number of vehicles. In this paper we consider a limited sensor suite consisting of an IMU and a pressure transducer, where acoustic instruments are partially available to collect experimental training data. DVL velocity measurements collected from only a few missions are used as a reference in supervised neural network training. The aim of the trained network is to complement DR navigation when the DVL sensor is unavailable, for example in AUV fleets with budget limitations.

\ The absence of acoustic navigation and positioning instruments has traditionally been compensated for by model-based observers like Extended Kalman Filters (EKFs). These are derived from AUV dynamics to form an estimation model [8], [9], [21], [22]. Unfortunately, model-based observers rely on parameters that are difficult to obtain in practice. The dynamics of an AUV are derived from intricate hydrodynamic models, and experiments must be carried out in a towing-tank facility or using expensive CFD (Computational Fluid Dynamics) software to obtain hydrodynamic damping coefficients [28], [3]. If the external geometry of the AUV changes, e.g. when making small modifications to payload sections, the coefficients need to be updated.

\ To avoid deriving complex AUV models and conducting time-consuming towing-tank or CFD experiments, this paper presents a data-driven approach to dead-reckoning navigation. Using experimental data from AUV missions and simulations, a neural network is trained to learn and generalize relative AUV motions. Data-driven neural network regression eliminates the need for an explicit dynamic model, and avoids the modelling and estimation errors associated with classical state observers [4], [5]. A recurrent neural network (RNN) is developed to account for time-delayed effects in the AUV dynamics, which occur due to vehicle inertia, underactuation and added-mass effects [1], [6]. With an input layer composed of standard sensory measurements (pressure sensor, inertial measurement unit) and control actions, the RNN aims to predict the relative surge (u_r) and sway (v_r) velocities. These are then used in a dead-reckoning algorithm to approximate North and East positions during operations.
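As a rough illustration of the dead-reckoning step (not code from the paper), the predicted body-frame velocities can be rotated by the vehicle heading and integrated into North/East positions; the function and variable names below are our own.

```python
import numpy as np

def dead_reckon(u_r, v_r, heading, dt, n0=0.0, e0=0.0):
    """Integrate body-frame surge/sway velocities into North/East positions.

    u_r, v_r : predicted relative surge and sway velocities [m/s] (e.g. from the RNN)
    heading  : yaw angles from the IMU/compass [rad]
    dt       : sampling interval [s]
    """
    n, e = [n0], [e0]
    for uk, vk, psi in zip(u_r, v_r, heading):
        # Rotate body-frame velocities into the horizontal North-East plane
        n_dot = uk * np.cos(psi) - vk * np.sin(psi)
        e_dot = uk * np.sin(psi) + vk * np.cos(psi)
        n.append(n[-1] + n_dot * dt)
        e.append(e[-1] + e_dot * dt)
    return np.array(n), np.array(e)
```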

\

A. Related Work

Neural networks have previously been applied to state estimation for marine craft. In Zhang et al. [4] a Long Short-Term Memory (LSTM) recurrent neural network is proposed to estimate the relative position of an AUV. The LSTM network used data from a pressure sensor, an inertial measurement unit (IMU), and an acoustic Doppler velocity logger (DVL) to predict the horizontal north and east positions. Training and validation data were collected from a series of surface trajectories while logging GPS locations, which were projected as ground truth measurements. A similar study with the same AUV is presented in Mu et al. [5], where a bi-directional LSTM network was used. A neural network approach to dead-reckoning navigation of dynamically positioned ships is presented in Skulestad et al. [6]. Control actions and commands from vessel thrusters, combined with heading measurements, were used as input data in an RNN to aid navigation during GNSS outages. Experiments were conducted in a vessel simulator with time-varying environmental disturbances such as wind forces, sea waves and ocean currents. In Chen et al. [13] a neural network is presented to assist navigation during DVL malfunction. A nonlinear autoregressive network with exogenous SINS (Strapdown Inertial Navigation System) inputs was used. The network was tested and validated on a ship with a DVL mounted on the vessel hull to provide training and validation data.

\ The remainder of this paper is organized as follows: Sections II and III address the concept of dead-reckoning navigation and the neural network velocity observer, respectively. Section IV presents the AUV platforms and datasets used to train and test the neural networks. The results are detailed in Section V, and conclusions and recommendations for further work are presented in Section VI.

\

II. DEAD-RECKONING NAVIGATION

\

\

\ Fig. 1. DR navigation illustration.

\ \

:::info This paper is available on arxiv under CC BY 4.0 DEED license.

:::

\

Patient-Specific CNN-RNN for Lung-Sound Detection With 4× Smaller Memory

2025-09-09 17:38:50

Table of Links

Abstract and I Introduction

II. Materials and Methods

III. Results and Discussions

IV. Conclusion and References

III. RESULTS AND DISCUSSIONS

A. Generalized Model

Firstly, we evaluated our model on the four-class breathing cycle classification using the micro and macro metrics described in Section II-B. To compare our results with traditionally used CNN architectures, we used VGGnet and MobileNet. All models were trained and tested on a workstation with an Intel Xeon E5-2630 CPU, 128 GB RAM and an NVIDIA TITAN Xp GPU. The results, averaged over five randomized train-test sets, are tabulated in Table II. As can be seen from Table II, the proposed hybrid CNN-RNN model trained with data augmentation produces state-of-the-art results. Both VGG-16 and MobileNet produce slightly lower scores in terms of both macro and micro metrics. The score obtained by the proposed model also outperforms the results reported by Kochetov et al. using noise labels on a similar 80-20 split (Table I). We also performed a 10-fold cross-validation on the dataset for our proposed model, and the average score obtained is 66.43%. Due to the unavailability of similar audio datasets in the biomedical field, we also tested the proposed hybrid model on the Tensorflow speech recognition challenge [47] to benchmark its performance. For an eleven-class classification with a 90%-10% train-test split, it produced a respectable accuracy of 96%. For the sake of completeness, we also tested the dataset using the same train-test split strategy with a variety of commonly used temporal and spectral features (RMSE, ZCR, spectral centroid, roll-off frequency, entropy, spectral contrast, etc. [48]) and non-DL methods such as SVM, shallow neural networks, random forest and gradient boosting. The resulting scores were significantly lower (44.5-51.2%).

\

B. Patient Specific Model Tuning Strategy

It has been shown by Chambres et al. [43] that though it is difficult to achieve high scores for breathing cycle level classification, it is much easier to achieve high accuracy in patient-level binary classification (healthy/sick). Hence, we propose a screen and model tuning strategy. First, the patients are screened using the pre-trained model, and if a patient is found to be unhealthy, the pre-trained model is retrained on the patient data to build a patient-specific model that can monitor the patient's condition in the future with higher reliability. The proposed model is shown in Fig. 3. To evaluate the performance of the proposed methodology, we used leave-one-out validation: of the n samples from a patient, n − 1 samples are used to retrain the model and it is tested on the remaining sample. This is repeated so that every sample is in the test set once. We trained the proposed model on the patients in the train set and evaluated it on the patients in the test set. Since leave-one-out validation is not possible for patients with only one sample, we only considered patients with more than one sample. The dataset contains a different number of recordings for each patient, and the length of the recordings and the number of breathing cycles in each recording vary widely, but on average ≈ 47 breathing cycles per patient are used for the fine-tuning of the patient-specific models.
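A minimal sketch of this leave-one-out protocol is shown below; `clone_pretrained`, `retrain_fn` and `score_fn` are hypothetical placeholders for copying the generalized model, the patient-specific fine-tuning step and the scoring routine described above.

```python
def leave_one_out_eval(clone_pretrained, samples, labels, retrain_fn, score_fn):
    """Leave-one-out evaluation of patient-specific re-training.

    clone_pretrained : callable returning a fresh copy of the generalized model
    retrain_fn       : fine-tunes a model copy on the held-in patient samples
    score_fn         : evaluates a model on the single held-out sample
    """
    scores = []
    n = len(samples)
    for i in range(n):
        train_x = [x for j, x in enumerate(samples) if j != i]
        train_y = [y for j, y in enumerate(labels) if j != i]
        model_i = clone_pretrained()              # start from the pre-trained model
        retrain_fn(model_i, train_x, train_y)     # patient-specific fine-tuning
        scores.append(score_fn(model_i, samples[i], labels[i]))
    return sum(scores) / n
```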

\

\ Secondly, since we are using patient-specific data to train the models, we have to verify whether our proposed model tuning strategy provides any advantage over a simple classifier trained only on patient-specific data. To verify this, we used an ImageNet [49] trained VGG-16 [40] as a feature extractor along with an SVM classifier to build patient-specific models. Variants of VGG trained on the ImageNet dataset have been shown to be very efficient feature extractors not only for image classification, but also for audio classification [50]. Here we use the pre-trained CNN to extract features from patient recordings and train an SVM on those features using only the patient-specific data.
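The baseline described in the previous paragraph could look roughly as follows; this is a hedged sketch, and the input image size, channel replication and SVM kernel are our assumptions rather than the paper's stated settings.

```python
import numpy as np
from tensorflow.keras.applications import VGG16
from sklearn.svm import SVC

# ImageNet-trained VGG-16 without its dense layers, used as a frozen feature extractor.
extractor = VGG16(weights="imagenet", include_top=False, pooling="avg",
                  input_shape=(128, 128, 3))

def fit_patient_svm(spectrogram_images, labels):
    """Train an SVM on VGG-16 features extracted from one patient's spectrograms.

    spectrogram_images: array of shape (n, 128, 128, 3); single-channel
    Mel-spectrograms can be resized and repeated across three channels.
    """
    features = extractor.predict(np.asarray(spectrogram_images), verbose=0)
    clf = SVC(kernel="rbf")
    return clf.fit(features, labels)
```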

\ Thirdly, we propose that by pre-training the hybrid CNN-RNN model on the respiratory data, the model learns domain-specific feature representations that are transferred to the patient-specific model. To justify this claim, we trained the same model on the Tensorflow speech recognition challenge dataset [47] as well as the UrbanSound8K dataset [51]. Then we used the same model tuning strategy to re-train the model on patient-specific data. If the proposed model learns only audio-feature-specific abstract representations from the data, then a model trained on any sufficiently large audio database should perform well. But if the model learns respiratory-sound domain-specific features from the data, the model pre-trained on respiratory sounds should outperform the model pre-trained on any other type of audio database. Finally, we compare the results of our model with the pure CNN models VGG-16 and MobileNet using the same experimental methodology.

\ The results are tabulated in Table III. Firstly, our proposed strategy outperforms all other models and strategies and obtains a score of 71.81%. Secondly, VGG-16 and MobileNet achieve scores of 68.54% and 67.60%, which signifies that pure CNNs can be employed for respiratory audio classification, albeit not as effectively as a CNN-RNN hybrid model. Thirdly, the results corresponding to both audio-trained networks show that audio-domain pre-training is not very effective for respiratory-domain feature extraction. We explain this observation in further detail in Section IV. Finally, the ImageNet-trained VGG-16 shows promise as a feature extractor for respiratory data, although it does not reach the same level of performance as the ICBHI-trained models.

\ TABLE II. COMPARISON OF RESULTS

\ Fig. 3. Screen and model tuning strategy: First the patients are screened into healthy and unhealthy based on the percentage of breathing cycles predicted as unhealthy. For patients predicted to be unhealthy, the trained model is re-trained on patient-specific data to produce a patient-specific model, which then performs the four-class prediction on breathing cycles.

\ TABLE III. COMPARISON OF PATIENT-SPECIFIC MODELS


\

C. Memory and Computational Complexity

Even though the proposed models show excellent performance in the classification task, the memory required to store the huge number of weights in these models makes them unsuitable for mobile and wearable platforms. Hence, we apply the local log quantization scheme proposed in Section II-D4. Figure 4 shows the score achieved by the models as a function of the bit precision of the weights. As expected, VGG-16 outperforms the other two models due to its over-parameterized design [38]. MobileNet shows particularly poor performance under weight quantization and is only able to achieve optimum accuracy at 10-bit precision. This poor quantization performance can be attributed to the large number of batch-normalization layers and the ReLU6 activation of the MobileNet architecture [38]. While several approaches have been proposed to circumvent these issues [52], these methods are not compatible with the ImageNet pre-trained MobileNet model since they focus on modifications to the architecture rather than quantization of pre-trained weights. The hybrid CNN-RNN model performs slightly worse than VGG-16 since its LSTM layer requires higher bit precision compared to the CNN counterpart [53].

\

\ Fig. 4. Local log quantization: Score achieved by VGG-16, MobileNet and the hybrid CNN-RNN with varying bit precision under local log quantization. VGG-16 requires the lowest bit precision to achieve full-precision (fp) accuracy while MobileNet requires the highest.

\ Fig. 5. Resource comparison: Comparison of normalized computational complexity (GFLOPS/sample) and minimum memory required (Mbits) by VGG-16, MobileNet and the hybrid CNN-RNN. MobileNet and the hybrid CNN-RNN present a trade-off between computational complexity and the memory required for optimum performance.

\ Finally, our proposed system requires data pre-processing, feature extraction and classification only once per breathing cycle. Therefore, if we consider a ping-pong buffer architecture [54] for audio acquisition and processing, our system needs to perform end-to-end classification of breathing cycles at a latency smaller than the minimum breathing cycle duration for real-time operation. The primary computational bottleneck of the proposed system is the DL architecture, as mentioned earlier. The number of computations of the proposed architecture is of the same order as MobileNet, as shown in Fig. 5. Since the minimum breathing cycle duration is > 1 second [55] and the per-sample latency of MobileNet on modern mobile SoCs is only ∼ 100 ms [56], the proposed system should easily be able to perform real-time classification of respiratory anomalies.

\

IV. CONCLUSION

In this paper, we have developed a hybrid CNN-RNN model that produces state-of-the-art results for the ICBHI'17 respiratory audio dataset. It produces a score of 66.31% on the 80-20 split for four-class respiratory cycle classification. We also propose a patient screening and model tuning strategy to identify unhealthy patients and then build patient-specific models through patient-specific re-training. This proposed model provides significantly more reliable results for the original train-test split, achieving a score of 71.81% for leave-one-out cross-validation. It is observed that, surprisingly, models trained in the image recognition domain transfer knowledge better than those pre-trained on speech. A possible explanation is that image-based models are trained on the much larger ImageNet dataset and therefore have better generalization performance than models trained on relatively smaller audio datasets. The lack of available pre-trained models in the audio domain and the prohibitively long training time required to train a model on audio datasets of a size comparable to ImageNet prevent us from verifying this hypothesis in this work; in the future we plan to explore the transfer learning performance of audio and image datasets in further detail. We also develop a local log quantization strategy for reducing the memory cost of the models that achieves ≈ 4× reduction in the minimum memory required without loss of performance. The primary significance of this result is that the weight quantization strategy achieves considerable weight compression without any architectural modification to the model or quantization-aware training. Finally, while the proposed model has higher computational complexity than MobileNet, it has the smallest memory footprint among the models under consideration. Since the amount of data from a single patient is still very small for this dataset, in the future we plan to employ this strategy with a larger amount of patient-specific data. We also plan to create an embedded implementation of this algorithm for a wearable device to be used in patient monitoring at home. Further reductions in computational complexity will be explored using a neuromorphic spike-based approach [57], [58].

\

REFERENCES

[1] N. Gavriely, Y. Palti, G. Alroy, and J. B. Grotberg, “Measurement and theory of wheezing breath sounds,” Journal of Applied Physiology, vol. 57, no. 2, pp. 481–492, 1984.

\ [2] P. Piirila and A. Sovijarvi, “Crackles: recording, analysis and clinical significance,” European Respiratory Journal, vol. 8, no. 12, pp. 2139– 2148, 1995.

\ [3] M. Bahoura and C. Pelletier, “Respiratory sounds classification using gaussian mixture models,” in Canadian Conference on Electrical and Computer Engineering 2004 (IEEE Cat. No. 04CH37513), vol. 3. IEEE, 2004, pp. 1309–1312.

\ [4] J. Acharya, A. Basu, and W. Ser, “Feature extraction techniques for lowpower ambulatory wheeze detection wearables,” in 2017 39th Annual International Conference of the IEEE Engineering in Medicine and Biology Society (EMBC). IEEE, 2017, pp. 4574–4577.

\ [5] B.-S. Lin and B.-S. Lin, “Automatic wheezing detection using speech recognition technique,” Journal of Medical and Biological Engineering, vol. 36, no. 4, pp. 545–554, 2016.

\ [6] M. Bahoura, “Pattern recognition methods applied to respiratory sounds classification into normal and wheeze classes,” Computers in biology and medicine, vol. 39, no. 9, pp. 824–843, 2009.

\ [7] J. Zhang, W. Ser, J. Yu, and T. Zhang, “A novel wheeze detection method for wearable monitoring systems,” in 2009 International Symposium on Intelligent Ubiquitous Computing and Education. IEEE, 2009, pp. 331–334.

\ [8] P. Bokov, B. Mahut, P. Flaud, and C. Delclaux, “Wheezing recognition algorithm using recordings of respiratory sounds at the mouth in a pediatric population,” Computers in biology and medicine, vol. 70, pp. 40–50, 2016.

\ [9] I. Sen, M. Saraclar, and Y. P. Kahya, “A comparison of svm and gmmbased classifier configurations for diagnostic classification of pulmonary sounds,” IEEE Transactions on Biomedical Engineering, vol. 62, no. 7, pp. 1768–1776, 2015.

\ [10] N. Jakovljevic and T. Lončar-Turukalo, “Hidden Markov model based respiratory sound classification,” in Precision Medicine Powered by pHealth and Connected Health. Springer, 2018, pp. 39–43.

\ [11] R. X. A. Pramono, S. Bowyer, and E. Rodriguez-Villegas, “Automatic adventitious respiratory sound analysis: A systematic review,” PloS one, vol. 12, no. 5, p. e0177926, 2017.

\ [12] H. Chen, X. Yuan, Z. Pei, M. Li, and J. Li, “Triple-classification of respiratory sounds using optimized s-transform and deep residual networks,” IEEE Access, vol. 7, pp. 32 845–32 852, 2019.

\ [13] G. Litjens, T. Kooi, B. E. Bejnordi, A. A. A. Setio, F. Ciompi, M. Ghafoorian, J. A. Van Der Laak, B. Van Ginneken, and C. I. Sánchez, “A survey on deep learning in medical image analysis,” Medical image analysis, vol. 42, pp. 60–88, 2017.

\ [14] E. Hosseini-Asl, G. Gimel’farb, and A. El-Baz, “Alzheimer’s disease diagnostics by a deeply supervised adaptable 3d convolutional network,” arXiv preprint arXiv:1607.00556, 2016.

\ [15] M. J. van Grinsven, B. van Ginneken, C. B. Hoyng, T. Theelen, and C. I. Sánchez, “Fast convolutional neural network training using selective data sampling: Application to hemorrhage detection in color fundus images,” IEEE transactions on medical imaging, vol. 35, no. 5, pp. 1273–1284, 2016.

\ [16] Y. Song, L. Zhang, S. Chen, D. Ni, B. Lei, and T. Wang, “Accurate segmentation of cervical cytoplasm and nuclei based on multiscale convolutional network and graph partitioning,” IEEE Transactions on Biomedical Engineering, vol. 62, no. 10, pp. 2421–2433, 2015.

\ [17] O. Oktay, W. Bai, M. Lee, R. Guerrero, K. Kamnitsas, J. Caballero, A. de Marvao, S. Cook, D. O'Regan, and D. Rueckert, “Multi-input cardiac image super-resolution using convolutional neural networks,” in International conference on medical image computing and computer-assisted intervention. Springer, 2016, pp. 246–254.

\ [18] P. Kisilev, E. Sason, E. Barkan, and S. Hashoul, “Medical image description using multi-task-loss cnn,” in Deep Learning and Data Labeling for Medical Applications. Springer, 2016, pp. 121–129.

\ [19] P. V. Tran, “A fully convolutional neural network for cardiac segmentation in short-axis mri,” arXiv preprint arXiv:1604.00494, 2016.

\ [20] H. K. van der Burgh, R. Schmidt, H.-J. Westeneng, M. A. de Reus, L. H. van den Berg, and M. P. van den Heuvel, “Deep learning predictions of survival based on mri in amyotrophic lateral sclerosis,” NeuroImage: Clinical, vol. 13, pp. 361–369, 2017.

\ [21] T. Kooi, B. van Ginneken, N. Karssemeijer, and A. den Heeten, “Discriminating solitary cysts from soft tissue lesions in mammography using a pretrained deep convolutional neural network,” Medical physics, vol. 44, no. 3, pp. 1017–1027, 2017.

\ [22] X. Chen, Y. Xu, D. W. K. Wong, T. Y. Wong, and J. Liu, “Glaucoma detection based on deep convolutional neural network,” in 2015 37th annual international conference of the IEEE engineering in medicine and biology society (EMBC). IEEE, 2015, pp. 715–718.

\ [23] Y. Bengio, P. Simard, P. Frasconi et al., “Learning long-term dependencies with gradient descent is difficult,” IEEE transactions on neural networks, vol. 5, no. 2, pp. 157–166, 1994.

\ [24] H. Salehinejad, S. Sankar, J. Barfett, E. Colak, and S. Valaee, “Recent advances in recurrent neural networks,” arXiv preprint arXiv:1801.01078, 2017.

\ [25] A. Ullah, J. Ahmad, K. Muhammad, M. Sajjad, and S. W. Baik, “Action recognition in video sequences using deep bi-directional lstm with cnn features,” IEEE Access, vol. 6, pp. 1155–1166, 2018.

\ [26] Y. Zhao, X. Jin, and X. Hu, “Recurrent convolutional neural network for speech processing,” in 2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2017, pp. 5300–5304.

\ [27] J. Amoh and K. Odame, “Deep neural networks for identifying cough sounds,” IEEE transactions on biomedical circuits and systems, vol. 10, no. 5, pp. 1003–1011, 2016.

\ [28] H. Nakano, T. Furukawa, and T. Tanigawa, “Tracheal sound analysis using a deep neural network to detect sleep apnea,” Journal of Clinical Sleep Medicine, vol. 15, no. 08, pp. 1125–1133, 2019.

\ [29] H. Ryu, J. Park, and H. Shin, “Classification of heart sound recordings using convolution neural network,” in 2016 Computing in Cardiology Conference (CinC). IEEE, 2016, pp. 1153–1156.

\ [30] H. Chang, J. Han, C. Zhong, A. M. Snijders, and J.-H. Mao, “Unsupervised transfer learning via multi-scale convolutional sparse coding for biomedical applications,” IEEE transactions on pattern analysis and machine intelligence, vol. 40, no. 5, pp. 1182–1194, 2018.

\ [31] A. Payan and G. Montana, “Predicting alzheimer’s disease: a neuroimaging study with 3d convolutional neural networks,” arXiv preprint arXiv:1502.02506, 2015.

\ [32] L. S. Hu, H. Yoon, J. M. Eschbacher, L. C. Baxter, A. C. Dueck, A. Nespodzany, K. A. Smith, P. Nakaji, Y. Xu, L. Wang et al., “Accurate patient-specific machine learning models of glioblastoma invasion using transfer learning,” American Journal of Neuroradiology, vol. 40, no. 3, pp. 418–425, 2019.

\ [33] A. Bellot and M. Schaar, “Boosting transfer learning with survival data from heterogeneous domains,” in The 22nd International Conference on Artificial Intelligence and Statistics, 2019, pp. 57–65.

\ [34] S. Kiranyaz, T. Ince, R. Hamila, and M. Gabbouj, “Convolutional neural networks for patient-specific ecg classification,” in 2015 37th Annual International Conference of the IEEE Engineering in Medicine and Biology Society (EMBC). IEEE, 2015, pp. 2608–2611.

\ [35] P. G. Gibson, “Monitoring the patient with asthma: an evidence-based approach,” Journal of Allergy and Clinical Immunology, vol. 106, no. 1, pp. 17–26, 2000.

\ [36] I. Hubara, M. Courbariaux, D. Soudry, R. El-Yaniv, and Y. Bengio, “Binarized neural networks,” in Advances in neural information processing systems, 2016, pp. 4107–4115.

\ [37] A. G. Howard, M. Zhu, B. Chen, D. Kalenichenko, W. Wang, T. Weyand, M. Andreetto, and H. Adam, “Mobilenets: Efficient convolutional neural networks for mobile vision applications,” arXiv preprint arXiv:1704.04861, 2017.

\ [38] T. Sheng, C. Feng, S. Zhuo, X. Zhang, L. Shen, and M. Aleksic, “A quantization-friendly separable convolution for mobilenets,” in 2018 1st Workshop on Energy Efficient Machine Learning and Cognitive Computing for Embedded Applications (EMC2). IEEE, 2018, pp. 14– 18.

\ [39] B. Rocha, D. Filos, L. Mendes, I. Vogiatzis, E. Perantoni, E. Kaimakamis, P. Natsiavas, A. Oliveira, C. Jácome, A. Marques et al., “A respiratory sound database for the development of automated classification,” in Precision Medicine Powered by pHealth and Connected Health. Springer, 2018, pp. 33–37.

\ [40] K. Simonyan and A. Zisserman, “Very deep convolutional networks for large-scale image recognition,” arXiv preprint arXiv:1409.1556, 2014.

\ [41] K. Kochetov, E. Putin, M. Balashov, A. Filchenkov, and A. Shalyto, “Noise masking recurrent neural network for respiratory sound classification,” in International Conference on Artificial Neural Networks. Springer, 2018, pp. 208–217.

\ [42] D. Perna, “Convolutional neural networks learning from respiratory data,” in 2018 IEEE International Conference on Bioinformatics and Biomedicine (BIBM). IEEE, 2018, pp. 2109–2113.

\ [43] G. Chambres, P. Hanna, and M. Desainte-Catherine, “Automatic detection of patient with respiratory diseases using lung sound analysis,” in 2018 International Conference on Content-Based Multimedia Indexing (CBMI). IEEE, 2018, pp. 1–6.

\ [44] E. Çakır, G. Parascandolo, T. Heittola, H. Huttunen, and T. Virtanen, “Convolutional recurrent neural networks for polyphonic sound event detection,” IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 25, no. 6, pp. 1291–1303, 2017.

\ [45] J. Sang, S. Park, and J. Lee, “Convolutional recurrent neural networks for urban sound classification using raw waveforms,” in 2018 26th European Signal Processing Conference (EUSIPCO). IEEE, 2018, pp. 2444– 2448.

\ [46] S. Hochreiter and J. Schmidhuber, “Long short-term memory,” Neural computation, vol. 9, no. 8, pp. 1735–1780, 1997.

\ [47] P. Warden, “Speech commands: A public dataset for single-word speech recognition,” Dataset available from http://download.tensorflow.org/data/speech_commands_v0, vol. 1, 2017.

\ [48] R. X. A. Pramono, S. A. Imtiaz, and E. Rodriguez-Villegas, “Evaluation of features for classification of wheezes and normal respiratory sounds,” PloS one, vol. 14, no. 3, p. e0213659, 2019.

\ [49] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei, “Imagenet: A large-scale hierarchical image database,” in 2009 IEEE conference on computer vision and pattern recognition. Ieee, 2009, pp. 248–255.

\ [50] S. Hershey, S. Chaudhuri, D. P. Ellis, J. F. Gemmeke, A. Jansen, R. C. Moore, M. Plakal, D. Platt, R. A. Saurous, B. Seybold et al., “CNN architectures for large-scale audio classification,” in 2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2017, pp. 131–135.

\ [51] J. Salamon, C. Jacoby, and J. P. Bello, “A dataset and taxonomy for urban sound research,” in 22nd ACM International Conference on Multimedia (ACM-MM’14), Orlando, FL, USA, Nov. 2014, pp. 1041–1044.

\ [52] S. Alyamkin, M. Ardi, A. Brighton, A. C. Berg, B. Chen, Y. Chen, H.-P. Cheng, Z. Fan, C. Feng, B. Fu et al., “Low-power computer vision: Status, challenges, opportunities,” IEEE Journal on Emerging and Selected Topics in Circuits and Systems, 2019.

\ [53] T. Gokmen, M. Rasch, and W. Haensch, “Training lstm networks with resistive cross-point devices,” Frontiers in neuroscience, vol. 12, p. 745, 2018.

\ [54] D. Katz, “Fundamentals of embedded audio, part 3,” Sep 2007. [Online]. Available: https://www.eetimes.com/fundamentals-of-embedded-audio-part-3/

\ [55] W. Q. Lindh, M. Pooler, C. D. Tamparo, B. M. Dahl, and J. Morris, Delmar’s comprehensive medical assisting: administrative and clinical competencies. Cengage Learning, 2013.

\ [56] A. Ignatov, R. Timofte, W. Chou, K. Wang, M. Wu, T. Hartley, and L. Van Gool, “Ai benchmark: Running deep neural networks on android smartphones,” in Proceedings of the European Conference on Computer Vision (ECCV), 2018, pp. 0–0.

\ [57] A. Basu, J. Acharya, T. Karnik et al., “Low-power, adaptive neuromorphic systems: Recent progress and future directions,” IEEE Journal on Emerging and Selected Topics in Circuits and Systems, pp. 6–27, 2018.

\ [58] J. Acharya, A. Patil, X. Li, Y. Chen, S. C. Liu, and A. Basu, “A comparison of low-complexity real-time feature extraction for neuromorphic speech recognition,” Frontiers in neuroscience, vol. 12, p. 160, 2018.

\

:::info Authors:

(1) Jyotibdha Acharya (Student Member, IEEE), HealthTech NTU, Interdisciplinary Graduate Program, Nanyang Technological University, Singapore;

(2) Arindam Basu (Senior Member, IEEE), School of Electrical and Electronic Engineering, Nanyang Technological University, Singapore.

:::


:::info This paper is available on arxiv under ATTRIBUTION-NONCOMMERCIAL-SHAREALIKE 4.0 INTERNATIONAL license.

:::

\

Dataset, Features, Model, and Quantization Strategy for Respiratory Sound Classification

2025-09-09 17:34:01

Table of Links

Abstract and I Introduction

II. Materials and Methods

III. Results and Discussions

IV. Conclusion and References

II. MATERIALS AND METHODS

A. Dataset

For this work we have used the International Conference on Biomedical and Health Informatics (ICBHI’17) scientific challenge respiratory sound database [39]. This is the largest publicly available respiratory sound database.

\ Fig. 1. Hybrid CNN-RNN: a three stage deep learning model. Stage 1 is a CNN that extracts abstract feature maps from input Mel-spectrograms, stage 2 consists of a Bi-LSTM layer that learns temporal features and stage 3 consists of fully connected (FC) and softmax layers that convert outputs to class predictions.

\ The database contains 920 recordings from 126 patients. Each breathing cycle in a recording is annotated by respiratory experts as one of four classes: normal, wheeze, crackle and both (wheeze and crackle). The database contains a total of 6898 respiratory cycles, of which 1864 cycles contain crackles, 886 contain wheezes, 506 contain both and the rest are normal. The dataset contains samples recorded with different equipment (AKG C417L microphone, 3M Littmann Classic II SE stethoscope, 3M Littmann 3200 electronic stethoscope and WelchAllyn Meditron Master Elite electronic stethoscope) from hospitals in Portugal and Greece. The data are recorded from different chest locations: 1) trachea, 2) anterior left, 3) anterior right, 4) posterior left, 5) posterior right, 6) lateral left and 7) lateral right. Furthermore, a significant number of samples are noisy. These characteristics make the classification problem more challenging and much closer to real-world scenarios than manually curated datasets recorded under ideal conditions. Further details about the database and data collection methods can be found in [39].

\

B. Evaluation Metrics

In the original challenge, out of the 920 recordings, 539 were marked as training samples and 381 as testing samples. There are no common patients between the training and testing sets. For this work we used the officially described evaluation metrics for the four-class (normal (N), crackle (C), wheeze (W) and both (B)) classification problem, defined as follows:

\

\ For a more complete evaluation of the proposed model, we also evaluate it using other commonly used metrics such as precision, recall and F1-score. Moreover, the dataset has a disproportionate number of normal versus anomalous samples, and the official metrics are micro-averaged (calculated over all the classes). There is therefore a chance that the performance of the models on one class overshadows the other classes in the overall results. Hence, we calculated the precision, recall and F1-score using macro-averaging (the metrics are computed for each class individually and then averaged).
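As a hedged illustration (the official metric equations are not reproduced in this excerpt), the sketch below computes macro- and micro-averaged metrics with scikit-learn, together with sensitivity, specificity and their average in the way the ICBHI challenge score is commonly defined; this is our reading, not a verbatim restatement of the paper's equations.

```python
from sklearn.metrics import precision_recall_fscore_support

CLASSES = ["N", "C", "W", "B"]          # normal, crackle, wheeze, both

def evaluate(y_true, y_pred):
    """Macro vs. micro averaging, plus an ICBHI-style sensitivity/specificity score."""
    macro = precision_recall_fscore_support(
        y_true, y_pred, labels=CLASSES, average="macro", zero_division=0)[:3]
    micro = precision_recall_fscore_support(
        y_true, y_pred, labels=CLASSES, average="micro", zero_division=0)[:3]
    # Our reading of the official metrics: sensitivity is the fraction of
    # anomalous cycles (C, W, B) classified correctly, specificity the fraction
    # of normal cycles classified correctly, and the "score" their average.
    anomalous = [(t, p) for t, p in zip(y_true, y_pred) if t != "N"]
    normal = [(t, p) for t, p in zip(y_true, y_pred) if t == "N"]
    se = sum(t == p for t, p in anomalous) / max(len(anomalous), 1)
    sp = sum(t == p for t, p in normal) / max(len(normal), 1)
    return {"macro_prf": macro, "micro_prf": micro,
            "sensitivity": se, "specificity": sp, "score": (se + sp) / 2}
```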

\

C. Related Work

A number of papers have been published so far analyzing this dataset. Jakovljevic et al. [10] used a hidden Markov model with a Gaussian mixture model to classify the breathing cycles. They used spectral-subtraction-based noise suppression to pre-process the data, and MFCC features for classification. Their models obtained a score of 39.56% on the original train-test split and 49.5% on 10-fold cross-validation of the training set.

\ Kochetov et al. [41] proposed a noise-masking RNN for the four-class classification. Their proposed model contains two sections: an attention network for binary classification of respiratory cycles into noisy and non-noisy classes, and an RNN for the four-class classification. The attention network learns to identify noisy parts of the audio, suppresses those sections and passes the filtered audio to the RNN for classification. With an 80-20 split, they obtained a score of 65.7%. They did not report the score for the original train-test split. Though this method reports relatively higher scores, one primary issue is that there are no noise labels in the metadata of the ICBHI dataset and the paper does not mention any method for obtaining these labels. Since there are no known objective methods to measure noise labels in this type of audio signal, this kind of manual labeling of the respiratory cycles makes their results unreliable and irreproducible.

\ Perna et al. [42] used a deep CNN architecture to classify the breathing cycles into healthy and unhealthy and obtained an accuracy of 83% using an 80-20 train-test split and MFCC features. They also performed a ternary classification of the recordings into healthy, chronic and non-chronic diseases and obtained an accuracy of 82%.

\ Chen et al. [12] used optimized S-transform based feature maps along with deep residual nets (ResNets) on a smaller subset of the dataset (489 recordings) to classify the samples (not individual breathing cycles) into three classes (N, C and W) and obtained an accuracy of 98.79% on a 70-30 train-test split.

\ Finally, Chambres et al. [43] have proposed a patient-level model where they classify the individual breathing cycles into one of the four classes using low-level features (mel bands, MFCC, etc.), rhythm features (loudness, BPM, etc.), SFX features (harmonicity and inharmonicity information) and tonal features (chord strength, tuning frequency, etc.). They used a boosted tree method for the classification. Next, they classified the patients as healthy or unhealthy based on the percentage of the patient's breathing cycles classified as abnormal. They obtained an accuracy of 49.63% on the breathing cycle level classification and an accuracy of 85% on the patient-level classification. The justification for this patient-level model is that medical professionals do not make decisions about patients based on individual breathing cycles but rather on longer breathing sound segments, and the trends represented by several breathing cycles of a patient can provide a more consistent diagnosis. A summary of the literature is presented in Table I.

\

D. Proposed Method

1) Feature Extraction and Data Augmentation: Since the audio samples in the dataset had different sampling frequencies, all of the signals were first down-sampled to 4 kHz. Since both wheeze and crackle signals are typically present within the frequency range 0-2 kHz, down-sampling the audio samples to 4 kHz should not cause any loss of relevant information.

\ As the dataset is relatively small for training a deep learning model, we used several data augmentation techniques to increase its size. We used noise addition, speed variation, random shifting, pitch shifting, etc. to create augmented samples. Aside from increasing the dataset size, these data augmentation methods also help the network learn useful data representations in spite of different recording conditions, different equipment, patient age and gender, inter-patient variability of breathing rate, etc.
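A sketch of one such augmentation step is shown below; the noise level, stretch range and shift amounts are illustrative choices, not the paper's settings.

```python
import numpy as np
import librosa

def augment(y, sr=4000, rng=np.random.default_rng()):
    """Return a randomly augmented copy of a breathing-cycle waveform."""
    # Additive white noise (scale chosen for illustration)
    y_aug = y + 0.005 * rng.standard_normal(len(y))
    # Speed variation (time stretch) and pitch shift
    y_aug = librosa.effects.time_stretch(y_aug, rate=rng.uniform(0.9, 1.1))
    y_aug = librosa.effects.pitch_shift(y_aug, sr=sr, n_steps=rng.uniform(-1, 1))
    # Random circular shift in time
    y_aug = np.roll(y_aug, rng.integers(0, len(y_aug)))
    return y_aug
```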

\ For feature extraction we used the Mel-frequency spectrogram with a window size of 60 ms and 50% overlap. Each breathing cycle is converted to a 2D image where the rows correspond to frequencies on the Mel scale, the columns correspond to time windows, and each value represents the log-amplitude of the signal at that frequency and time window.
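With librosa, this feature extraction could look roughly as follows; the number of Mel bands is our assumption, as it is not stated in this excerpt.

```python
import librosa

def mel_features(y, sr=4000, n_mels=64):
    """Log Mel-spectrogram with a 60 ms window and 50% overlap."""
    n_fft = int(0.060 * sr)          # 60 ms window -> 240 samples at 4 kHz
    hop = n_fft // 2                 # 50% overlap
    mel = librosa.feature.melspectrogram(y=y, sr=sr, n_fft=n_fft,
                                         hop_length=hop, n_mels=n_mels)
    return librosa.power_to_db(mel)  # rows: Mel bands, columns: time windows
```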

\ 2) Hybrid CNN-RNN: We propose a hybrid CNN-RNN model (Fig. 1) that consists of three stages: the first stage is a deep CNN model that extracts abstract feature representations from the input data, the second stage consists of a bidirectional long short-term memory (Bi-LSTM) layer that learns temporal relations, and the third stage consists of fully connected and softmax layers that convert the output of the previous layers into class predictions. While this type of hybrid CNN-RNN architecture has more commonly been used in sound event detection ([44], [45]), the sporadic nature of wheeze and crackle as well as their temporal and frequency variance suggest that similar hybrid architectures may prove useful for lung sound classification.

\ The first stage consists of batch-normalization, convolution and max-pool layers. The batch normalization layer scales the input images over each batch to stabilize the training. In the 2D convolution layers the input is convolved with 2D kernels to produce abstract feature maps. Each convolution layer is followed by a Rectified Linear Unit (ReLU) activation function. The max-pool layer selects the maximum values from a pixel neighborhood, which reduces the overall network parameters and results in shift-invariance [13].

\ LSTMs were proposed by Hochreiter and Schmidhuber [46]; they consist of gated recurrent cells that block or pass the data in a sequence or time series by learning the perceived importance of data points. The current output and hidden state of a cell are a function of the current as well as past values of the data. A bidirectional LSTM consists of two interconnected LSTM layers, one of which operates in the same direction as the data sequence while the other operates in the reverse direction, so the current output of the Bi-LSTM layer is a function of the current, past and future values of the data. We used tanh as the non-linear activation function for this layer.
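A minimal Keras sketch of such a three-stage model is given below; the filter counts, LSTM width and input size are our illustrative choices, not the paper's exact configuration.

```python
from tensorflow.keras import layers, models

def build_hybrid_cnn_rnn(input_shape=(64, 128, 1), n_classes=4):
    """Stage 1: CNN feature extractor; stage 2: Bi-LSTM; stage 3: FC + softmax."""
    inp = layers.Input(shape=input_shape)            # (Mel bands, time windows, 1)
    x = layers.BatchNormalization()(inp)
    for n_filters in (32, 64, 128):                  # illustrative filter counts
        x = layers.Conv2D(n_filters, (3, 3), padding="same", activation="relu")(x)
        x = layers.MaxPooling2D((2, 2))(x)
    # Treat the remaining time axis as a sequence for the recurrent stage
    x = layers.Permute((2, 1, 3))(x)                 # (time, freq, channels)
    x = layers.Reshape((x.shape[1], x.shape[2] * x.shape[3]))(x)
    x = layers.Bidirectional(layers.LSTM(64))(x)     # tanh activation by default
    x = layers.Dense(64, activation="relu")(x)
    out = layers.Dense(n_classes, activation="softmax")(x)
    return models.Model(inp, out)
```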

\

\ To benchmark the performance of our proposed model, we compare it to two standard CNN models, VGG-16 [40] and MobileNet [37]. Since our dataset size is limited even after data augmentation, training these models from scratch on our dataset could cause overfitting. Hence, we used ImageNet-trained weights instead and replaced the dense layers of these models with an architecture similar to the fully connected and softmax layers of our proposed CNN-RNN architecture. The models are then trained with a small learning rate.

\ TABLE I. SUMMARY OF EXISTING LITERATURE ON THE ICBHI DATASET

\ Fig. 2. Boxplot of intra-patient and inter-patient variability of audio features: Intra-patient variability is computed by normalizing each audio feature by the average of that feature for the corresponding patient, while for inter-patient variability the normalization is done by the average of the audio feature over the entire dataset. A diverse set of features is used for comparison, including breathing cycle duration, an energy-related feature (RMS energy), a noise-related feature (ZCR) and spectral features (bandwidth, roll-off). Inter-patient variability is significantly larger than intra-patient variability in all cases.

\ 3) Patient-Specific Model Tuning: Though most existing research concentrates on developing generalized models for classifying respiratory anomalies, the problem with this kind of model is that its performance can deteriorate for a completely new patient due to inter-patient variability. Such inconsistent performance makes classification models unreliable and thus hinders their wide-scale adoption. To qualitatively evaluate the inter-patient variability, we show a boxplot of inter-patient and intra-patient variability for a diverse set of audio features (duration, RMSE, bandwidth, roll-off, ZCR) in Fig. 2. For the intra-patient variability, we normalized each audio feature of a sample by the average of that feature over the samples from that specific patient, while for inter-patient variability we normalized the audio features by the average of that feature over the entire dataset. As evident from the figure, inter-patient variability is significantly larger than intra-patient variability.

\ Also, from a medical professional's perspective, for most chronic respiratory patients some patient data is already available or can be collected, and automated long-term monitoring of the patient's condition after initial treatment is often very important. Though training a model on existing patient-specific data to extract patient-specific features results in a more consistent and reliable patient-specific model, it is often very difficult to collect enough data from a patient to sufficiently train a machine learning model. Since deep learning models require a much larger amount of data for training, the issue is further exacerbated.

\ To address these shortcomings of existing methods, we propose a patient-specific model tuning strategy that can take advantage of deep learning techniques even with the small amount of patient data available. In this proposed model, the deep network is first trained on a large database to learn domain-specific feature representations. Then a smaller part of the network is re-trained on the small amount of patient-specific data available. This enables us to transfer the learned domain-specific knowledge of the deep network to patient-specific models and thus produce consistent patient-specific class predictions with high accuracy. In our proposed model we train the three-stage network on the training samples. Then, for a new patient, only the last stage is re-trained with patient-specific breathing cycles while the learned CNN-RNN stage weights are frozen at their pre-trained values. With our proposed strategy, only ∼ 1.4% of the network parameters are retrained for the patient-specific models. For VGG-16 and MobileNet, the same strategy is applied.
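In Keras terms, this patient-specific re-training step could be sketched as below; exactly which layers constitute the "last stage" and the optimizer settings are our assumptions.

```python
def tune_for_patient(model, patient_x, patient_y, epochs=20):
    """Re-train only the final FC + softmax stage on one patient's breathing cycles.

    The CNN and Bi-LSTM stages keep their pre-trained weights frozen, so only a
    small fraction of the parameters (the dense layers) is updated.
    """
    for layer in model.layers:
        layer.trainable = False
    for layer in model.layers[-2:]:          # assumed: the two final Dense layers
        layer.trainable = True
    model.compile(optimizer="adam",
                  loss="sparse_categorical_crossentropy", metrics=["accuracy"])
    model.fit(patient_x, patient_y, epochs=epochs, verbose=0)
    return model
```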

\ 4) Weight Quantization: In the proposed weight quantization scheme, the magnitudes of the weights of each layer are quantized in the log domain. The quantized weight (w̃) can be represented as:

\
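The equation itself is not reproduced in this excerpt. As one plausible reading of a layer-wise ("local") log-domain quantization, and only as an illustration, the exponent of each weight magnitude can be rounded onto a grid of 2^b levels spanning that layer's own dynamic range:

```python
import numpy as np

def log_quantize_layer(w, bits=6):
    """Quantize one layer's weight tensor in the log domain (layer-wise / 'local').

    Illustrative reading: round the log2-magnitude of each weight onto a grid of
    2**bits levels spanning that layer's own dynamic range; keep signs and zeros.
    """
    w = np.asarray(w, dtype=float)
    sign, mag = np.sign(w), np.abs(w)
    nonzero = mag > 0
    if not nonzero.any():
        return np.zeros_like(w)
    log_w = np.log2(mag[nonzero])
    lo, hi = log_w.min(), log_w.max()                 # per-layer ("local") range
    step = max((hi - lo) / (2 ** bits - 1), 1e-12)    # guard against a flat layer
    q = lo + np.round((log_w - lo) / step) * step
    out = np.zeros_like(w)
    out[nonzero] = sign[nonzero] * np.exp2(q)
    return out
```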

\

:::info Authors:

(1) Jyotibdha Acharya (Student Member, IEEE), HealthTech NTU, Interdisciplinary Graduate Program, Nanyang Technological University, Singapore;

(2) Arindam Basu (Senior Member, IEEE), School of Electrical and Electronic Engineering, Nanyang Technological University, Singapore.

:::


:::info This paper is available on arxiv under ATTRIBUTION-NONCOMMERCIAL-SHAREALIKE 4.0 INTERNATIONAL license.

:::

\

Patient-Specific CNN-RNN for Wheeze and Crackle Detection with 4× Memory Savings

2025-09-09 17:26:15

:::info Authors:

(1) Jyotibdha Acharya (Student Member, IEEE), HealthTech NTU, Interdisciplinary Graduate Program, Nanyang Technological University, Singapore;

(2) Arindam Basu (Senior Member, IEEE), School of Electrical and Electronic Engineering, Nanyang Technological University, Singapore.

:::

Table of Links

Abstract and I Introduction

II. Materials and Methods

III. Results and Discussions

IV. Conclusion and References

\ Abstract—The primary objective of this paper is to build classification models and strategies to identify breathing sound anomalies (wheeze, crackle) for automated diagnosis of respiratory and pulmonary diseases. In this work we propose a deep CNN-RNN model that classifies respiratory sounds based on Mel-spectrograms. We also implement a patient-specific model tuning strategy that first screens respiratory patients and then builds patient-specific classification models using limited patient data for reliable anomaly detection. Moreover, we devise a local log quantization strategy for model weights to reduce the memory footprint for deployment in memory-constrained systems such as wearable devices. The proposed hybrid CNN-RNN model achieves a score of 66.31% on four-class classification of breathing cycles for the ICBHI'17 scientific challenge respiratory sound database. When the model is re-trained with patient-specific data, it produces a score of 71.81% for leave-one-out validation. The proposed weight quantization technique achieves ≈ 4× reduction in total memory cost without loss of performance. The main contributions of the paper are as follows: Firstly, the proposed model achieves a state-of-the-art score on the ICBHI'17 dataset. Secondly, deep learning models are shown to successfully learn domain-specific knowledge when pre-trained with breathing data, and produce significantly superior performance compared to generalized models. Finally, local log quantization of trained weights is shown to reduce the memory requirement significantly. This type of patient-specific re-training strategy can be very useful in developing reliable long-term automated patient monitoring systems, particularly in wearable healthcare solutions.

\

I. INTRODUCTION

The two most clinically significant lung sound anomalies are wheeze and crackle. Wheeze is a continuous, high-pitched adventitious sound that results from obstruction of the breathing airway. While normal breathing sounds have the majority of their energy concentrated in 80-1600 Hz [1], wheeze sounds have been shown to be present in the frequency range 100 Hz-2 kHz. Wheeze is normally associated with patients suffering from asthma, chronic obstructive pulmonary disease (COPD), etc. Crackles are explosive and discontinuous sounds present during the inspiratory and expiratory parts of the breathing cycle, with a significantly smaller duration compared to the total breathing cycle. Crackles have been associated with obstructive airway diseases and interstitial lung diseases [2].

\ Auscultation has historically been used for screening and monitoring respiratory diseases. It provides a simple and non-invasive approach to detecting respiratory and cardiovascular diseases based on lung sound abnormalities. But these methods suffer from two disadvantages. Firstly, a trained medical professional is required to diagnose a patient based on adventitious lung sounds, and the disproportionately small number of medical practitioners compared to the overall population therefore limits the speed at which patients can be tested. Secondly, even if the patients are diagnosed by experienced professionals, there might be subjectivity in the diagnosis due to dissimilar interpretations of the respiratory sounds by different medical professionals [3].

\ So, in the past decade several attempts have been made to design algorithms and feature extraction techniques for automated detection of breathing anomalies. Some popular feature extraction techniques include the spectrogram [4], Mel-Frequency Cepstral Coefficients (MFCC) [5], wavelet coefficients [6], entropy-based features [7], etc. Several machine learning (ML) algorithms have been developed in the past few years to detect breathing sound anomalies, such as logistic regression [8], Dynamic Time Warping (DTW), Gaussian mixture models (GMM) [9], random forests [4], Hidden Markov Models (HMM) [10], etc. An exploration of the existing literature reveals some conspicuous issues with these approaches. Firstly, most of the ML algorithms use manually crafted, highly complex features suited to their algorithms, and due to the absence of publicly available datasets it was hard to compare the efficacy of the proposed feature extraction methods and algorithms [11]. Secondly, most of the strategies were developed for a binary classification problem to identify either wheeze or crackle and are therefore not suitable for multi-class classification to detect wheeze and crackle simultaneously [12]. These drawbacks make these approaches difficult to apply in real-world scenarios.

\ Deep learning has gained a lot of attention in recent years due to its unparalleled success in a variety of applications including clinical diagnostics and biomedical engineering [13]. A significant advantage of these deep learning paradigms is that there is no need to manually craft features, since the network learns useful features and abstract representations from the data through training. Due to the wide success of convolutional neural networks (CNN) in image-related tasks, they have been extensively used in biomedical research for image classification [14], anomaly detection [15], image segmentation [16], image enhancement [17], automated report generation [18], etc. There have been multiple successful applications of deep CNNs in the diagnosis of cardiac diseases [19], neurological diseases [20], cancer [21] and ophthalmic diseases [22]. While CNNs have shown significant promise for analyzing image data, recurrent neural networks (RNN) are better suited for learning long-term dependencies in sequential and time-series data [23]. State-of-the-art systems in natural language processing (NLP) and audio and speech processing use deep RNNs to learn sequential and temporal features [24]. Finally, hybrid CNN-RNN models have shown significant success in video analytics [25] and speech recognition [26]. These hybrid models show particular promise in cases where both spatial and temporal/sequential features need to be learned from the data.

\ Since deep learning came into prominence, it has also been used by researchers for audio-based biomedical diagnosis and anomaly detection. Some significant areas of audio-based diagnosis using deep learning include sleep apnea detection, cough sound identification, heart sound classification, etc. Amoh et al. [27] used a chest-mounted sensor to collect audio containing both speech and cough sounds and then used both a CNN and an RNN to identify the cough sounds. Similarly, Nakano et al. [28] trained deep neural networks on tracheal sound spectrograms to detect sleep apnea. In [29], the authors train a CNN architecture to classify heart sounds into normal and abnormal classes. The non-invasive nature of audio-based diagnosis makes it an attractive choice for biomedical applications.

\ A major handicap in training a deep network is that a significantly large dataset and considerable time and resources need to be allocated for the training. While the second issue can be addressed by using dedicated deep learning accelerators (GPU, TPU, etc.), the first issue is even more exacerbated in medical research since medical datasets are sparse and difficult to obtain. One way to circumvent this issue is to use transfer learning. The central idea behind transfer learning is the following: a deep network trained in a domain D1 to perform task T1 can successfully use the learned data representations to perform task T2 in domain D2. The most commonly used method for transfer learning is to train a deep network on a large dataset and then re-train a small section of the network on the (often significantly smaller) data available for the specific task and domain. Transfer learning has been used in medical research for cancer diagnosis [30], prediction of neurological diseases [31], etc. While transfer learning traditionally refers to the transfer of knowledge between two disparate domains, in biomedical research it is also used for knowledge transfer within the same domain, where a model is trained on a larger population dataset and the knowledge is then transferred for context-specific learning on a smaller dataset [32], [33]. This strategy is especially useful for biomedical applications due to the scarcity of domain-specific patient data.

\ Finally, for employing machine learning methods in medical diagnosis, two primary approaches are used. The first is generalized models, where the models are trained on a database of multiple patients' data and tested on new patient data. This type of model learns generalized features present across all patients. While such models are often easier to deploy, they often suffer from inter-patient variability of features and may not produce reliable results for unseen patient data. The second approach is patient-specific models, where the models are trained on patient-specific data to produce more precise results for patient-specific diagnosis. While these models are harder to train due to the difficulty of collecting a large amount of patient-specific data, they often produce very reliable and consistent results [34]. While a patient-specific model requires additional time and effort from healthcare professionals for collecting and labeling the data, especially for chronic diseases where long-term monitoring is of the essence, this additional effort is well compensated by reduced hospitalization and reduced loss of time from work for the patient, resulting from better continuous monitoring [35].

\ Since a large fraction of medical diagnosis algorithms are geared toward wearable devices and mobile platforms, the large memory and computational power requirements of deep learning methods present a considerable challenge for commercial deployment. Weight quantization [7], low-precision computation [36] and lightweight networks [37] are some of the approaches used to address this challenge. Quantizing the weights of a trained network is the most straightforward way to reduce the memory requirement for deployment. DNNs with 8- or 16-bit weights have been shown to achieve accuracy comparable to their full-precision counterparts [7]. Though linear quantization is most commonly used, log quantization has been shown to achieve similar accuracy at lower bit precision [38]. Finally, lightweight networks like MobileNet [37] reduce computational complexity and memory requirements without significant loss of accuracy by replacing traditional convolution layers with depthwise separable convolutions.

\ In this paper we propose a hybrid CNN-RNN model to perform four-class classification of breathing sounds on the International Conference on Biomedical and Health Informatics (ICBHI'17) scientific challenge respiratory sound database [39], and then devise a screen and model tuning strategy to build patient-specific diagnosis models from limited patient data. For comparison of our model with more commonly used CNN architectures, we applied the same methodology to the VGGnet [40] and MobileNet [37] architectures. Finally, we propose a layer-wise logarithmic quantization scheme that can reduce the memory footprint of the networks without significant loss of performance. The sections are organized as follows: Section II describes the dataset, feature extraction method, proposed deep learning model and weight quantization. Section III tabulates the results for the generalized and patient-specific models and the quantization performance. Finally, Section IV discusses the conclusions and main contributions of the paper.

\

:::info This paper is available on arxiv under ATTRIBUTION-NONCOMMERCIAL-SHAREALIKE 4.0 INTERNATIONAL license.

:::

\

Cut 90% of Fine-Tuning Cost—Still Beat Baselines on Text and Vision Benchmarks

2025-09-09 17:13:23

Table of Links

Abstract and 1. Introduction

  1. Related Work

  2. Preliminaries

  3. Proposed Method

  4. Experimental Setup

  5. Results and Analysis

  6. Discussion and Conclusion, and References

    \

A. The Connection Between Prefix-tuning and Hypernetwork

B. Number of Tunable Parameters

C. Input-output formats

6. Results and Analysis

We design a series of experiments on pure language and V&L tasks, in both multi-tasking and few-shot scenarios, to verify the effectiveness of our proposed framework compared to existing methods.

\

6.1. Results on the GLUE Benchmark

\ For Prefix-tuning (Li & Liang, 2021) and MAMAdapter (He et al., 2021), the original implementations are single-task training on BART (Lewis et al., 2020). To make a fair comparison with the other baselines, we apply their methods to T5 in a multi-task training setting. [4] For each model, we share the parameters of both the prefix vectors and the adapter weights across tasks.

\ Overall, our HyperPELT method obtains the best performance with fewer trainable parameters. Compared to single-task Adapters, which finetune all the parameters introduced in the adapters, our method yields a significant improvement of 2.21% with far fewer trainable parameters, which illustrates the effectiveness of our proposed multi-task training framework.

\ In multi-task training, the proposed hypernetwork-based prefix-tuning strategy, HyperPrefix, decreases the number of trainable parameters (e.g., 1.01× for HyperPrefix vs. 1.14× for Prefix-tuning) while achieving better performance (e.g., 86.65% for HyperPrefix vs. 86.09% for Prefix-tuning). Notably, the number of trainable parameters per task is 11× smaller than that of Prefix-tuning.

\ HyperPELT obtains superior performance over HyperPrefix, mainly because we further combine the hypernetwork-based adapters and add them to the feed-forward layers in a parallel manner. In this way, the average performance is further enhanced (+0.44%) while involving only a small number of additional parameters (0.09% of parameters per task). The comparison with MAMAdapter shows that using a hypernetwork to tune each transformer block and learn the knowledge shared across tasks leads to an improvement; a sketch of this parameter-generation idea follows the table caption below.

\ Table 2. Few-shot domain transfer results on five different tasks, averaged across 5 seeds. We compute accuracy for all tasks and datasets. †: Results from the paper of Mahabadi et al. (2021). HyperPELT and HyperPELT TaskEmbed respectively finetune the hypernetworks with all hyper-embeddings and with only the task embedding during few-shot learning.
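To make the hypernetwork idea concrete, the sketch below shows a task embedding mapped by a small hypernetwork to the down- and up-projection weights of a bottleneck adapter. This is our own minimal illustration; the dimensions, module names, and single-adapter scope are assumptions, not the paper's implementation.

```python
import torch
import torch.nn as nn

class AdapterHyperNet(nn.Module):
    """Generates bottleneck-adapter weights from a task embedding, so the
    per-task trainable state is only the embedding plus this shared network."""

    def __init__(self, task_emb_dim=64, hidden_dim=768, bottleneck=24):
        super().__init__()
        self.hidden_dim, self.bottleneck = hidden_dim, bottleneck
        out = 2 * hidden_dim * bottleneck  # down-proj + up-proj weights
        self.generator = nn.Linear(task_emb_dim, out)

    def forward(self, task_emb):
        flat = self.generator(task_emb)
        down_w, up_w = flat.split(self.hidden_dim * self.bottleneck)
        down = down_w.view(self.bottleneck, self.hidden_dim)
        up = up_w.view(self.hidden_dim, self.bottleneck)
        return down, up

def adapter_forward(hidden_states, down, up):
    # Standard bottleneck adapter applied with generated weights,
    # added as a parallel residual branch.
    z = torch.relu(hidden_states @ down.t())
    return hidden_states + z @ up.t()

hypernet = AdapterHyperNet()
task_embedding = torch.randn(64)          # one learned vector per task
down, up = hypernet(task_embedding)
h = torch.randn(2, 16, 768)               # (batch, seq_len, hidden)
h_out = adapter_forward(h, down, up)
```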

\

6.2. Few-shot Domain Transfer

We take the models trained on GLUE, as reported in Table 1, and evaluate them on the test sets of five different tasks after few-shot finetuning on each target training set. Following Mahabadi et al. (2021), we initialize the task embedding with the one trained on the most similar GLUE task, i.e., MNLI for CB, QNLI for QA, SST-2 for sentiment analysis, and QQP for paraphrase detection.

\ As suggested by Perez et al. (2021) and Zhao & Schütze (2021), we randomly select the same number of samples from the training and validation sets, making it a reasonable few-shot scenario. Checkpoints are selected via early stopping on the selected validation set, with each task's default metric as the stopping criterion.

\ In the first three columns of Table 2, we show the results of full fine-tuning of T5BASE, HYPERFORMER++ (finetuning both hypernetworks and task embeddings) and our proposed HyperPELT. Overall, our method achieves the best performance in the few-shot tasks.

\ For the CB and BoolQ tasks from SuperGLUE, even though the T5 backbone was previously trained on their training sets, the performance of the methods differs considerably. The two baselines still fail with very few samples (e.g., 4 or 16), while our method performs significantly better. We therefore assume that the two baselines suffer to some degree from catastrophic forgetting during multi-task training. In contrast, our proposed HyperPELT works effectively on these two tasks. We speculate that the reason may be the use of hypernetworks on both the prefix-tuning and adapter-tuning modules of the transformer. We leave this exploration to future work.

\

\

6.3. Results on the Vision-Language Benchmarks

We now move to experiments applying the proposed hypernetwork-based parameter-efficient training framework to V&L tasks. We compare against the pre-trained, fully fine-tuned VL-T5 (Cho et al., 2021) and other adapter-based methods built on top of T5, i.e., CLIP-T5 and VL-Adapter (Sung et al., 2021), in the multi-task training setting.

\ Table 3. Experimental results on popular vision-language benchmarks. We report the VQA score for VQA, the GQA score for GQA, and various metrics for image captioning (B@4: BLEU@4, M: METEOR, C: CIDEr, S: SPICE). †: Results from the papers of Cho et al. (2021) and Sung et al. (2021); ♠: our re-implementation of Sung et al. (2021). Following Cho et al. (2021), we use the VQA Karpathy split, which divides the VQA dataset into 605,102 / 26,729 / 26,280 image-question pairs as the train/validation/test sets, to evaluate VQA in a generative manner.

\ Table 4. Few-shot domain transfer results on two different V&L tasks, averaged across 5 seeds. We report the VQA score on the OKVQA validation split and accuracy on the SNLI-VE test-P split.

\

\ To the best of our knowledge, we are the first to employ the visual modality to tune a very small number of parameters across transformer blocks, instead of inserting image patch tokens into the input sequence as is usually done. Experimental results evidence the effectiveness of this novel approach, providing a new perspective on how to extend multi-modality capability on top of PLMs: use the features from other modalities as the input of a hypernetwork that generates parameters for modules in the PLM, rather than as part of the input sequence. One advantage of our approach is that it preserves the original maximum text input length, since no visual or audio features occupy it. This is promising for document-level and text-heavy tasks such as multimodal summarization (Zhang et al., 2022).
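The sketch below illustrates the idea of conditioning prefix vectors on image features rather than appending patch tokens to the text sequence. The dimensions, the mean-pooling of grid features, and the projection layers are our own illustrative assumptions rather than the paper's exact design.

```python
import torch
import torch.nn as nn

class VisualPrefixHyperNet(nn.Module):
    """Maps pooled image features to per-layer key/value prefix vectors
    that are prepended inside multi-head attention, leaving the text
    input sequence (and its maximum length) untouched."""

    def __init__(self, img_dim=2048, hidden_dim=768, prefix_len=8, n_layers=12):
        super().__init__()
        self.prefix_len, self.n_layers, self.hidden_dim = prefix_len, n_layers, hidden_dim
        self.proj = nn.Sequential(
            nn.Linear(img_dim, hidden_dim), nn.Tanh(),
            nn.Linear(hidden_dim, n_layers * 2 * prefix_len * hidden_dim),
        )

    def forward(self, grid_features):
        # grid_features: (batch, 49, img_dim) from a 7x7 feature map
        pooled = grid_features.mean(dim=1)
        prefixes = self.proj(pooled)
        # -> (batch, n_layers, 2, prefix_len, hidden_dim): keys and values per layer
        return prefixes.view(-1, self.n_layers, 2, self.prefix_len, self.hidden_dim)

hypernet = VisualPrefixHyperNet()
image_grid = torch.randn(4, 49, 2048)         # e.g. ResNet grid features
layer_prefixes = hypernet(image_grid)          # fed into each attention layer
```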

\ We believe the resulting performance could be further improved with a more sophisticated combination of methods for tuning task-specific and visual-specific parameters in PLMs, but we leave this exploration to future work.

\

6.4. Multimodal Few-shot Learning

We further take the models trained on the V&L tasks, as reported in Table 3, and evaluate them on the test sets after few-shot fine-tuning on OKVQA (Marino et al., 2019) and SNLI-VE (Xie et al., 2018). For OKVQA, since there is no test set, we split its original validation set into two halves, one for validation and the other for testing. For SNLI-VE, we use its validation set for validation and its test-P set for testing and reporting results. We follow the procedure in Section 6.2 to select samples and report the results in Table 4.

\ Compared with full-parameter fine-tuning (CLIP-T5) and the other baseline, VL-Adapter, our method achieves the best performance with smaller variance in this multimodal few-shot learning setting. We find that VL-Adapter is inferior to CLIP-T5 when given fewer samples (e.g., fewer than 500) on the OKVQA dataset. The reason may be that OKVQA involves a lot of out-of-domain knowledge and complex image content, which makes accurate prediction more challenging for the parameter-efficient VL-Adapter. In other words, the small number of samples is not enough to train the randomly initialized parameters introduced by VL-Adapter.

\ Our approach, however, can still cope with fewer samples. We use the hypernetwork to generate the trainable parameters in the adapters and multi-head attention, and we directly integrate image features into the attention modules in the form of prefix-tuning vectors. We believe that such a method, despite training fewer parameters, can still capture knowledge across tasks and transfer it in a few-shot setting. It is also worth noting that, over the five random seeds used, the variance of our method is generally smaller than that of VL-Adapter, which indicates that our method is more robust in this few-shot learning scenario.

\

7. Discussion and Conclusion

In this paper, we propose a unified parameter-efficient tuning framework for multiple tasks, covering both pure language and vision-and-language (V&L) tasks. On the one hand, we use a hypernetwork to reduce the scale of trainable parameters of the existing adapter-tuning and prefix-tuning modules. On the other hand, for V&L tasks, we directly integrate image features into the multi-head attention in the form of prefix vectors, which further reduces the number of trainable parameters needed to process visual input. Extensive experiments on pure language and V&L tasks demonstrate the superiority of our framework in both multi-tasking and few-shot settings. In the future, we plan to explore more combinations of methods for tuning task-specific and visual-specific parameters across the different modules of a pretrained language model.

\

References

Aharoni, R., Johnson, M., and Firat, O. Massively multilingual neural machine translation. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pp. 3874–3884, Minneapolis, Minnesota, June 2019. Association for Computational Linguistics. doi: 10.18653/v1/N19-1388. URL https://aclanthology.org/N19-1388.

\ Cho, J., Lei, J., Tan, H., and Bansal, M. Unifying vision-and-language tasks via text generation. In Meila, M. and Zhang, T. (eds.), Proceedings of the 38th International Conference on Machine Learning, ICML 2021, 18-24 July 2021, Virtual Event, volume 139 of Proceedings of Machine Learning Research, pp. 1931–1942. PMLR, 2021. URL http://proceedings.mlr.press/v139/cho21a.html.

\ Devlin, J., Chang, M., Lee, K., and Toutanova, K. BERT: pre-training of deep bidirectional transformers for language understanding. In Burstein, J., Doran, C., and Solorio, T. (eds.), Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, NAACL-HLT 2019, Minneapolis, MN, USA, June 2-7, 2019, Volume 1 (Long and Short Papers), pp. 4171–4186. Association for Computational Linguistics, 2019. doi: 10.18653/v1/n19-1423. URL https://doi.org/10.18653/v1/n19-1423.

\ Goyal, Y., Khot, T., Summers-Stay, D., Batra, D., and Parikh, D. Making the V in VQA matter: Elevating the role of image understanding in visual question answering. In 2017 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2017, Honolulu, HI, USA, July 21-26, 2017, pp. 6325–6334. IEEE Computer Society, 2017. doi: 10.1109/CVPR.2017.670. URL https://doi.org/10.1109/CVPR.2017.670.

\ Ha, D., Dai, A. M., and Le, Q. V. Hypernetworks. In 5th International Conference on Learning Representations, ICLR 2017, Toulon, France, April 24-26, 2017, Conference Track Proceedings. OpenReview.net, 2017. URL https://openreview.net/forum?id=rkpACe1lx.

\ He, J., Zhou, C., Ma, X., Berg-Kirkpatrick, T., and Neubig, G. Towards a unified view of parameter-efficient transfer learning. CoRR, abs/2110.04366, 2021. URL https://arxiv.org/abs/2110.04366.

\ He, Y., Zheng, H. S., Tay, Y., Gupta, J. P., Du, Y., Aribandi, V., Zhao, Z., Li, Y., Chen, Z., Metzler, D., Cheng, H., and Chi, E. H. Hyperprompt: Prompt-based task-conditioning of transformers. CoRR, abs/2203.00759, 2022. URL https://arxiv.org/abs/2203.00759.

\ Houlsby, N., Giurgiu, A., Jastrzebski, S., Morrone, B., de Laroussilhe, Q., Gesmundo, A., Attariyan, M., and Gelly, S. Parameter-efficient transfer learning for NLP. In Chaudhuri, K. and Salakhutdinov, R. (eds.), Proceedings of the 36th International Conference on Machine Learning, ICML 2019, 9-15 June 2019, Long Beach, California, USA, volume 97 of Proceedings of Machine Learning Research, pp. 2790–2799. PMLR, 2019. URL http://proceedings.mlr.press/v97/houlsby19a.html.

\ Hudson, D. A. and Manning, C. D. GQA: A new dataset for real-world visual reasoning and compositional question answering. In IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2019, Long Beach, CA, USA, June 16-20, 2019, pp. 6700–6709. Computer Vision Foundation / IEEE, 2019. doi: 10.1109/CVPR.2019.00686.

\ Krishna, R., Zhu, Y., Groth, O., Johnson, J., Hata, K., Kravitz, J., Chen, S., Kalantidis, Y., Li, L., Shamma, D. A., Bernstein, M. S., and Fei-Fei, L. Visual genome: Connecting language and vision using crowdsourced dense image annotations. Int. J. Comput. Vis., 123(1):32–73, 2017. doi: 10.1007/s11263-016-0981-7. URL https://doi.org/10.1007/s11263-016-0981-7.

\ Lee, J., Tang, R., and Lin, J. What would elsa do? freezing layers during transformer fine-tuning. arXiv preprint arXiv:1911.03090, 2019.

\ Lester, B., Al-Rfou, R., and Constant, N. The power of scale for parameter-efficient prompt tuning. CoRR, abs/2104.08691, 2021. URL https://arxiv.org/abs/2104.08691.

\ Lewis, M., Liu, Y., Goyal, N., Ghazvininejad, M., Mohamed, A., Levy, O., Stoyanov, V., and Zettlemoyer, L. BART: denoising sequence-to-sequence pre-training for natural language generation, translation, and comprehension. In Jurafsky, D., Chai, J., Schluter, N., and Tetreault, J. R. (eds.), Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, ACL 2020, Online, July 5-10, 2020, pp. 7871–7880. Association for Computational Linguistics, 2020. doi: 10.18653/v1/2020.acl-main.703. URL https://doi.org/10.18653/v1/2020.acl-main.703.

\ Li, X. L. and Liang, P. Prefix-tuning: Optimizing continuous prompts for generation. In Zong, C., Xia, F., Li, W., and Navigli, R. (eds.), Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing, ACL/IJCNLP 2021, (Volume 1: Long Papers), Virtual Event, August 1-6, 2021, pp. 4582–4597. Association for Computational Linguistics, 2021. doi: 10.18653/v1/2021.acl-long.353. URL https://doi.org/10.18653/v1/2021.acl-long.353.

\ Lin, T., Maire, M., Belongie, S. J., Hays, J., Perona, P., Ramanan, D., Dollár, P., and Zitnick, C. L. Microsoft COCO: common objects in context. In Fleet, D. J., Pajdla, T., Schiele, B., and Tuytelaars, T. (eds.), Computer Vision - ECCV 2014 - 13th European Conference, Zurich, Switzerland, September 6-12, 2014, Proceedings, Part V, volume 8693 of Lecture Notes in Computer Science, pp. 740–755. Springer, 2014. doi: 10.1007/978-3-319-10602-1_48. URL https://doi.org/10.1007/978-3-319-10602-1_48.

\ Liu, P., Yuan, W., Fu, J., Jiang, Z., Hayashi, H., and Neubig, G. Pre-train, prompt, and predict: A systematic survey of prompting methods in natural language processing. CoRR, abs/2107.13586, 2021a. URL https://arxiv.org/abs/2107.13586.

\ Liu, Y., Agarwal, S., and Venkataraman, S. Autofreeze: Automatically freezing model blocks to accelerate finetuning. arXiv preprint arXiv:2102.01386, 2021b.

\ Maas, A. L., Daly, R. E., Pham, P. T., Huang, D., Ng, A. Y., and Potts, C. Learning word vectors for sentiment analysis. In Lin, D., Matsumoto, Y., and Mihalcea, R. (eds.), The 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies, Proceedings of the Conference, 19-24 June, 2011, Portland, Oregon, USA, pp. 142–150. The Association for Computer Linguistics, 2011. URL https://aclanthology.org/P11-1015/.

\ Mahabadi, R. K., Ruder, S., Dehghani, M., and Henderson, J. Parameter-efficient multi-task fine-tuning for transformers via shared hypernetworks. In Zong, C., Xia, F., Li, W., and Navigli, R. (eds.), Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing, ACL/IJCNLP 2021, (Volume 1: Long Papers), Virtual Event, August 1-6, 2021, pp. 565–576. Association for Computational Linguistics, 2021. doi: 10.18653/v1/2021.acl-long.47. URL https://doi.org/10.18653/v1/2021.acl-long.47.

\ Mao, Y., Mathias, L., Hou, R., Almahairi, A., Ma, H., Han, J., Yih, W., and Khabsa, M. Unipelt: A unified framework for parameter-efficient language model tuning. CoRR, abs/2110.07577, 2021. URL https://arxiv.org/abs/2110.07577.

\ Marino, K., Rastegari, M., Farhadi, A., and Mottaghi, R. OK-VQA: A visual question answering benchmark requiring external knowledge. In IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2019, Long Beach, CA, USA, June 16-20, 2019, pp. 3195–3204. Computer Vision Foundation / IEEE, 2019. doi: 10.1109/CVPR.2019.00331.

\ Perez, E., Kiela, D., and Cho, K. True few-shot learning with language models. CoRR, abs/2105.11447, 2021. URL https://arxiv.org/abs/2105.11447.

\ Radford, A., Wu, J., Child, R., Luan, D., Amodei, D., Sutskever, I., et al. Language models are unsupervised multitask learners. OpenAI blog, 1(8):9, 2019.

\ Radford, A., Kim, J. W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., Sastry, G., Askell, A., Mishkin, P., Clark, J., Krueger, G., and Sutskever, I. Learning transferable visual models from natural language supervision. In Meila, M. and Zhang, T. (eds.), Proceedings of the 38th International Conference on Machine Learning, ICML 2021, 18-24 July 2021, Virtual Event, volume 139 of Proceedings of Machine Learning Research, pp. 8748–8763. PMLR, 2021. URL http://proceedings.mlr.press/v139/radford21a.html.

\ Raffel, C., Shazeer, N., Roberts, A., Lee, K., Narang, S., Matena, M., Zhou, Y., Li, W., and Liu, P. J. Exploring the limits of transfer learning with a unified text-to-text transformer. J. Mach. Learn. Res., 21:140:1–140:67, 2020. URL http://jmlr.org/papers/v21/20-074.html.

\ Sanh, V., Webson, A., Raffel, C., Bach, S. H., Sutawika, L., Alyafeai, Z., Chaffin, A., Stiegler, A., Scao, T. L., Raja, A., Dey, M., Bari, M. S., Xu, C., Thakker, U., Sharma, S., Szczechla, E., Kim, T., Chhablani, G., Nayak, N. V., Datta, D., Chang, J., Jiang, M. T., Wang, H., Manica, M., Shen, S., Yong, Z. X., Pandey, H., Bawden, R., Wang, T., Neeraj, T., Rozen, J., Sharma, A., Santilli, A., Févry, T., Fries, J. A., Teehan, R., Biderman, S., Gao, L., Bers, T., Wolf, T., and Rush, A. M. Multitask prompted training enables zero-shot task generalization. CoRR, abs/2110.08207, 2021. URL https://arxiv.org/abs/2110.08207.

\ Sung, Y.-L., Cho, J., and Bansal, M. VL-Adapter: Parameter-efficient transfer learning for vision-and-language tasks. 2021.

\ Tsimpoukelli, M., Menick, J., Cabi, S., Eslami, S. M. A., Vinyals, O., and Hill, F. Multimodal few-shot learning with frozen language models. CoRR, abs/2106.13884, 2021. URL https://arxiv.org/abs/2106.13884.

\ Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., Kaiser, L., and Polosukhin, I. Attention is all you need. In Guyon, I., von Luxburg, U., Bengio, S., Wallach, H. M., Fergus, R., Vishwanathan, S. V. N., and Garnett, R. (eds.), Advances in Neural Information Processing Systems 30: Annual Conference on Neural Information Processing Systems 2017, December 4-9, 2017, Long Beach, CA, USA, pp. 5998–6008, 2017. URL https://proceedings.neurips.cc/paper/2017/hash/3f5ee243547dee91fbd053c1c4a845aa-Abstract.html.

\ von Oswald, J., Henning, C., Sacramento, J., and Grewe, B. F. Continual learning with hypernetworks. In 8th International Conference on Learning Representations, ICLR 2020, Addis Ababa, Ethiopia, April 26-30, 2020. OpenReview.net, 2020. URL https://openreview.net/forum?id=SJgwNerKvB.

\ Wang, A., Pruksachatkun, Y., Nangia, N., Singh, A., Michael, J., Hill, F., Levy, O., and Bowman, S. R. SuperGLUE: A stickier benchmark for general-purpose language understanding systems. In Wallach, H. M., Larochelle, H., Beygelzimer, A., d'Alché-Buc, F., Fox, E. B., and Garnett, R. (eds.), Advances in Neural Information Processing Systems 32: Annual Conference on Neural Information Processing Systems 2019, NeurIPS 2019, December 8-14, 2019, Vancouver, BC, Canada, pp. 3261–3275, 2019a. URL https://proceedings.neurips.cc/paper/2019/hash/4496bf24afe7fab6f046bf4923da8de6-Abstract.html.

\ Wang, A., Singh, A., Michael, J., Hill, F., Levy, O., and Bowman, S. R. GLUE: A multi-task benchmark and analysis platform for natural language understanding. In 7th International Conference on Learning Representations, ICLR 2019, New Orleans, LA, USA, May 6-9, 2019. OpenReview.net, 2019b. URL https://openreview.net/forum?id=rJ4km2R5t7.

\ Xie, N., Lai, F., Doran, D., and Kadav, A. Visual entailment task for visually-grounded language learning. CoRR, abs/1811.10582, 2018. URL http://arxiv.org/abs/1811.10582.

\ Yang, Z., Gan, Z., Wang, J., Hu, X., Lu, Y., Liu, Z., and Wang, L. An empirical study of GPT-3 for few-shot knowledge-based VQA. CoRR, abs/2109.05014, 2021. URL https://arxiv.org/abs/2109.05014.

\ Zhang, T., Wu, F., Katiyar, A., Weinberger, K. Q., and Artzi, Y. Revisiting few-sample BERT fine-tuning. In 9th International Conference on Learning Representations, ICLR 2021, Virtual Event, Austria, May 3-7, 2021. OpenReview.net, 2021. URL https://openreview.net/forum?id=cO1IH43yUF.

\ Zhang, Y., Baldridge, J., and He, L. PAWS: paraphrase adversaries from word scrambling. In Burstein, J., Doran, C., and Solorio, T. (eds.), Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, NAACL-HLT 2019, Minneapolis, MN, USA, June 2-7, 2019, Volume 1 (Long and Short Papers), pp. 1298–1308. Association for Computational Linguistics, 2019. doi: 10.18653/v1/n19-1131. URL https://doi.org/10.18653/v1/n19-1131.

\ Zhang, Z., Meng, X., Wang, Y., Jiang, X., Liu, Q., and Yang, Z. Unims: A unified framework for multimodal summarization with knowledge distillation. AAAI, 2022.

\ Zhao, M. and Schütze, H. Discrete and soft prompting for multilingual models. In Moens, M., Huang, X., Specia, L., and Yih, S. W. (eds.), Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, EMNLP 2021, Virtual Event / Punta Cana, Dominican Republic, 7-11 November, 2021, pp. 8547–8555. Association for Computational Linguistics, 2021. doi: 10.18653/v1/2021.emnlp-main.672. URL https://doi.org/10.18653/v1/2021.emnlp-main.672.

\

:::info Authors:

(1) Zhengkun Zhang, equal contribution; work done during an internship at Noah’s Ark Lab, Huawei Technologies;

(2) Wenya Guo, TKLNDST, CS, Nankai University, China ([email protected]);

(3) Xiaojun Meng, Noah’s Ark Lab, Huawei Technologies (equal contribution);

(4) Yasheng Wang, Noah’s Ark Lab, Huawei Technologies;

(5) Yadao Wang, Noah’s Ark Lab, Huawei Technologies;

(6) Xin Jiang, Noah’s Ark Lab, Huawei Technologies;

(7) Qun Liu, Noah’s Ark Lab, Huawei Technologies;

(8) Zhenglu Yang, TKLNDST, CS, Nankai University, China.

:::


:::info This paper is available on arxiv under CC BY 4.0 DEED license.

:::

[4] For adapting Prefix-tuning from BART to T5, a noteworthy point is that, since they use different position embeddings, i.e., absolute position embeddings for BART and relative position embeddings for T5, it is necessary to manually concatenate all-zero vectors onto the relative position bias of each layer in T5; a minimal sketch of this padding follows.
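The sketch below illustrates the padding step described in this footnote. The tensor shape (num_heads, query_len, key_len) and the helper name are our own assumptions about a typical T5-style bias, not code from the paper.

```python
import torch

def pad_relative_bias_for_prefix(position_bias, prefix_len):
    """Prepend all-zero columns to a T5-style relative position bias so the
    attention scores over the `prefix_len` prefix key/value vectors receive
    no positional offset.

    position_bias: tensor of shape (num_heads, query_len, key_len)
    returns: tensor of shape (num_heads, query_len, prefix_len + key_len)
    """
    num_heads, query_len, _ = position_bias.shape
    zeros = torch.zeros(num_heads, query_len, prefix_len,
                        dtype=position_bias.dtype, device=position_bias.device)
    return torch.cat([zeros, position_bias], dim=-1)

# Example: 12 heads, 16 query tokens, 16 key tokens, 8 prefix vectors.
bias = torch.randn(12, 16, 16)
padded = pad_relative_bias_for_prefix(bias, prefix_len=8)  # (12, 16, 24)
```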

Dataset Splits, Vision Encoder, and Hyper-PELT Implementation Details

2025-09-09 17:08:26

Table of Links

Abstract and 1. Introduction

  1. Related Work

  2. Preliminaries

  3. Proposed Method

  4. Experimental Setup

  5. Results and Analysis

  6. Discussion and Conclusion, and References

    \

A. The Connection Between Prefix-tuning and Hypernetwork

B. Number of Tunable Parameters

C. Input-output formats

5. Experimental Setup

5.1. Datasets

Our framework is evaluated on the GLUE benchmark (Wang et al., 2019b) in terms of natural language understanding.

\ This benchmark covers multiple tasks: paraphrase detection (MRPC, QQP), sentiment classification (SST-2), natural language inference (MNLI, RTE, QNLI), and linguistic acceptability (CoLA). The original test sets are not publicly available, so, following Zhang et al. (2021), for datasets with fewer than 10K samples (RTE, MRPC, STS-B, CoLA) we split the original validation set into two halves, one for validation and the other for testing; the sketch below illustrates this split. For the other, larger datasets, we randomly split 1K samples from the training set as our validation data and test on the original validation set.
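A minimal illustration of this split logic using the Hugging Face datasets library; the threshold set, seed, and helper name are our own choices, since the paper does not specify the tooling.

```python
from datasets import load_dataset

LOW_RESOURCE = {"rte", "mrpc", "stsb", "cola"}  # fewer than 10K training samples

def make_splits(task_name, seed=42):
    """Rebuild train/validation/test splits when the original test set is hidden."""
    data = load_dataset("glue", task_name)
    if task_name in LOW_RESOURCE:
        # Split the original validation set into two halves: validation + test.
        halves = data["validation"].train_test_split(test_size=0.5, seed=seed)
        return data["train"], halves["train"], halves["test"]
    # Larger datasets: carve 1K validation examples out of the training set
    # and use the original validation set as the test set.
    carved = data["train"].train_test_split(test_size=1000, seed=seed)
    return carved["train"], carved["test"], data["validation"]

train_set, val_set, test_set = make_splits("rte")
```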

\ In addition, we evaluate few-shot domain transfer performance on four tasks and datasets: 1) the natural language inference (NLI) dataset CB and 2) the question answering (QA) dataset BoolQ from SuperGLUE (Wang et al., 2019a); 3) the sentiment analysis dataset IMDB (Maas et al., 2011); and 4) the paraphrase detection dataset PAWS (Zhang et al., 2019). For CB and BoolQ, since the test sets are not available, we split the validation set into two halves, one for validation and the other for testing. For IMDB, since no validation set is available, we similarly split the test set to form one. For PAWS, we report on the original test set.

\ To evaluate our framework on V&L tasks, we experiment on four datasets: COCO (Lin et al., 2014), VQA (Goyal et al., 2017), VG-QA (Krishna et al., 2017), and GQA (Hudson & Manning, 2019). We further evaluate our framework for multi-modal few-shot transfer learning on OKVQA (Marino et al., 2019) and SNLI-VE (Xie et al., 2018).

5.2. Implementation Details

\ To evaluate our framework in vision-language scenarios, we follow Cho et al. (2021) and convert V&L tasks to a text-generation format. We use ResNet101 as our vision encoder and initialize it with CLIP (Radford et al., 2021) [3] pretrained weights. Input images are resized to 224 × 224 for memory efficiency, and we extract the 7 × 7 grid features produced by the last convolutional layer (see the sketch after Table 1's caption below). The percentage of updated parameters is also reported as a metric of approach efficiency; we do not count the visual encoder, since it is frozen in our experiments. We count the number of tunable parameters and list the input-output formats of each task in Appendices B and C.

\ Table 1. Performance of all models on the GLUE tasks. For each method, we report the total number of parameters across all tasks and the number of parameters trained per task, as a multiple and a proportion, respectively, of the baseline single-task T5 model. For MNLI, we report accuracy on the matched validation set. For MRPC and QQP, we report accuracy and F1. For STS-B, we report Pearson and Spearman correlation coefficients. For CoLA, we report Matthews correlation. For all other tasks, we report accuracy. †: Results from the implementation of Mahabadi et al. (2021); ♠: our re-implementation of Mahabadi et al. (2021); ♣: we implement the methods of Li & Liang (2021) and He et al. (2021) on top of T5.
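For illustration, the sketch below extracts 7 × 7 grid features from a ResNet101 trunk using torchvision. Loading CLIP-pretrained weights into this backbone is assumed rather than shown; the torchvision ImageNet weights used here are only a stand-in.

```python
import torch
import torch.nn as nn
from torchvision import models

# Stand-in backbone; in practice the weights would come from CLIP's ResNet101.
resnet = models.resnet101(weights=models.ResNet101_Weights.DEFAULT)

# Drop the average-pooling and classification head, keeping the convolutional trunk.
trunk = nn.Sequential(*list(resnet.children())[:-2])
trunk.eval()
for p in trunk.parameters():
    p.requires_grad = False   # the visual encoder stays frozen

with torch.no_grad():
    images = torch.randn(4, 3, 224, 224)        # batch of resized images
    grid = trunk(images)                         # (4, 2048, 7, 7) grid features
    grid = grid.flatten(2).transpose(1, 2)       # (4, 49, 2048) for downstream modules
```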

\

:::info Authors:

(1) Zhengkun Zhang, equal contribution; work done during an internship at Noah’s Ark Lab, Huawei Technologies;

(2) Wenya Guo, TKLNDST, CS, Nankai University, China ([email protected]);

(3) Xiaojun Meng, Noah’s Ark Lab, Huawei Technologies (equal contribution);

(4) Yasheng Wang, Noah’s Ark Lab, Huawei Technologies;

(5) Yadao Wang, Noah’s Ark Lab, Huawei Technologies;

(6) Xin Jiang, Noah’s Ark Lab, Huawei Technologies;

(7) Qun Liu, Noah’s Ark Lab, Huawei Technologies;

(8) Zhenglu Yang, TKLNDST, CS, Nankai University, China.

:::


:::info This paper is available on arxiv under CC BY 4.0 DEED license.

:::

[2] https://huggingface.co/t5-base

\ [3] https://github.com/openai/CLIP