Reducing annotation burden in physical activity research using vision language models
Wasfy, M. M. & Lee, I.-M. Examining the dose–response relationship between physical activity and health outcomes. NEJM Evid. 1(12), EVIDra2200190 (2022).
Servais, L. et al. First regulatory qualification of a digital primary endpoint to measure treatment efficacy in DMD. Nat. Med. 29(10), 2391–2392 (2023).
Troiano, R. P., Stamatakis, E. & Bull, F. C. How can global physical activity surveillance adapt to evolving physical activity guidelines? Needs, challenges and future directions. Br. J. Sports Med. 54(24), 1468–1473 (2020).
Logacjov, A., Herland, S., Ustad, A. & Bach, K. SelfPAB: Large-scale pre-training on accelerometer data for human activity recognition. Appl. Intell. 54(6), 4545–4563 (2024).
Yuan, H. et al. Self-supervised learning for human activity recognition using 700,000 person-days of wearable data. NPJ Digit. Med. 7(1), 91 (2024).
Walmsley, R. et al. Reallocation of time between device-measured movement behaviours and risk of incident cardiovascular disease. Br. J. Sports Med. 56(18), 1008–1017 (2022).
Willetts, M., Hollowell, S., Aslett, L., Holmes, C. & Doherty, A. Statistical machine learning of sleep and physical activity phenotypes from sensor data in 96,220 UK Biobank participants. Sci. Rep. 8(1), 7961 (2018).
Doherty, A. et al. Large scale population assessment of physical activity using wrist worn accelerometers: The UK Biobank study. PLoS ONE 12(2), e0169649 (2017).
Bao, L. & Intille, S. S. Activity recognition from user-annotated acceleration data. In International Conference on Pervasive Computing, 1–17 (Springer, 2004).
Keadle, S. K., Lyden, K. A., Strath, S. J., Staudenmayer, J. W. & Freedson, P. S. A framework to evaluate devices that assess physical behavior. Exerc. Sport Sci. Rev. 47(4), 206–214 (2019).
Thomaz, E. & Dimiccoli, M. Acquisition and analysis of camera sensor data (lifelogging). In Mobile Sensing in Psychology: Methods and Applications, 277 (2023).
Tufte, E. R. The Visual Display of Quantitative Information 2nd edn. (Graphics Press, 2002).
Tremblay, M. S. et al. Sedentary behavior research network (SBRN)-terminology consensus project process and outcome. Int. J. Behav. Nutr. Phys. Act. 14, 1–17 (2017).
Ainsworth, B. E. et al. 2011 compendium of physical activities: A second update of codes and met values. Med. Sci. Sports Exerc. 43(8), 1575–1581 (2011).
Keadle, S. K. et al. Using computer vision to annotate video-recoded direct observation of physical behavior. Sensors 24(7), 2359 (2024).
Schalkamp, A.-K., Peall, K. J., Harrison, N. A. & Sandor, C. Wearable movement-tracking data identify Parkinson’s disease years before clinical diagnosis. Nat. Med. 29(8), 2048–2056 (2023).
Shreves, A. H., Small, S. R., Travis, R. C., Matthews, C. E. & Doherty, A. Dose–response of accelerometer-measured physical activity, step count, and cancer risk in the UK Biobank: A prospective cohort analysis. Lancet 402, S83 (2023).
Bull, F. C. et al. World Health Organization 2020 guidelines on physical activity and sedentary behaviour. Br. J. Sports Med. 54(24), 1451–1462 (2020).
Chan, S. et al. Capture-24: A large dataset of wrist-worn activity tracker data collected in the wild for human activity recognition. Sci. Data 11(1), 1135 (2024).
Kelly, P. et al. An ethical framework for automated, wearable cameras in health behavior research. Am. J. Prev. Med. 44(3), 314–319 (2013).
Ainsworth, B. E., Herrmann, S. D., Jacobs Jr, D. R., Whitt-Glover, M. C. & Tudor-Locke, C. A brief history of the compendium of physical activities. J. Sport Health Sci. 13(1), 3 (2024).
Bureau of Labor Statistics. American Time Use Survey, 2024. Accessed 13 May 2024.
Herath, S., Harandi, M. & Porikli, F. Going deeper into action recognition: A survey. Image Vis. Comput. 60, 4–21 (2017).
Chen, Y. et al. Device-measured movement behaviours in over 20,000 China Kadoorie Biobank participants. Int. J. Behav. Nutr. Phys. Act. 20(1), 138 (2023).
Byrne, N. M., Hills, A. P., Hunter, G. R., Weinsier, R. L. & Schutz, Y. Metabolic equivalent: One size does not fit all. J. Appl. Physiol. 99, 1112–1119 (2005).
Walmsley, R. Device-Measured 24-Hour Movement Behaviours and Risk of Incident Cardiovascular Disease. PhD thesis, University of Oxford (2022).
Kozey, S. L., Lyden, K., Howe, C. A., Staudenmayer, J. W. & Freedson, P. S. Accelerometer output and MET values of common physical activities. Med. Sci. Sports Exerc. 42(9), 1776 (2010).
Pober, D. M., Staudenmayer, J., Raphael, C. & Freedson, P. S. Development of novel techniques to classify physical activity mode using accelerometers. Med. Sci. Sports Exerc. 38(9), 1626 (2006).
Montoye, A. H. K., Begum, M., Henning, Z. & Pfeiffer, K. A. Comparison of linear and non-linear models for predicting energy expenditure from raw accelerometer data. Physiol. Meas. 38(2), 343–357 (2017).
Hills, A. P., Mokhtar, N. & Byrne, N. M. Assessment of physical activity and energy expenditure: An overview of objective measures. Front. Nutr. 1, 5 (2014).
Kim, Y., Barry, V. W. & Kang, M. Validation of the ActiGraph GT3X and activPAL accelerometers for the assessment of sedentary behavior. Meas. Phys. Educ. Exerc. Sci. 19(3), 125–137 (2015).
Kerr, J. et al. Using the SenseCam to improve classifications of sedentary behavior in free-living settings. Am. J. Prev. Med. 44(3), 290–296 (2013).
Chasan-Taber, L. et al. Update and novel validation of a pregnancy physical activity questionnaire. Am. J. Epidemiol. 192(10), 1743–1753 (2023).
Nawab, K. A. et al. Accelerometer-measured physical activity and functional behaviours among people on dialysis. Clin. Kidney J. 14(3), 950–958 (2021).
Martinez, J. Accuracy and Precision of Wearable Camera Media Annotations to Estimate Dimensions of Physical Activity and Sedentary Behavior. PhD thesis, University of Wisconsin-Milwaukee (2024).
Giurgiu, M. et al. Quality evaluation of free-living validation studies for the assessment of 24-hour physical behavior in adults via wearables: Systematic review. JMIR mHealth uHealth 10(6), e36377 (2022).
Femiano, R., Werner, C., Wilhelm, M. & Eser, P. Validation of open-source step-counting algorithms for wrist-worn tri-axial accelerometers in cardiovascular patients. Gait Posture 92, 206–211 (2022).
Alphen, H. J. M., Waninge, A., Minnaert, A. E. M. G., Post, W. J. & Putten, A. A. J. Construct validity of the Actiwatch-2 for assessing movement in people with profound intellectual and multiple disabilities. J. Appl. Res. Intell. Disabil. 34(1), 99–110 (2021).
Bach, K. et al. A machine learning classifier for detection of physical activity types and postures during free-living. J. Meas. Phys. Behav. 5(1), 24–31 (2021).
Marcotte, R. T. et al. Estimating sedentary time from a hip- and wrist-worn accelerometer. Med. Sci. Sports Exerc. 52(1), 225 (2020).
Koenders, N. et al. Validation of a wireless patch sensor to monitor mobility tested in both an experimental and a hospital setup: A cross-sectional study. PLoS ONE 13(10), e0206304 (2018).
Gershuny, J. et al. Testing self-report time-use diaries against objective instruments in real time. Sociol. Methodol. 50(1), 318–349 (2020).
Doherty, A. et al. GWAS identifies 14 loci for device-measured physical activity and sleep duration. Nat. Commun. 9(1), 1–8 (2018).
Mann, S. Wearable computing: A first step toward personal imaging. Computer 30(2), 25–32 (1997).
Aizawa, K., Ishijima, K. & Shiina, M. Summarizing wearable video. In Proceedings 2001 International Conference on Image Processing (Cat. No. 01CH37205), Vol. 3, 398–401 (IEEE, 2001).
Bush, V. As we may think. Atl. Mon. 176(1), 101–108 (1945).
Feichtenhofer, C., Fan, H., Malik, J. & He, K. Slowfast networks for video recognition. In Proceedings of the IEEE/CVF International Conference on Computer Vision, 6202–6211 (2019).
Zhang, C.-L., Wu, J. & Li, Y. Actionformer: Localizing moments of actions with transformers. In European Conference on Computer Vision, 492–510 (Springer, 2022).
Momeni, L., Caron, M., Nagrani, A., Zisserman, A. & Schmid, C. Verbs in action: Improving verb understanding in video-language models. In Proceedings of the IEEE/CVF International Conference on Computer Vision, 15579–15591 (2023).
Grauman, K. et al. Ego4d: Around the world in 3,000 hours of egocentric video. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 18995–19012 (2022).
Lin, K. Q. et al. Egocentric video-language pretraining. Adv. Neural Inf. Process. Syst. 35, 7575–7586 (2022).
Pramanick, S. et al. Egovlpv2: Egocentric video-language pre-training with fusion in the backbone. In Proceedings of the IEEE/CVF International Conference on Computer Vision, 5285–5297 (2023).
Bock, M., Van Laerhoven, K. & Moeller, M. Weak-annotation of HAR datasets using vision foundation models. In Proceedings of the 2024 ACM International Symposium on Wearable Computers, ISWC ’24, 55–62 (Association for Computing Machinery, New York, NY, USA, 2024).
Radford, A. et al. Learning transferable visual models from natural language supervision. In International Conference on Machine Learning, 8748–8763 (PMLR, 2021).
Oquab, M. et al. DINOv2: Learning robust visual features without supervision. arXiv:2304.07193 [cs] (2024).
Carreira, J. & Zisserman, A. Quo vadis, action recognition? A new model and the kinetics dataset. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 6299–6308 (2017).
Wang, P. & Smeaton, A. F. Using visual lifelogs to automatically characterize everyday activities. Inf. Sci. 230, 147–161 (2013).
Moghimi, M. et al. Analyzing sedentary behavior in life-logging images. In 2014 IEEE International Conference on Image Processing (ICIP), 1011–1015 (IEEE, 2014).
Castro, D. et al. Predicting daily activities from egocentric images using deep learning. In Proceedings of the 2015 ACM International Symposium on Wearable Computers, 75–82 (2015).
Cartas, A., Marín, J., Radeva, P. & Dimiccoli, M. Recognizing activities of daily living from egocentric images. In Pattern Recognition and Image Analysis: 8th Iberian Conference, IbPRIA 2017, Faro, Portugal, June 20–23, 2017, Proceedings 8, 87–95 (Springer, 2017).
Cartas, A., Radeva, P. & Dimiccoli, M. Activities of daily living monitoring via a wearable camera: Toward real-world applications. IEEE Access 8, 77344–77363 (2020).
Cartas, A., Talavera, E., Radeva, P. & Dimiccoli, M. Understanding event boundaries for egocentric activity recognition from photo-streams. In International Conference on Pattern Recognition, 334–347 (Springer, 2021).
Damen, D. et al. Rescaling egocentric vision: Collection, pipeline and challenges for epic-kitchens-100. Int. J. Comput. Vis. 1–23 (2022).
Grauman, K. et al. Ego-exo4d: Understanding skilled human activity from first-and third-person perspectives. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 19383–19400 (2024).
Li, C. et al. Multimodal foundation models: From specialists to general-purpose assistants. Found. Trends Comput. Graph. Vis. 16(1–2), 1–214 (2024).
Liu, H., Li, C., Wu, Q. & Lee, Y. J. Visual instruction tuning. Adv. Neural Inf. Process. Syst. 36, 34892–34916 (2024).
Schuhmann, C. et al. LAION-5b: An open large-scale dataset for training next generation image-text models. In Thirty-sixth Conference on Neural Information Processing Systems Datasets and Benchmarks Track (2022).
Deng, J. et al. Imagenet: A large-scale hierarchical image database. In 2009 IEEE Conference on Computer Vision and Pattern Recognition, 248–255 (IEEE, 2009).
Udandarao, V. et al. No “zero-shot” without exponential data: Pretraining concept frequency determines multimodal model performance. arXiv preprint arXiv:2404.04125 (2024).
Reimers, N. & Gurevych, I. Sentence-BERT: Sentence embeddings using Siamese BERT-networks. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing (Association for Computational Linguistics, 2019).
Hastie, T., Tibshirani, R., Friedman, J. H. & Friedman, J. H. The Elements of Statistical Learning: Data Mining, Inference, and Prediction Vol. 2 (Springer, 2009).
Pedregosa, F. et al. Scikit-learn: Machine learning in Python. J. Mach. Learn. Res. 12, 2825–2830 (2011).
Goodfellow, I., Bengio, Y. & Courville, A. Deep Learning (MIT Press, 2016).
Li, J., Li, D., Savarese, S. & Hoi, S. Blip-2: Bootstrapping language-image pre-training with frozen image encoders and large language models. arXiv preprint arXiv:2301.12597 (2023).
Chung, H. W. et al. Scaling instruction-finetuned language models. arXiv preprint arXiv:2210.11416 (2022).
Wolf, T. et al. Huggingface's transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019).
He, K., Zhang, X., Ren, S. & Sun, J. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 770–778 (2016).
Dosovitskiy, A. et al. An image is worth 16×16 words: Transformers for image recognition at scale. arXiv:2010.11929 [cs] (2021).
Loshchilov, I. & Hutter, F. Decoupled weight decay regularization. In International Conference on Learning Representations (2019).
Hochreiter, S. & Schmidhuber, J. Long short-term memory. Neural Comput. 9(8), 1735–1780 (1997).
Muller, S. G. & Hutter, F. TrivialAugment: Tuning-free yet state-of-the-art data augmentation. In 2021 IEEE/CVF International Conference on Computer Vision (ICCV), 754–762 (IEEE, 2021).
Mirza, M. J. et al. Lafter: Label-free tuning of zero-shot classifier using language and unlabeled image collections. Adv. Neural Inf. Process. Syst. 36 (2023).
Landis, J. R. & Koch, G. G. The measurement of observer agreement for categorical data. Biometrics 33(1), 159 (1977).
Keadle, S. K. et al. Evaluation of within-and between-site agreement for direct observation of physical behavior across four research groups. J. Meas. Phys. Behav. 1(aop), 1–9 (2023).
Chen, T. & Guestrin, C. Xgboost: A scalable tree boosting system. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 785–794 (2016).
Fang, H.-S. et al. Alphapose: Whole-body regional multi-person pose estimation and tracking in real-time. IEEE Trans. Pattern Anal. Mach. Intell. 45(6), 7157–7173 (2022).
Martinez, J. et al. Validation of wearable camera still images to assess posture in free-living conditions. J. Meas. Phys. Behav. 4, 47–52 (2021).
Wang, L. et al. Parameter-efficient fine-tuning in large language models: A survey of methodologies. Artif. Intell. Rev. 58(8), 227 (2025).
Gu, J. et al. A systematic survey of prompt engineering on vision-language foundation models. arXiv:2307.12980 [cs] (2023).
Tran, Q.-L., Nguyen, B., Jones, G. J. F. & Gurrin, C. Memorilens: A low-cost lifelog camera using Raspberry Pi Zero. In Proceedings of the 2024 International Conference on Multimedia Retrieval, 1255–1259 (2024).
Mamish, J. et al. Nir-sighted: A programmable streaming architecture for low-energy human-centric vision applications. ACM Trans. Embed. Comput. Syst. 23, 1–26 (2024).
Pei, B. et al. EgoVideo: Exploring egocentric foundation model and downstream adaptation. arXiv:2406.18070 [cs] (2024).
Doherty, A. R. et al. Use of wearable cameras to assess population physical activity behaviours: An observational study. Lancet 380, S35 (2012).
Gage, R. et al. Fun, food and friends: A wearable camera analysis of children’s school journeys. J. Transp. Health 30, 101604 (2023).
Mok, T. M., Cornish, F. & Tarr, J. Too much information: Visual research ethics in the age of wearable cameras. Integr. Psychol. Behav. Sci. 49, 309–322 (2015).
Meyer, L. E. et al. Using wearable cameras to investigate health-related daily life experiences: A literature review of precautions and risks in empirical studies. Res. Ethics 18(1), 64–83 (2022).
