A COMPREHENSIVE REVIEW OF OBJECT DETECTION IN ANIMAL AND PLANT USING VISION TRANSFORMER

Authors: Maolan Lin, Zhenchang Gao, Honghao Cai, Wenliang Liao
Journal: Journal of Animal and Plant Sciences (JAPS)
ISSN: 1018-7081 (Print), 2309-8694 (Online)
Volume: 36  Issue: 2  Pages: 321-330
Year: 2026
DOI: https://doi.org/10.36899/JAPS.2026.2.0027
URL: https://doi.org/10.36899/JAPS.2026.2.0027
Publisher: Pakistan Agricultural Scientists Forum

Abstract:
In digital farming, computers serve as the primary sensing eyes, and object detection is the core vision task that locates and counts target objects of interest (plants, fruits and livestock) in various agricultural systems. Vision Transformers (ViTs), an alternative to convolutional neural networks that originated in natural language processing and captures global context through self-attention, have shown great potential in object detection. However, the field of ViT-based detectors remains fragmented, with independent advances in plant and animal studies and a lack of comprehensive analysis connecting these domains. To bridge this gap, we conducted a systematic review, retaining 30 primary studies after a dual screening and quality appraisal process: 20 focused on plant production and 10 on animal production. Our analysis shows that ViT-based models excel in multi-scale representation, complex scene reasoning, and efficient feature extraction. These capabilities yield high accuracy in fruit quality assessment, crop growth monitoring, weed detection, meat grading and livestock behaviour surveillance. However, challenges remain, including high computational complexity, large parameter sizes, environmental variability, small object detection, and data annotation requirements. For researchers and practitioners, this review offers a unified framework for understanding ViT-based detection. It pinpoints cross-domain challenges and concludes with a forward-looking pathway for turning these insights into practical, on-farm solutions.

Keywords: Convolutional neural network, Computer vision, Deep learning, Machine learning, ViT, YOLO