Abstract
We present a systematic review of paradigm shifts in computer vision from 2020 to 2025. The survey centers on Vision Transformers (ViT), large-scale self-supervised learning (contrastive methods, MAE/BEiT), multimodal pretraining (CLIP, SAM), diffusion-based generation, and 3D representations via NeRF. Using a literature-synthesis framework, we compare architectures, training regimes, and the benefits and limits of transfer across major tasks. Evidence shows that transformer families rival or surpass CNNs on dense-prediction tasks (detection, segmentation), while diffusion models enable more stable training and higher-quality generation than GANs. Self-supervised learning reduces labeling cost and improves generalization in low-label regimes. Multimodal models unlock zero-shot and open-vocabulary recognition, and foundation models such as SAM demonstrate general-purpose segmentation. Persisting challenges include data bias, substantial compute and energy demand, and limited explainability. We recommend efficiency-oriented compression (distillation, pruning, quantization), green-AI practices, and guidelines for responsible use of foundation models. The outlook highlights edge/embedded real-time vision, 3D/video understanding, and applications in healthcare, remote sensing, and AR/metaverse. Overall, the period is defined by large-scale pretraining, a shift to transformers, multimodal integration, and advances in 3D, pointing to the next goal: responsible and efficient vision AI.
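To make the transformer shift described above concrete, the following is a minimal PyTorch sketch (not taken from the surveyed works) of the ViT patch-embedding step, the operation that turns an image into the token sequence a transformer consumes; the sizes used (224-pixel images, 16-pixel patches, 768-dimensional embeddings) are assumptions matching the common ViT-Base configuration and are for illustration only.

import torch
import torch.nn as nn

class PatchEmbedding(nn.Module):
    """Split an image into fixed-size patches and linearly project each patch,
    the first step of a Vision Transformer (ViT-Base sizes assumed)."""
    def __init__(self, img_size=224, patch_size=16, in_chans=3, embed_dim=768):
        super().__init__()
        self.num_patches = (img_size // patch_size) ** 2
        # A strided convolution is equivalent to flattening each patch
        # and applying a shared linear projection.
        self.proj = nn.Conv2d(in_chans, embed_dim,
                              kernel_size=patch_size, stride=patch_size)

    def forward(self, x):                    # x: (B, 3, 224, 224)
        x = self.proj(x)                     # (B, 768, 14, 14)
        return x.flatten(2).transpose(1, 2)  # (B, 196, 768) patch tokens

tokens = PatchEmbedding()(torch.randn(1, 3, 224, 224))
print(tokens.shape)  # torch.Size([1, 196, 768])

Position embeddings and the class token, which complete the ViT input before the transformer encoder, are omitted from this sketch.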
Table of Contents
1. Introduction
2. Methods
2.1 Major Research Trends in the Last 5 Years
2.2 Vision Transformer (ViT)
2.3 Rise of Self-supervised Learning
2.4 Multimodal Learning and Vision-Language Models
2.5 Innovation in Generative Models: From GANs to Diffusion Models
2.6 3D Vision and Neural Radiance Fields
2.7 Advanced Object Detection and Image Segmentation
3. Results
4. Discussion
5. Conclusion
References
