Delve into the transformative world of Transformer models with our curated list of top research papers. Whether you're a novice or an expert, these papers provide valuable insights and advancements in deep learning technology. Discover the latest trends and findings in Transformer research here.
It is pointed out that the attention inside these local patches is also essential for building visual transformers with high performance, and a new architecture, namely Transformer iN Transformer (TNT), is explored.
This course explores Transformational Leadership as it relates to workforce dynamics and practices and investigates the history of this theory, including the variety of approaches and salient cultural, gender, and business forces influencing its development over time.
Ze Liu, Yutong Lin, Yue Cao + 5 more
2021 IEEE/CVF International Conference on Computer Vision (ICCV)
A hierarchical Transformer whose representation is computed with Shifted windows, which has the flexibility to model at various scales and has linear computational complexity with respect to image size and will prove beneficial for all-MLP architectures.
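To make the linear-complexity claim concrete, here is a minimal NumPy sketch of window-based self-attention with an optional cyclic shift; it illustrates the general shifted-window idea rather than the paper's implementation, and it omits projections, multiple heads, and the attention masks the real shifted-window scheme uses. Window size and shift are illustrative choices.

    import numpy as np

    def window_attention(x, window=4, shift=0):
        """Self-attention restricted to non-overlapping windows of a feature map.

        x: (H, W, C) feature map. Cost is O(H*W * window**2), i.e. linear in the
        number of pixels for a fixed window size, instead of quadratic global attention.
        """
        H, W, C = x.shape
        if shift:
            # cyclic shift so the new windows straddle the previous window borders
            # (the real scheme also applies attention masks, omitted here)
            x = np.roll(x, (-shift, -shift), axis=(0, 1))
        # partition into (H//window * W//window) windows of window*window tokens
        xw = x.reshape(H // window, window, W // window, window, C)
        xw = xw.transpose(0, 2, 1, 3, 4).reshape(-1, window * window, C)
        # single-head scaled dot-product attention inside each window
        q = k = v = xw                                   # identity projections for brevity
        attn = q @ k.transpose(0, 2, 1) / np.sqrt(C)
        attn = np.exp(attn - attn.max(-1, keepdims=True))
        attn /= attn.sum(-1, keepdims=True)
        out = attn @ v
        # undo the window partition (and the shift)
        out = out.reshape(H // window, W // window, window, window, C)
        out = out.transpose(0, 2, 1, 3, 4).reshape(H, W, C)
        if shift:
            out = np.roll(out, (shift, shift), axis=(0, 1))
        return out

    feat = np.random.randn(8, 8, 16)
    y = window_attention(feat, window=4, shift=2)        # shifted-window pass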
Yinhao Zhu, Yang Yang, Taco Cohen
journal unavailable
It is shown that nonlinear transforms built on Swin-transformers can achieve better compression efficiency than transforms built on convolutional neural networks (ConvNets), while requiring fewer parameters and shorter decoding time.
A. Panigrahi, Sadhika Malladi, Mengzhou Xia + 1 more
ArXiv
This work proposes an efficient construction, Transformer in Transformer (in short, TinT), that allows a transformer to simulate and fine-tune complex models internally during inference (e.g., pre-trained language models), and introduces innovative approximation techniques that allow a TinT model with less than 2 billion parameters to simulate and fine-tune a 125 million parameter transformer model within a single forward pass.
The Absorption Index (AI) remains valid for aged, unsealed transformers as a simple and effective method of non-destructive insulation control. The reasons for a decrease in AI during transformer operation are insulation moistening and contamination. Seven gradation levels of insulation condition and an algorithm of operating procedures are proposed, depending on the value of the measured AI and its variation over time. Along with AI, it is recommended to measure the polarisation index (PI) and PI-2 (the R_600/R_15 ratio).
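For reference, these indices are ratios of insulation-resistance readings taken at fixed times after the test voltage is applied. The PI-2 ratio is defined in the text above; the AI and PI expressions below follow the commonly used definitions and should be checked against the standard actually applied:

    \mathrm{AI} = \frac{R_{60}}{R_{15}}, \qquad \mathrm{PI} = \frac{R_{600}}{R_{60}}, \qquad \mathrm{PI\text{-}2} = \frac{R_{600}}{R_{15}}

where R_t denotes the insulation resistance measured t seconds after energization.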
K. Apel, G. Adey, D. Frisby
journal unavailable
As Apel himself notes in his preface, the expression "Transformation of Philosophy" bears an ambiguity, naming both a change that took place in the development of philosophy as well as Apel's own systematic project. As a historical approach the title characterizes the transformation that philosophy has undergone in 20th century philosophy through an emphasis on the mediation and the configuring power of language. Apel focuses on three main currents, represented by Wittgenstein, Heidegger, and Peirce.
Xin Chen, Bin Yan, Jiawen Zhu + 3 more
2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)
This work presents a novel attention-based feature fusion network, which effectively combines the template and search region features solely using attention, and presents a Transformer tracking method based on the Siamese-like feature extraction backbone, the designed attention-based fusion mechanism, and the classification and regression head.
Dongchen Han, Xuran Pan, Yizeng Han + 2 more
2023 IEEE/CVF International Conference on Computer Vision (ICCV)
This paper proposes a novel Focused Linear Attention module, which introduces a simple yet effective mapping function and an efficient rank restoration module to enhance the expressiveness of self-attention while maintaining low computation complexity.
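For context, linear attention replaces the N×N softmax with a kernel feature map phi applied to queries and keys, so attention can be computed as phi(Q)(phi(K)^T V) in time linear in sequence length. The sketch below uses a plain ReLU-based map as a stand-in for the paper's focused mapping function and omits the rank restoration module.

    import numpy as np

    def linear_attention(q, k, v, phi=lambda t: np.maximum(t, 0) + 1e-6):
        """Kernelized linear attention: O(N * d^2) instead of O(N^2 * d).

        q, k, v: (N, d). phi is a non-negative feature map; a plain ReLU is used
        here as a placeholder for the paper's focused mapping function.
        """
        qf, kf = phi(q), phi(k)               # (N, d)
        kv = kf.T @ v                         # (d, d) summary of keys and values
        z = qf @ kf.sum(axis=0)               # (N,) normalization terms
        return (qf @ kv) / z[:, None]         # (N, d)

    q = np.random.randn(1024, 64)
    k = np.random.randn(1024, 64)
    v = np.random.randn(1024, 64)
    out = linear_attention(q, k, v)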
Nihal Özdoğan
Journal of Innovative Science and Engineering (JISE)
Investigating solutions of differential equations has long been an important issue for scientists, and researchers around the world have proposed different methods for solving them. The type and order of a differential equation determine which method can be chosen to find its solution. One of these methods is the integral transform: the conversion of a real- or complex-valued function into another function by certain algebraic operations. Integral transforms are used to solve many problems in mathematics and engi...
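In general, an integral transform maps a function f into a new function by integrating it against a kernel K(s, t):

    (Tf)(s) = \int_{a}^{b} K(s, t)\, f(t)\, dt

Choosing K(s, t) = e^{-st} with limits 0 and infinity gives the Laplace transform, the classic example of turning a linear ODE with constant coefficients into an algebraic equation.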
Alexander Yom Din, Taelin Karidi, Leshem Choshen + 1 more
ArXiv
A simple method is suggested for casting hidden representations as final representations, bypassing the transformer computation in between using linear transformations; it far exceeds the prevailing practice of inspecting hidden representations from all layers in the space of the final layer.
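A minimal sketch of the idea, under illustrative assumptions (random placeholder activations, a least-squares fit): collect matching intermediate-layer and final-layer hidden states, fit a linear map between them, and use that map to "jump" from an intermediate layer straight to final-layer space.

    import numpy as np

    # Suppose h_mid holds hidden states from an intermediate layer and h_final the
    # matching final-layer states, collected over a calibration set (n tokens, d dims).
    n, d = 5000, 768
    h_mid = np.random.randn(n, d)
    h_final = np.random.randn(n, d)

    # Fit a linear map (plus bias) so that h_mid @ W ~= h_final.
    X = np.hstack([h_mid, np.ones((n, 1))])             # add bias column
    W, *_ = np.linalg.lstsq(X, h_final, rcond=None)     # (d+1, d)

    def jump_to_final(h):
        """Cast intermediate hidden states directly into final-layer space,
        skipping the transformer layers in between."""
        return np.hstack([h, np.ones((len(h), 1))]) @ W

    approx_final = jump_to_final(h_mid[:10])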
Salman Hameed Khan, Muzammal Naseer, Munawar Hayat + 3 more
ACM Computing Surveys (CSUR)
This survey aims to provide a comprehensive overview of the Transformer models in the computer vision discipline with an introduction to fundamental concepts behind the success of Transformers, i.e., self-attention, large-scale pre-training, and bidirectional feature encoding.
Cong Wang, Jinshan Pan, Wei Wang + 5 more
ArXiv
Experimental results show that the UHDformer reduces model size by about ninety-seven percent compared with most state-of-the-art methods while significantly improving performance under different training sets on three UHD image restoration tasks, including low-light image enhancement, image dehazing, and image deblurring.
A protein language model which takes as input a set of sequences in the form of a multiple sequence alignment and is trained with a variant of the masked language modeling objective across many protein families surpasses current state-of-the-art unsupervised structure learning methods by a wide margin.
Ze Liu, Jia Ning, Yue Cao + 4 more
2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)
This paper advocates an inductive bias of locality in video Transformers, which leads to a better speed-accuracy trade-off compared to previous approaches which compute self-attention globally even with spatial-temporal factorization.
Nessrine Omrani, Nada Rejeb, A. Maalaoui + 2 more
IEEE Transactions on Engineering Management
The empirical results show that the technology context (IT infrastructure and digital tools) along with the existing level of innovation are the main drivers that act as stepping stones in digital technology adoption.
It is demonstrated that an approximate kNN lookup into a non-differentiable memory of recent (key, value) pairs improves language modeling across various benchmarks and tasks, including generic webtext, math papers, books, code, as well as formal theorems (Isabelle).
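As a sketch of the retrieval idea: past (key, value) pairs are stored in an external memory and, for each query, the top-k most similar keys are looked up and attended over. The brute-force dot-product search below stands in for the approximate kNN index used in practice, and all shapes and the top-k value are illustrative.

    import numpy as np

    class KVMemory:
        """External memory of past (key, value) pairs with a simple kNN lookup.

        A brute-force dot-product search stands in for an approximate index
        (e.g. a quantized or graph-based ANN structure).
        """
        def __init__(self):
            self.keys, self.values = [], []

        def add(self, k, v):                       # store keys/values from past segments
            self.keys.append(k); self.values.append(v)

        def lookup(self, q, topk=32):
            K = np.concatenate(self.keys)          # (M, d)
            V = np.concatenate(self.values)        # (M, d)
            scores = q @ K.T                       # (n, M) similarity to every stored key
            idx = np.argsort(-scores, axis=-1)[:, :topk]
            return K[idx], V[idx]                  # (n, topk, d) retrieved pairs

    def retrieval_attention(q, mem, topk=32):
        """Attend over the top-k retrieved (key, value) pairs for each query."""
        Kr, Vr = mem.lookup(q, topk)
        attn = np.einsum('nd,nkd->nk', q, Kr) / np.sqrt(q.shape[-1])
        attn = np.exp(attn - attn.max(-1, keepdims=True))
        attn /= attn.sum(-1, keepdims=True)
        return np.einsum('nk,nkd->nd', attn, Vr)

    mem = KVMemory()
    mem.add(np.random.randn(4096, 64), np.random.randn(4096, 64))
    out = retrieval_attention(np.random.randn(8, 64), mem)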
William S. Peebles, Saining Xie
2023 IEEE/CVF International Conference on Computer Vision (ICCV)
A new class of diffusion models based on the transformer architecture is explored, replacing the commonly used U-Net backbone with a transformer that operates on latent patches; these models outperform all prior diffusion models on the class-conditional ImageNet 512×512 and 256×256 benchmarks.
This work designs an Inception mixer to explicitly graft the advantages of convolution and max-pooling for capturing the high-frequency information to Transformers, and introduces a frequency ramp structure, which can effectively trade-off high- and low-frequency components across different layers.
Haoqi Fan, Bo Xiong, K. Mangalam + 4 more
2021 IEEE/CVF International Conference on Computer Vision (ICCV)
This fundamental architectural prior for modeling the dense nature of visual signals is evaluated on a variety of video recognition tasks, where it outperforms concurrent vision transformers that rely on large-scale external pre-training and are 5-10× more costly in computation and parameters.
Xiaoyi Dong, Jianmin Bao, Dongdong Chen + 5 more
2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)
The Cross-Shaped Window self-attention mechanism is developed for computing self-attention in horizontal and vertical stripes in parallel that together form a cross-shaped window, with each stripe obtained by splitting the input feature into stripes of equal width.
Changyeon Kim, Jongjin Park, Jinwoo Shin + 3 more
ArXiv
This paper introduces a new preference model based on the weighted sum of non-Markovian rewards, realizes it with Preference Transformer, a neural architecture that models human preferences using transformers, and demonstrates that Preference Transformer can solve a variety of control tasks using real human preferences, while prior approaches fail to work.
Bo Peng, Eric Alcaide, Quentin G. Anthony + 29 more
ArXiv
This work proposes a novel model architecture, Receptance Weighted Key Value (RWKV), that combines the efficient parallelizable training of transformers with the efficient inference of RNNs, and presents a significant step towards reconciling trade-offs between computational efficiency and model performance in sequence processing tasks.
Xiaohua Zhai, Alexander Kolesnikov, N. Houlsby + 1 more
2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)
A ViT model with two billion parameters is successfully trained, which attains a new state-of-the-art on ImageNet of 90.45% top-1 accuracy and performs well for few-shot transfer.
Lauri Wessel, Abayomi Baiyere, Roxana Ologeanu-Taddeï + 2 more
J. Assoc. Inf. Syst.
An empirically grounded conceptualization is developed that sets these two phenomena apart, finding that there are two distinctive differences: digital transformation activities leverage digital technology in (re)defining an organization’s value proposition, while IT-enabled organizational transformation activities leverage digital technology in supporting the value proposition.
S. Khakale, Dinkar P. Patil
SSRN Electronic Journal
In this paper a new integral transform, namely the Soham transform, is developed and applied to solve linear ordinary differential equations with constant coefficients.
Timothée Darcet, Maxime Oquab, J. Mairal + 1 more
ArXiv
This paper identifies and characterize artifacts in feature maps of both supervised and self-supervised ViT networks, and proposes a simple yet effective solution based on providing additional tokens to the input sequence of the Vision Transformer to fill that role.
Luis Muller, Mikhail Galkin, Christopher Morris + 1 more
ArXiv
A taxonomy of graph transformer architectures is derived, bringing some order to this emerging field by probing how well graph transformers can recover various graph properties, how well they can deal with heterophilic graphs, and to what extent they prevent over-squashing.
Noël Carroll, N. Hassan, I. Junglas + 2 more
European Journal of Information Systems
Some of the key challenges associated with researching digital transformations within the information systems (IS) field are outlined and the importance of shifting the focus on how digital transformations are managed and sustained is stressed.
Zilong Huang, Youcheng Ben, Guozhong Luo + 3 more
ArXiv
A new vision transformer is proposed, named Shuffle Transformer, which is highly efficient and easy to implement by modifying two lines of code; depth-wise convolution is also introduced to complement the spatial shuffle for enhancing neighbor-window connections.
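A hedged reconstruction of what a "spatial shuffle" between window-attention blocks can look like: a fixed permutation that regroups tokens from different windows into the same window, implemented as a reshape and transpose (which is why only a couple of lines change relative to plain window partitioning). Window size and layout are illustrative.

    import numpy as np

    def spatial_shuffle(x, window=4):
        """Spatial shuffle across windows, analogous to channel shuffle: tokens at
        the same relative position in different windows are regrouped into one
        window, so the following window attention mixes information across windows.
        Illustrative reconstruction, not the paper's exact code."""
        H, W, C = x.shape
        g_h, g_w = H // window, W // window
        x = x.reshape(g_h, window, g_w, window, C)
        # swap the "which window" and "where inside the window" axes
        x = x.transpose(1, 0, 3, 2, 4)
        return x.reshape(H, W, C)

    y = spatial_shuffle(np.random.randn(8, 8, 16), window=4)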
This work proposes a novel architecture, called the Energy Transformer, that uses a sequence of attention layers that are purposely designed to minimize a specifically engineered energy function, which is responsible for representing the relationships between the tokens.
Xuran Pan, Tianzhu Ye, Zhuofan Xia + 2 more
2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)
A novel local attention module, Slide Attention, which leverages common convolution operations to achieve high efficiency, flexibility and generalizability and is applicable to a variety of advanced Vision Transformer models and compatible with various hardware devices, and achieves consistently improved performances on comprehensive benchmarks.
Rosena Shintabella, Catur Edi Widodo, Adi Wibowo
International Journal of Innovative Science and Research Technology (IJISRT)
An innovative model is proposed to improve the accuracy of loss-of-life transformer prediction using stacking ensembles enhanced with a genetic algorithm (GA), and the developed framework presents a promising solution for accurate and reliable transformer life prediction.
Anurag Arnab, Mostafa Dehghani, G. Heigold + 3 more
2021 IEEE/CVF International Conference on Computer Vision (ICCV)
This work shows how to effectively regularise the model during training and leverage pretrained image models to be able to train on comparatively small datasets, and achieves state-of-the-art results on multiple video classification benchmarks.
Bernard Moussad, Rahmatullah Roche, Debswapna Bhattacharya
Proceedings of the National Academy of Sciences of the United States of America
The predictive modeling performance of the state-of-the-art protein structure prediction methods built on transformers for 69 protein targets from the recently concluded Critical Assessment of Structure Prediction (CASP15) challenge is reported.
Deepali Jain, K. Choromanski, Sumeet Singh + 4 more
ArXiv
Mnemosyne is a new class of learnable optimizers based on the novel spatio-temporal low-rank implicit attention Transformers that can learn to train entire neural network architectures, including other Transformers, without any task-specific optimizer tuning.
Qingsong Wen, Tian Zhou, Chao Zhang + 4 more
journal unavailable
This paper systematically reviews Transformer schemes for time series modeling by highlighting their strengths as well as limitations, and categorizes time series Transformers based on common tasks including forecasting, anomaly detection, and classification.
P. Faratin, Ray Garcia, Jacomo Corbo
ArXiv
The goal of this article is to offer an organizational framework for making rational choices as enterprises start their transformation journey towards an AI first organization.
Nouha Dziri, Ximing Lu, Melanie Sclar + 13 more
ArXiv
The empirical findings suggest that transformer LLMs solve compositional tasks by reducing multi-step compositional reasoning into linearized subgraph matching, without necessarily developing systematic problem-solving skills.
Patrick Esser, Sumith Kulal, A. Blattmann + 14 more
ArXiv
This work improves existing noise sampling techniques for training rectified flow models by biasing them towards perceptually relevant scales and presents a novel transformer-based architecture for text-to-image generation that uses separate weights for the two modalities and enables a bidirectional flow of information between image and text tokens.
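For context on the rectified-flow objective being referred to (stated here in the commonly used formulation, not quoted from the paper): a noise sample and a data sample are linearly interpolated, and a velocity field is regressed toward the straight-line direction,

    x_t = (1 - t)\,x_0 + t\,\varepsilon, \qquad \mathcal{L} = \mathbb{E}_{t,\,x_0,\,\varepsilon}\,\bigl\| v_\theta(x_t, t) - (\varepsilon - x_0) \bigr\|^2 ,

so changing the noise sampling amounts to reweighting which timesteps t are drawn during training.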
A novel Spike-Driven Self-Attention (SDSA) is proposed, which exploits only mask and addition operations without any multiplication, and thus has up to 87.2× lower computation energy than vanilla self-attention.
Chen Zhu, Wei Ping, Chaowei Xiao + 4 more
journal unavailable
This paper proposes Long-Short Transformer (Transformer-LS), an efficient self-attention mechanism for modeling long sequences with linear complexity for both language and vision tasks, and proposes a dual normalization strategy to account for the scale mismatch between the two attention mechanisms.
Opher Lieber, Barak Lenz, Hofit Bata + 19 more
ArXiv
Jamba is presented, a new base large language model based on a novel hybrid Transformer-Mamba mixture-of-experts (MoE) architecture that provides high throughput and small memory footprint compared to vanilla Transformers, and at the same time state-of-the-art performance on standard language model benchmarks and long-context evaluations.
Mostafa Dehghani, Josip Djolonga, Basil Mustafa + 39 more
ArXiv
A recipe is presented for highly efficient and stable training of a 22B-parameter ViT (ViT-22B), together with a wide variety of experiments on the resulting model, demonstrating the potential for "LLM-like" scaling in vision and providing key steps towards getting there.
René Ranftl, Alexey Bochkovskiy, V. Koltun
2021 IEEE/CVF International Conference on Computer Vision (ICCV)
Dense prediction transformers, an architecture that leverages vision transformers in place of convolutional networks as a backbone for dense prediction tasks, can be fine-tuned on smaller datasets such as NYUv2, KITTI, and Pascal Context, where the architecture also sets the new state of the art.
The Colorization Transformer is presented, a novel approach for diverse high-fidelity image colorization based on self-attention that outperforms the previous state-of-the-art on colorizing ImageNet based on FID results and on a human evaluation in a Mechanical Turk test.
William Merrill, Ashish Sabharwal
ArXiv
This paper aims to demonstrate how transformers’ reasoning can be improved by allowing them to use a “chain of thought” or “scratchpad”, i.e., generate and condition on a sequence of intermediate tokens before answering.
Ali Hassani, Steven Walton, Jiacheng Li + 2 more
2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)
NA is a pixel-wise operation, localizing self-attention to the nearest neighboring pixels; it therefore enjoys linear time and space complexity compared to the quadratic complexity of SA, and is the first efficient and scalable sliding-window attention mechanism for vision.
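As a rough sketch of the pixel-wise neighborhood idea, the NumPy snippet below lets every pixel attend only to its k×k surrounding pixels, which keeps cost linear in the number of pixels for a fixed k; the identity q/k/v projections, edge padding, and explicit neighbor gathering are simplifications for illustration rather than the paper's kernel.

    import numpy as np

    def neighborhood_attention(x, k=3):
        """Self-attention where each pixel attends only to its k x k neighborhood,
        giving linear (in H*W) time and memory for a fixed k."""
        H, W, C = x.shape
        r = k // 2
        xp = np.pad(x, ((r, r), (r, r), (0, 0)), mode="edge")
        # gather the k*k neighbors of every pixel: (H, W, k*k, C)
        nbrs = np.stack([xp[i:i + H, j:j + W] for i in range(k) for j in range(k)], axis=2)
        q = x[:, :, None, :]                              # (H, W, 1, C)
        attn = (q * nbrs).sum(-1) / np.sqrt(C)            # (H, W, k*k)
        attn = np.exp(attn - attn.max(-1, keepdims=True))
        attn /= attn.sum(-1, keepdims=True)
        return (attn[..., None] * nbrs).sum(2)            # (H, W, C)

    y = neighborhood_attention(np.random.randn(8, 8, 16), k=3)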
Ailing Zeng, Mu-Hwa Chen, L. Zhang + 1 more
journal unavailable
Experimental results on nine real-life datasets show that LTSF-Linear surprisingly outperforms existing sophisticated Transformer-based LTSF models in all cases, and often by a large margin.
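The linear models referred to here are, as commonly described, single linear layers mapping the length-L lookback window directly to the length-T forecast horizon, fitted per variate; the least-squares sketch below is an assumption about that model family, not the paper's training code.

    import numpy as np

    L, T = 96, 24          # lookback length and forecast horizon (illustrative)

    def fit_linear_forecaster(series):
        """Fit W (L x T) so that the next T values ~= last L values @ W."""
        X = np.stack([series[i:i + L] for i in range(len(series) - L - T + 1)])
        Y = np.stack([series[i + L:i + L + T] for i in range(len(series) - L - T + 1)])
        W, *_ = np.linalg.lstsq(X, Y, rcond=None)
        return W

    def forecast(series, W):
        return series[-L:] @ W        # map the most recent window to T future steps

    train = np.sin(np.arange(2000) / 25.0) + 0.1 * np.random.randn(2000)
    W = fit_linear_forecaster(train)
    pred = forecast(train, W)         # length-T forecast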
Qinqing Zheng, Amy Zhang, Aditya Grover
journal unavailable
This work proposes Online Decision Transformers (ODT), an RL algorithm based on sequence modeling that blends offline pretraining with online finetuning in a unified framework; it is competitive with the state-of-the-art in absolute performance on the D4RL benchmark but shows much more significant gains during the finetuning procedure.