Yifan Yang (杨亦凡)

Biography

Yifan Yang is a Ph.D. student at Shanghai Jiao Tong University (SJTU) and a member of the Cross Media (X-)Language Intelligence Lab (X-LANCE) in the Department of Computer Science and Engineering, under the supervision of Prof. Xie Chen and the leadership of Prof. Kai Yu.

His research focuses on spoken language processing, spanning speech synthesis, speech recognition, speech representation learning, and speech interaction. He has published 10+ first-author papers at top-tier conferences (ICML, ACL Main, ACMMM, ICASSP, Interspeech, and ICME), and has received the Hunyuan Fellowship.

He was a core contributor to the Next-gen Kaldi project led by Dr. Daniel Povey, contributing to the open-source toolkits Icefall and Lhotse. He led the development of the open-source speech dataset GigaSpeech 2, and contributed to Libriheavy and LibriheavyMix. He developed PALLE and IST-LM (Streaming VALL-E), and co-developed FELLE and StreamMel (Streaming MELLE), as part of the VALL-E Family.

Interests

Spoken Language Processing: Text-to-Speech Synthesis and Evaluation, Multilingual Speech Recognition, Speech Representation Learning
Multimodal Interaction: Proactive Interaction, Full-Duplex Interaction
Multimodal Understanding

Education

Ph.D., Computer Science and Technology, Shanghai Jiao Tong University, 2023.09-now
B.E., Computer Science and Technology, Tianjin University, 2019.09-2023.07

GPA: 3.91/4.0, Rank: 1/139. [Transcript]

Experiences

Research Intern, Qwen Omni Team, Alibaba Tongyi Lab, 2026.03.09-now

Investigate multimodal proactive interaction and multimodal understanding.

Advised by Dr. Jin Xu.

Research Intern, Hunyuan Speech Team, Tencent TEG, 2025.08.20-2026.03.06

Investigate speech understanding for speaking style modeling.

Co-advised by Dr. Long Zhou and Xu Tan.

Research Intern, VALL-E Team & CoreAI Speech Team, Microsoft, 2024.03.05-2025.08.10

Investigate advanced language modeling for text-to-speech synthesis and streaming text-to-speech synthesis.

Co-advised by Dr. Shujie Liu and Dr. Jinyu Li.

Machine Learning Engineer Intern, Next-gen Kaldi Team, Xiaomi AI Lab, 2022.11.01-2023.08.28

Investigate advanced and efficient open-source end-to-end automatic speech recognition.

Develop Next-gen Kaldi, including Icefall, Lhotse, and k2.

Advised by Dr. Daniel Povey.

News

[2026.05] 2 papers are accepted by ICML 2026.
[2026.04] 3 papers are accepted by ACL 2026 (3 Main), including 1 Best Paper Candidate.
[2026.03] I join the Qwen Omni team in Alibaba.
[2026.01] 1 paper is accepted by ICASSP 2026.
[2026.01] 1 paper is accepted by IEEE JSTSP (IF=13.6).
[2025.08] I join the Hunyuan speech team in Tencent.
[2025.08] 1 paper is accepted by IEEE SPL.
[2025.07] I am honored to be funded by the CIE-Tencent Doctoral Research Incentive Project.
[2025.07] 3 papers are accepted by ACMMM 2025.
[2025.05] 2 papers are accepted by Interspeech 2025.
[2025.05] 3 papers are accepted by ACL 2025 (2 Main, 1 Findings).
[2025.03] 1 paper is accepted by ICME 2025.
[2024.12] 1 paper is accepted by ICASSP 2025.
[2024.12] 1 paper is accepted by AAAI 2025.
[2024.06] 3 papers are accepted by Interspeech 2024.
[2024.03] I join the speech team in Microsoft.
[2024.01] Zipformer is accepted for oral presentation by ICLR 2024.
[2023.12] 3 papers are accepted by ICASSP 2024.
[2023.09] I start to pursue my Ph.D. at Shanghai Jiao Tong University.
[2023.06] I earn my Bachelor's degree in engineering with an excellent student title.
[2023.05] 2 papers are accepted by Interspeech 2023.
[2022.11] I join the Next-gen Kaldi team in Xiaomi.
[2022.06] I join X-LANCE lab in Shanghai Jiao Tong University.

Research

Selected Publications

Check out full publications on Google Scholar.

Zero-Shot Text-to-Speech Synthesis and Evaluation

Pseudo-Autoregressive Neural Codec Language Models for Efficient Zero-Shot Text-to-Speech Synthesis

Yifan Yang, Shujie Liu, Jinyu Li, Yuxuan Hu, Haibin Wu, Hui Wang, Jianwei Yu, Lingwei Meng, Haiyang Sun, Yanqing Liu, Yan Lu, Kai Yu, Xie Chen

Oral in Proc. ACMMM 2025

Position: Towards Responsible Evaluation for Text-to-Speech

Yifan Yang*, Hui Wang*, Bing Han, Shujie Liu, Jinyu Li, Yong Qin, Xie Chen

Proc. ICML 2026

Towards Fine-Grained and Multi-Granular Contrastive Language-Speech Pre-training

Yifan Yang*, Bing Han*, Hui Wang, Wei Wang, Ziyang Ma, Long Zhou, Zengrui Jin, Guanrou Yang, Tianrui Wang, Xu Tan, Xie Chen

Proc. ACL 2026 Main

[Code] [Model] [Dataset]

Measuring Prosody Diversity in Zero-Shot TTS: A New Metric, Benchmark, and Exploration

Yifan Yang*, Bing Han*, Hui Wang, Long Zhou, Wei Wang, Mingyu Cui, Xu Tan, Xie Chen

Proc. ICASSP 2026

[Code] [Model]

Towards Universal Speech Discrete Tokens: A Case Study for ASR and TTS

Yifan Yang*, Feiyu Shen*, Chenpeng Du, Ziyang Ma, Kai Yu, Daniel Povey, Xie Chen

Oral in Proc. ICASSP 2024

[Code] [Slides]

Interleaved Speech-Text Language Models for Simple Streaming Text-to-Speech Synthesis

Yifan Yang, Shujie Liu, Jinyu Li, Hui Wang, Lingwei Meng, Haiyang Sun, Yuzhe Liang, Ziyang Ma, Yuxuan Hu, Rui Zhao, Jianwei Yu, Yan Lu, Xie Chen

arXiv preprint, 2024

StreamMel: Real-Time Zero-shot Text-to-Speech via Interleaved Continuous Autoregressive Modeling

Hui Wang, Yifan Yang, Shujie Liu, Jinyu Li, Lingwei Meng, Yanqing Liu, Jiaming Zhou, Haoqin Sun, Yan Lu, Yong Qin

IEEE Signal Processing Letters

SpeechLLM-as-Judges: Towards General and Interpretable Speech Quality Evaluation

Hui Wang, Jinghua Zhao, Yifan Yang, Shujie Liu, Junyang Chen, Yanzhe Zhang, Shiwan Zhao, Jinyu Li, Jiaming Zhou, Haoqin Sun, Yan Lu, Yong Qin

Proc. ACL 2026 Main

[Code] [Model] [Dataset]

Speech Recognition

GigaSpeech 2: An Evolving, Large-Scale and Multi-domain ASR Corpus for Low-Resource Languages with Automated Crawling, Transcription and Refinement

Yifan Yang, Zheshu Song, Jianheng Zhuo, Mingyu Cui, Jinpeng Li, Bo Yang, Yexing Du, Ziyang Ma, Xunying Liu, Ziyuan Wang, Ke Li, Shuai Fan, Kai Yu, Wei-Qiang Zhang, Guoguo Chen, Xie Chen

Proc. ACL 2025 Main

[Dataset] [Code] [Slides]

GigaSpeech 2 powers the Typhoon ASR series (Typhoon ASR Real-Time, Typhoon Whisper, etc.), which are state-of-the-art Thai ASR models.

Blank-regularized CTC for Frame Skipping in Neural Transducer

Yifan Yang*, Xiaoyu Yang*, Liyong Guo, Zengwei Yao, Wei Kang, Fangjun Kuang, Long Lin, Xie Chen, Daniel Povey

Proc. Interspeech 2023

[Code]

k2SSL: A Faster and Better Framework for Self-Supervised Speech Representation Learning

Yifan Yang*, Jianheng Zhuo*, Zengrui Jin, Ziyang Ma, Xiaoyu Yang, Zengwei Yao, Liyong Guo, Wei Kang, Fangjun Kuang, Long Lin, Daniel Povey, Xie Chen

Oral in Proc. ICME 2025

[Code] [Slides]

LibriheavyMix: A 20,000-Hour Dataset for Single-Channel Reverberant Multi-Talker Speech Separation, ASR and Speaker Diarization

Zengrui Jin*, Yifan Yang*, Mohan Shi*, Wei Kang, Xiaoyu Yang, Zengwei Yao, Fangjun Kuang, Liyong Guo, Lingwei Meng, Long Lin, Yong Xu, Shi-Xiong Zhang, Daniel Povey

Oral in Proc. Interspeech 2024

[Dataset]

SPEAR: A Unified SSL Framework for Learning Speech and Audio Representations

Xiaoyu Yang, Yifan Yang, Zengrui Jin, Ziyun Cui, Wen Wu, Baoxiang Li, Chao Zhang, Phil Woodland

Proc. ICML 2026

[Model]

VietASR: Achieving Industry-level Vietnamese ASR with 50-hour labeled data and Large-Scale Speech Pretraining

Jianheng Zhuo, Yifan Yang, Yiwen Shao, Yong Xu, Dong Yu, Kai Yu, Xie Chen

Proc. Interspeech 2025

[Code]

Speech Recognition Meets Large Language Model: Benchmarking, Models, and Exploration

Ziyang Ma, Guanrou Yang, Yifan Yang, Zhifu Gao, Jiaming Wang, Zhihao Du, Fan Yu, Qian Chen, Siqi Zheng, Shiliang Zhang, Xie Chen

Oral in Proc. AAAI 2025

[Code]

Libriheavy: A 50,000 hours ASR Corpus with Punctuation Casing and Context

Wei Kang, Xiaoyu Yang, Zengwei Yao, Fangjun Kuang, Yifan Yang, Liyong Guo, Long Lin, Daniel Povey

Oral in Proc. ICASSP 2024

[Dataset] [Code]

Zipformer: A Faster and Better Encoder for Automatic Speech Recognition

Zengwei Yao, Liyong Guo, Xiaoyu Yang, Wei Kang, Fangjun Kuang, Yifan Yang, Zengrui Jin, Long Lin, Daniel Povey

Oral in Proc. ICLR 2024

[Code]

Open-Source Projects

Awards

Hunyuan Fellowship, China Institute of Electronics & Tencent, 2025
BYD Scholarship, BYD, 2025
Chu Xin Scholarship, Tianjin University, 2022
Baosteel Scholarship, Baosteel Education Foundation, 2021
“Bingchang Zhuang” Scholarship, Tianjin University, 2020

Academic Service

Conference Reviewer

International Conference on Machine Learning (Gold Reviewer at ICML 2026)
International Conference on Learning Representations (ICLR 2026, Notable Reviewer at ICLR 2025)
Conference on Neural Information Processing Systems (NeurIPS 2026, 2025)
AAAI Conference on Artificial Intelligence (AAAI 2027, 2026)
ACM International Conference on Multimedia (ACM MM 2026, 2025)
ACL Rolling Review (ACL ARR 2026 May, 2026 March, 2026 January, 2025 October, 2025 May, 2025 February, 2024 December, 2024 October, 2024 June, 2023 October)
International Conference on Acoustics, Speech, and Signal Processing (ICASSP 2026, 2025, 2024)
Conference of the International Speech Communication Association (Interspeech 2026)
IEEE International Conference on Multimedia & Expo (ICME 2026, 2025)
IEEE Spoken Language Technology Workshop (SLT 2026, 2024)
IEEE Automatic Speech Recognition and Understanding Workshop (ASRU 2025)
International Conference on Computational Linguistics (COLING 2025, LREC-COLING 2024)
Conference on Empirical Methods in Natural Language Processing (EMNLP 2022)

Journal Reviewer

IEEE Transactions on Audio, Speech and Language Processing (IEEE TASLP)
IEEE Open Journal of Signal Processing (IEEE OJSP)

Activities

[Invited Talk] Open-source Sharing of F5-TTS and GigaSpeech 2, ModelScope DevCon 2025, 2025.06
[Invited Talk] GigaSpeech 2: An Evolving, Large-Scale and Multi-domain ASR Corpus for Low-Resource Languages with Automated Crawling, Transcription and Refinement, Nanyang Technological University (NTU), 2024.06
CS-BAOYAN Owner, the largest nonprofit CS postgraduate recommendation exchange platform in China, 2022.09-2023.09

Teaching Assistant

SJTU CS1501 Programming