Showing episodes and shows of
Gengyu Wang
Shows
Daily Paper Cast
Step-Audio 2 Technical Report
Upvotes: 42 | cs.CL, cs.SD, eess.AS Authors: Boyong Wu, Chao Yan, Chen Hu, Cheng Yi, Chengli Feng, Fei Tian, Feiyu Shen, Gang Yu, Haoyang Zhang, Jingbei Li, Mingrui Chen, Peng Liu, Wang You, Xiangyu Tony Zhang, Xingyuan Li, Xuerui Yang, Yayue Deng, Yechang Huang, Yuxin Li, Yuxin Zhang, Zhao You, Brian Li, Changyi Wan, Hanpeng Hu, Jiangjie Zhen, Siyu Chen, Song Yuan, Xuelin Zhang, Yimin Jiang, Yu Zhou, Yuxiang Yang, Bingxin Li, Buyun Ma, Changhe Song, Dongqing Pang, Guoqiang Hu, Haiyang Sun, Kang An, Na Wang, Shuli Gao, Wei Ji, Wen Li, Wen Sun, Xuan Wen...
2025-07-24
22 min
Daily Paper Cast
AnyCap Project: A Unified Framework, Dataset, and Benchmark for Controllable Omni-modal Captioning
Upvotes: 30 | cs.CV Authors: Yiming Ren, Zhiqiang Lin, Yu Li, Gao Meng, Weiyun Wang, Junjie Wang, Zicheng Lin, Jifeng Dai, Yujiu Yang, Wenhai Wang, Ruihang Chu Title: AnyCap Project: A Unified Framework, Dataset, and Benchmark for Controllable Omni-modal Captioning Arxiv: http://arxiv.org/abs/2507.12841v1 Abstract: Controllable captioning is essential for precise multimodal alignment and instruction following, yet existing models often lack fine-grained control and reliable evaluation protocols. To address this gap, we present the AnyCap Project, an integrated solution spanning model, dataset, and evaluation. We introduce Any...
2025-07-19
22 min
Daily Paper Cast
Test-Time Scaling with Reflective Generative Model
Upvotes: 68 | cs.LG, cs.CL Authors: Zixiao Wang, Yuxin Wang, Xiaorui Wang, Mengting Xing, Jie Gao, Jianjun Xu, Guangcan Liu, Chenhui Jin, Zhuo Wang, Shengzhuo Zhang, Hongtao Xie Title: Test-Time Scaling with Reflective Generative Model Arxiv: http://arxiv.org/abs/2507.01951v2 Abstract: We introduce our first reflective generative model MetaStone-S1, which obtains OpenAI o3-mini's performance via the new Reflective Generative Form. The new form focuses on high-quality reasoning trajectory selection and contains two novelties: 1) A unified interface for policy and process reward model: we share the bac...
2025-07-15
21 min
Daily Paper Cast
Gemini 2.5: Pushing the Frontier with Advanced Reasoning, Multimodality, Long Context, and Next Generation Agentic Capabilities
Upvotes: 24 | cs.CL, cs.AI Authors: Gheorghe Comanici, Eric Bieber, Mike Schaekermann, Ice Pasupat, Noveen Sachdeva, Inderjit Dhillon, Marcel Blistein, Ori Ram, Dan Zhang, Evan Rosen, Luke Marris, Sam Petulla, Colin Gaffney, Asaf Aharoni, Nathan Lintz, Tiago Cardal Pais, Henrik Jacobsson, Idan Szpektor, Nan-Jiang Jiang, Krishna Haridasan, Ahmed Omran, Nikunj Saunshi, Dara Bahri, Gaurav Mishra, Eric Chu, Toby Boyd, Brad Hekman, Aaron Parisi, Chaoyi Zhang, Kornraphop Kawintiranon, Tania Bedrax-Weiss, Oliver Wang, Ya Xu, Ollie Purkiss, Uri Mendlovic, Ilaï Deutel, Nam Nguyen, Adam Langley, Flip Korn, Lucia Rossazza, Alexandre Ramé, Sagar Waghmare, Helen Miller, Vaishakh Keshava, Ying Jian...
2025-07-15
20 min
Daily Paper Cast
Traceable Evidence Enhanced Visual Grounded Reasoning: Evaluation and Methodology
Upvotes: 37 | cs.CV, cs.AI, cs.CL Authors: Haochen Wang, Xiangtai Li, Zilong Huang, Anran Wang, Jiacong Wang, Tao Zhang, Jiani Zheng, Sule Bai, Zijian Kang, Jiashi Feng, Zhuochen Wang, Zhaoxiang Zhang Title: Traceable Evidence Enhanced Visual Grounded Reasoning: Evaluation and Methodology Arxiv: http://arxiv.org/abs/2507.07999v1 Abstract: Models like OpenAI-o3 pioneer visual grounded reasoning by dynamically referencing visual regions, just like human "thinking with images". However, no benchmark exists to evaluate these capabilities holistically. To bridge this gap, we propose TreeBench (Traceable Evidence Evaluation Benchmark), a d...
2025-07-12
20 min
Daily Paper Cast
4KAgent: Agentic Any Image to 4K Super-Resolution
Upvotes: 56 | cs.CV, eess.IV Authors: Yushen Zuo, Qi Zheng, Mingyang Wu, Xinrui Jiang, Renjie Li, Jian Wang, Yide Zhang, Gengchen Mai, Lihong V. Wang, James Zou, Xiaoyu Wang, Ming-Hsuan Yang, Zhengzhong Tu Title: 4KAgent: Agentic Any Image to 4K Super-Resolution Arxiv: http://arxiv.org/abs/2507.07105v1 Abstract: We present 4KAgent, a unified agentic super-resolution generalist system designed to universally upscale any image to 4K resolution (and even higher, if applied iteratively). Our system can transform images from extremely low resolutions with severe degradations, for example, highly dis...
2025-07-11
26 min
Daily Paper Cast
MemOS: A Memory OS for AI System
Upvotes: 83 | cs.CL Authors: Zhiyu Li, Shichao Song, Chenyang Xi, Hanyu Wang, Chen Tang, Simin Niu, Ding Chen, Jiawei Yang, Chunyu Li, Qingchen Yu, Jihao Zhao, Yezhaohui Wang, Peng Liu, Zehao Lin, Pengyuan Wang, Jiahao Huo, Tianyi Chen, Kai Chen, Kehang Li, Zhen Tao, Junpeng Ren, Huayi Lai, Hao Wu, Bo Tang, Zhenren Wang, Zhaoxin Fan, Ningyu Zhang, Linfeng Zhang, Junchi Yan, Mingchuan Yang, Tong Xu, Wei Xu, Huajun Chen, Haofeng Wang, Hongkang Yang, Wentao Zhang, Zhi-Qin John Xu, Siheng Chen, Feiyu Xiong Title: MemOS: A Memory OS for AI System Arx...
2025-07-09
21 min
Daily Paper Cast
Kwai Keye-VL Technical Report
Upvotes: 97 | cs.CV Authors: Kwai Keye Team, Biao Yang, Bin Wen, Changyi Liu, Chenglong Chu, Chengru Song, Chongling Rao, Chuan Yi, Da Li, Dunju Zang, Fan Yang, Guorui Zhou, Hao Peng, Haojie Ding, Jiaming Huang, Jiangxia Cao, Jiankang Chen, Jingyun Hua, Jin Ouyang, Kaibing Chen, Kaiyu Jiang, Kaiyu Tang, Kun Gai, Shengnan Zhang, Siyang Mao, Sui Huang, Tianke Zhang, Tingting Gao, Wei Chen, Wei Yuan, Xiangyu Wu, Xiao Hu, Xingyu Lu, Yang Zhou, Yi-Fan Zhang, Yiping Yang, Yulong Chen, Zhenhua Wu, Zhenyu Li, Zhixin Ling, Ziming Li, Dehua Ma, Di Xu, Haixuan Gao, Hang Li, Jiawei Guo...
2025-07-04
22 min
Daily Paper Cast
GLM-4.1V-Thinking: Towards Versatile Multimodal Reasoning with Scalable Reinforcement Learning
Upvotes: 141 | cs.CV, cs.AI, cs.LG Authors: GLM-V Team, :, Wenyi Hong, Wenmeng Yu, Xiaotao Gu, Guo Wang, Guobing Gan, Haomiao Tang, Jiale Cheng, Ji Qi, Junhui Ji, Lihang Pan, Shuaiqi Duan, Weihan Wang, Yan Wang, Yean Cheng, Zehai He, Zhe Su, Zhen Yang, Ziyang Pan, Aohan Zeng, Baoxu Wang, Boyan Shi, Changyu Pang, Chenhui Zhang, Da Yin, Fan Yang, Guoqing Chen, Jiazheng Xu, Jiali Chen, Jing Chen, Jinhao Chen, Jinghao Lin, Jinjiang Wang, Junjie Chen, Leqi Lei, Letian Gong, Leyi Pan, Mingzhi Zhang, Qinkai Zheng, Sheng Yang, Shi Zhong, Shiyu Huang, Shuyuan Zhao, Siyan Xue, Sha...
2025-07-03
24 min
Daily Paper Cast
VRBench: A Benchmark for Multi-Step Reasoning in Long Narrative Videos
Upvotes: 29 | cs.CV, cs.AI, cs.MM Authors: Jiashuo Yu, Yue Wu, Meng Chu, Zhifei Ren, Zizheng Huang, Pei Chu, Ruijie Zhang, Yinan He, Qirui Li, Songze Li, Zhenxiang Li, Zhongying Tu, Conghui He, Yu Qiao, Yali Wang, Yi Wang, Limin Wang Title: VRBench: A Benchmark for Multi-Step Reasoning in Long Narrative Videos Arxiv: http://arxiv.org/abs/2506.10857v1 Abstract: We present VRBench, the first long narrative video benchmark crafted for evaluating large models' multi-step reasoning capabilities, addressing limitations in existing evaluations that overlook temporal reasoning and pro...
2025-06-14
22 min
Daily Paper Cast
Autoregressive Semantic Visual Reconstruction Helps VLMs Understand Better
Upvotes: 26 | cs.CV, cs.AI, cs.CL Authors: Dianyi Wang, Wei Song, Yikun Wang, Siyuan Wang, Kaicheng Yu, Zhongyu Wei, Jiaqi Wang Title: Autoregressive Semantic Visual Reconstruction Helps VLMs Understand Better Arxiv: http://arxiv.org/abs/2506.09040v1 Abstract: Typical large vision-language models (LVLMs) apply autoregressive supervision solely to textual sequences, without fully incorporating the visual modality into the learning process. This results in three key limitations: (1) an inability to utilize images without accompanying captions, (2) the risk that captions omit critical visual details, and (3) the challenge that certain vis...
2025-06-12
20 min
Daily Paper Cast
MiniCPM4: Ultra-Efficient LLMs on End Devices
Upvotes: 60 | cs.CL, cs.AI Authors: MiniCPM Team, Chaojun Xiao, Yuxuan Li, Xu Han, Yuzhuo Bai, Jie Cai, Haotian Chen, Wentong Chen, Xin Cong, Ganqu Cui, Ning Ding, Shengdan Fan, Yewei Fang, Zixuan Fu, Wenyu Guan, Yitong Guan, Junshao Guo, Yufeng Han, Bingxiang He, Yuxiang Huang, Cunliang Kong, Qiuzuo Li, Siyuan Li, Wenhao Li, Yanghao Li, Yishan Li, Zhen Li, Dan Liu, Biyuan Lin, Yankai Lin, Xiang Long, Quanyu Lu, Yaxi Lu, Peiyan Luo, Hongya Lyu, Litu Ou, Yinxu Pan, Zekai Qu, Qundong Shi, Zijun Song, Jiayuan Su, Zhou Su, Ao Sun, Xianghui Sun, Peijun Tang, Fan...
2025-06-11
20 min
Daily Paper Cast
Astra: Toward General-Purpose Mobile Robots via Hierarchical Multimodal Learning
Upvotes: 25 | cs.RO, cs.AI Authors: Sheng Chen, Peiyu He, Jiaxin Hu, Ziyang Liu, Yansheng Wang, Tao Xu, Chi Zhang, Chongchong Zhang, Chao An, Shiyu Cai, Duo Cao, Kangping Chen, Shuai Chu, Tianwei Chu, Mingdi Dan, Min Du, Weiwei Fang, Pengyou Fu, Junkai Hu, Xiaowei Jiang, Zhaodi Jiang, Fuxuan Li, Jun Li, Minghui Li, Mingyao Li, Yanchang Li, Zhibin Li, Guangming Liu, Kairui Liu, Lihao Liu, Weizhi Liu, Xiaoshun Liu, Yufei Liu, Yunfei Liu, Qiang Lu, Yuanfei Luo, Xiang Lv, Hongying Ma, Sai Ma, Lingxian Mi, Sha Sa, Hongxiang Shu, Lei Tian, Chengzhi Wang, Jiayu Wang, Kai...
2025-06-11
21 min
Daily Paper Cast
MiMo-VL Technical Report
Upvotes: 58 | cs.CL Authors: Xiaomi LLM-Core Team, :, Zihao Yue, Zhenru Lin, Yifan Song, Weikun Wang, Shuhuai Ren, Shuhao Gu, Shicheng Li, Peidian Li, Liang Zhao, Lei Li, Kainan Bao, Hao Tian, Hailin Zhang, Gang Wang, Dawei Zhu, Cici, Chenhong He, Bowen Ye, Bowen Shen, Zihan Zhang, Zihan Jiang, Zhixian Zheng, Zhichao Song, Zhenbo Luo, Yue Yu, Yudong Wang, Yuanyuan Tian, Yu Tu, Yihan Yan, Yi Huang, Xu Wang, Xinzhe Xu, Xingchen Song, Xing Zhang, Xing Yong, Xin Zhang, Xiangwei Deng, Wenyu Yang, Wenhan Ma, Weiwei Lv, Weiji Zhuang, Wei Liu, Sirui Deng, Shuo Liu, Shimao Chen, Shi...
2025-06-06
19 min
Daily Paper Cast
Shifting AI Efficiency From Model-Centric to Data-Centric Compression
Upvotes: 124 | cs.CL, cs.AI, cs.CV Authors: Xuyang Liu, Zichen Wen, Shaobo Wang, Junjie Chen, Zhishan Tao, Yubo Wang, Xiangqi Jin, Chang Zou, Yiyu Wang, Chenfei Liao, Xu Zheng, Honggang Chen, Weijia Li, Xuming Hu, Conghui He, Linfeng Zhang Title: Shifting AI Efficiency From Model-Centric to Data-Centric Compression Arxiv: http://arxiv.org/abs/2505.19147v1 Abstract: The rapid advancement of large language models (LLMs) and multi-modal LLMs (MLLMs) has historically relied on model-centric scaling through increasing parameter counts from millions to hundreds of billions to drive performance gai...
2025-05-28
22 min
Daily Paper Cast
Scaling Computer-Use Grounding via User Interface Decomposition and Synthesis
Upvotes: 34 | cs.AI, cs.CL, cs.CV, cs.HC Authors: Tianbao Xie, Jiaqi Deng, Xiaochuan Li, Junlin Yang, Haoyuan Wu, Jixuan Chen, Wenjing Hu, Xinyuan Wang, Yuhui Xu, Zekun Wang, Yiheng Xu, Junli Wang, Doyen Sahoo, Tao Yu, Caiming Xiong Title: Scaling Computer-Use Grounding via User Interface Decomposition and Synthesis Arxiv: http://arxiv.org/abs/2505.13227v1 Abstract: Graphical user interface (GUI) grounding, the ability to map natural language instructions to specific actions on graphical user interfaces, remains a critical bottleneck in computer use agent development. Current benchmarks ove...
2025-05-21
22 min
Daily Paper Cast
Qwen3 Technical Report
Upvotes: 117 | cs.CL Authors: An Yang, Anfeng Li, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chang Gao, Chengen Huang, Chenxu Lv, Chujie Zheng, Dayiheng Liu, Fan Zhou, Fei Huang, Feng Hu, Hao Ge, Haoran Wei, Huan Lin, Jialong Tang, Jian Yang, Jianhong Tu, Jianwei Zhang, Jianxin Yang, Jiaxi Yang, Jing Zhou, Jingren Zhou, Junyang Lin, Kai Dang, Keqin Bao, Kexin Yang, Le Yu, Lianghao Deng, Mei Li, Mingfeng Xue, Mingze Li, Pei Zhang, Peng Wang, Qin Zhu, Rui Men, Ruize Gao, Shixuan Liu, Shuang Luo, Tianhao Li, Tianyi Tang, Wenbiao Yin, Xingzhang Ren, Xin...
2025-05-20
21 min
Daily Paper Cast
Seed1.5-VL Technical Report
Upvotes: 86 | cs.CV, cs.AI Authors: Dong Guo, Faming Wu, Feida Zhu, Fuxing Leng, Guang Shi, Haobin Chen, Haoqi Fan, Jian Wang, Jianyu Jiang, Jiawei Wang, Jingji Chen, Jingjia Huang, Kang Lei, Liping Yuan, Lishu Luo, Pengfei Liu, Qinghao Ye, Rui Qian, Shen Yan, Shixiong Zhao, Shuai Peng, Shuangye Li, Sihang Yuan, Sijin Wu, Tianheng Cheng, Weiwei Liu, Wenqian Wang, Xianhan Zeng, Xiao Liu, Xiaobo Qin, Xiaohan Ding, Xiaojun Xiao, Xiaoying Zhang, Xuanwei Zhang, Xuehan Xiong, Yanghua Peng, Yangrui Chen, Yanwei Li, Yanxu Hu, Yi Lin, Yiyuan Hu, Yiyuan Zhang, Youbin Wu, Yu Li, Yudong Liu, Yue...
2025-05-14
20 min
Daily Paper Cast
MiMo: Unlocking the Reasoning Potential of Language Model -- From Pretraining to Posttraining
Upvotes: 53 | cs.CL, cs.AI, cs.LG Authors: Xiaomi LLM-Core Team, :, Bingquan Xia, Bowen Shen, Cici, Dawei Zhu, Di Zhang, Gang Wang, Hailin Zhang, Huaqiu Liu, Jiebao Xiao, Jinhao Dong, Liang Zhao, Peidian Li, Peng Wang, Shihua Yu, Shimao Chen, Weikun Wang, Wenhan Ma, Xiangwei Deng, Yi Huang, Yifan Song, Zihan Jiang, Bowen Ye, Can Cai, Chenhong He, Dong Zhang, Duo Zhang, Guoan Wang, Hao Tian, Haochen Zhao, Heng Qu, Hongshen Xu, Jun Shi, Kainan Bao, QingKai Fang, Kang Zhou, Kangyang Zhou, Lei Li, Menghang Zhu, Nuo Chen, Qiantong Wang, Shaohui Liu, Shicheng Li, Shuhao Gu, Shu...
2025-05-14
21 min
Daily Paper Cast
Perception, Reason, Think, and Plan: A Survey on Large Multimodal Reasoning Models
Upvotes: 79 | cs.CV, cs.CL Authors: Yunxin Li, Zhenyu Liu, Zitao Li, Xuanyu Zhang, Zhenran Xu, Xinyu Chen, Haoyuan Shi, Shenyuan Jiang, Xintong Wang, Jifang Wang, Shouzheng Huang, Xinping Zhao, Borui Jiang, Lanqing Hong, Longyue Wang, Zhuotao Tian, Baoxing Huai, Wenhan Luo, Weihua Luo, Zheng Zhang, Baotian Hu, Min Zhang Title: Perception, Reason, Think, and Plan: A Survey on Large Multimodal Reasoning Models Arxiv: http://arxiv.org/abs/2505.04921v1 Abstract: Reasoning lies at the heart of intelligence, shaping the ability to make decisions, draw conclusions, and generalize acr...
2025-05-10
23 min
Daily Paper Cast
Unified Multimodal Chain-of-Thought Reward Model through Reinforcement Fine-Tuning
Upvotes: 67 | cs.CV Authors: Yibin Wang, Zhimin Li, Yuhang Zang, Chunyu Wang, Qinglin Lu, Cheng Jin, Jiaqi Wang Title: Unified Multimodal Chain-of-Thought Reward Model through Reinforcement Fine-Tuning Arxiv: http://arxiv.org/abs/2505.03318v1 Abstract: Recent advances in multimodal Reward Models (RMs) have shown significant promise in delivering reward signals to align vision models with human preferences. However, current RMs are generally restricted to providing direct responses or engaging in shallow reasoning processes with limited depth, often leading to inaccurate reward signals. We posit that incorporating explicit long cha...
2025-05-08
22 min
Daily Paper Cast
RM-R1: Reward Modeling as Reasoning
Upvotes: 48 | cs.CL, cs.AI, cs.LG Authors: Xiusi Chen, Gaotang Li, Ziqi Wang, Bowen Jin, Cheng Qian, Yu Wang, Hongru Wang, Yu Zhang, Denghui Zhang, Tong Zhang, Hanghang Tong, Heng Ji Title: RM-R1: Reward Modeling as Reasoning Arxiv: http://arxiv.org/abs/2505.02387v1 Abstract: Reward modeling is essential for aligning large language models (LLMs) with human preferences, especially through reinforcement learning from human feedback (RLHF). To provide accurate reward signals, a reward model (RM) should stimulate deep thinking and conduct interpretable reasoning before assigning a score or...
2025-05-07
23 min
Daily Paper Cast
Reinforcement Learning for Reasoning in Large Language Models with One Training Example
Upvotes: 49 | cs.LG, cs.AI, cs.CL Authors: Yiping Wang, Qing Yang, Zhiyuan Zeng, Liliang Ren, Lucas Liu, Baolin Peng, Hao Cheng, Xuehai He, Kuan Wang, Jianfeng Gao, Weizhu Chen, Shuohang Wang, Simon Shaolei Du, Yelong Shen Title: Reinforcement Learning for Reasoning in Large Language Models with One Training Example Arxiv: http://arxiv.org/abs/2504.20571v1 Abstract: We show that reinforcement learning with verifiable reward using one training example (1-shot RLVR) is effective in incentivizing the math reasoning capabilities of large language models (LLMs). Applying RLVR to the...
2025-05-01
22 min
Daily Paper Cast
The Bitter Lesson Learned from 2,000+ Multilingual Benchmarks
Upvotes: 51 | cs.CL Authors: Minghao Wu, Weixuan Wang, Sinuo Liu, Huifeng Yin, Xintong Wang, Yu Zhao, Chenyang Lyu, Longyue Wang, Weihua Luo, Kaifu Zhang Title: The Bitter Lesson Learned from 2,000+ Multilingual Benchmarks Arxiv: http://arxiv.org/abs/2504.15521v1 Abstract: As large language models (LLMs) continue to advance in linguistic capabilities, robust multilingual evaluation has become essential for promoting equitable technological progress. This position paper examines over 2,000 multilingual (non-English) benchmarks from 148 countries, published between 2021 and 2024, to evaluate past, present, and future practices in multilingual benchmarking. Our findings reveal tha...
2025-04-24
22 min
Daily Paper Cast
InternVL3: Exploring Advanced Training and Test-Time Recipes for Open-Source Multimodal Models
Upvotes: 172 | cs.CV Authors: Jinguo Zhu, Weiyun Wang, Zhe Chen, Zhaoyang Liu, Shenglong Ye, Lixin Gu, Yuchen Duan, Hao Tian, Weijie Su, Jie Shao, Zhangwei Gao, Erfei Cui, Yue Cao, Yangzhou Liu, Xingguang Wei, Hongjie Zhang, Haomin Wang, Weiye Xu, Hao Li, Jiahao Wang, Dengnian Chen, Songze Li, Yinan He, Tan Jiang, Jiapeng Luo, Yi Wang, Conghui He, Botian Shi, Xingcheng Zhang, Wenqi Shao, Junjun He, Yingtong Xiong, Wenwen Qu, Peng Sun, Penglong Jiao, Han Lv, Lijun Wu, Kaipeng Zhang, Huipeng Deng, Jiaye Ge, Kai Chen, Limin Wang, Min Dou, Lewei Lu, Xizhou Zhu, Tong Lu, Dah...
2025-04-16
22 min
Daily Paper Cast
Seaweed-7B: Cost-Effective Training of Video Generation Foundation Model
Upvotes: 83 | cs.CV, cs.AI Authors: Team Seawead, Ceyuan Yang, Zhijie Lin, Yang Zhao, Shanchuan Lin, Zhibei Ma, Haoyuan Guo, Hao Chen, Lu Qi, Sen Wang, Feng Cheng, Feilong Zuo Xuejiao Zeng, Ziyan Yang, Fangyuan Kong, Zhiwu Qing, Fei Xiao, Meng Wei, Tuyen Hoang, Siyu Zhang, Peihao Zhu, Qi Zhao, Jiangqiao Yan, Liangke Gui, Sheng Bi, Jiashi Li, Yuxi Ren, Rui Wang, Huixia Li, Xuefeng Xiao, Shu Liu, Feng Ling, Heng Zhang, Houmin Wei, Huafeng Kuang, Jerry Duncan, Junda Zhang, Junru Zheng, Li Sun, Manlin Zhang, Renfei Sun, Xiaobin Zhuang, Xiaojie Li, Xin Xia, Xuyan Chi, Yan...
2025-04-15
22 min
Daily Paper Cast
Kimi-VL Technical Report
Upvotes: 71 | cs.CV Authors: Kimi Team, Angang Du, Bohong Yin, Bowei Xing, Bowen Qu, Bowen Wang, Cheng Chen, Chenlin Zhang, Chenzhuang Du, Chu Wei, Congcong Wang, Dehao Zhang, Dikang Du, Dongliang Wang, Enming Yuan, Enzhe Lu, Fang Li, Flood Sung, Guangda Wei, Guokun Lai, Han Zhu, Hao Ding, Hao Hu, Hao Yang, Hao Zhang, Haoning Wu, Haotian Yao, Haoyu Lu, Heng Wang, Hongcheng Gao, Huabin Zheng, Jiaming Li, Jianlin Su, Jianzhou Wang, Jiaqi Deng, Jiezhong Qiu, Jin Xie, Jinhong Wang, Jingyuan Liu, Junjie Yan, Kun Ouyang, Liang Chen, Lin Sui, Longhui Yu, Mengfan Dong, Mengnan Dong, Nuo...
2025-04-12
23 min
Daily Paper Cast
A Unified Agentic Framework for Evaluating Conditional Image Generation
Upvotes: 25 | cs.CV, cs.CL Authors: Jifang Wang, Xue Yang, Longyue Wang, Zhenran Xu, Yiyu Wang, Yaowei Wang, Weihua Luo, Kaifu Zhang, Baotian Hu, Min Zhang Title: A Unified Agentic Framework for Evaluating Conditional Image Generation Arxiv: http://arxiv.org/abs/2504.07046v1 Abstract: Conditional image generation has gained significant attention for its ability to personalize content. However, the field faces challenges in developing task-agnostic, reliable, and explainable evaluation metrics. This paper introduces CIGEval, a unified agentic framework for comprehensive evaluation of conditional image generation tasks. CIGEval utilizes lar...
2025-04-11
21 min
Daily Paper Cast
COIG-P: A High-Quality and Large-Scale Chinese Preference Dataset for Alignment with Human Values
Upvotes: 36 | cs.CL Authors: M-A-P Team, Siwei Wu, Jincheng Ren, Xinrun Du, Shuyue Guo, Xingwei Qu, Yiming Liang, Jie Liu, Yunwen Li, Tianyu Zheng, Boyu Feng, Huaqing Yuan, Zenith Wang, Jiaheng Liu, Wenhao Huang, Chenglin Cai, Haoran Que, Jian Yang, Yuelin Bai, Zekun Moore Wang, Zhouliang Yu, Qunshu Lin, Ding Pan, Yuchen Jiang, Tiannan Wang, Wangchunshu Zhou, Shenzhi Wang, Xingyuan Bu, Minghao Liu, Guoyin Wang, Ge Zhang, Chenghua Lin Title: COIG-P: A High-Quality and Large-Scale Chinese Preference Dataset for Alignment with Human Values Arxiv: http://arxiv.org/abs/2504.05535v1 A...
2025-04-10
21 min
Daily Paper Cast
Advances and Challenges in Foundation Agents: From Brain-Inspired Intelligence to Evolutionary, Collaborative, and Safe Systems
Upvotes: 98 | cs.AI Authors: Bang Liu, Xinfeng Li, Jiayi Zhang, Jinlin Wang, Tanjin He, Sirui Hong, Hongzhang Liu, Shaokun Zhang, Kaitao Song, Kunlun Zhu, Yuheng Cheng, Suyuchen Wang, Xiaoqiang Wang, Yuyu Luo, Haibo Jin, Peiyan Zhang, Ollie Liu, Jiaqi Chen, Huan Zhang, Zhaoyang Yu, Haochen Shi, Boyan Li, Dekun Wu, Fengwei Teng, Xiaojun Jia, Jiawei Xu, Jinyu Xiang, Yizhang Lin, Tianming Liu, Tongliang Liu, Yu Su, Huan Sun, Glen Berseth, Jianyun Nie, Ian Foster, Logan Ward, Qingyun Wu, Yu Gu, Mingchen Zhuge, Xiangru Tang, Haohan Wang, Jiaxuan You, Chi Wang, Jian Pei, Qiang Yang, Xiaoliang Qi, Che...
2025-04-05
20 min
Daily Paper Cast
EgoLife: Towards Egocentric Life Assistant
Upvotes: 21 | cs.CV Authors: Jingkang Yang, Shuai Liu, Hongming Guo, Yuhao Dong, Xiamengwei Zhang, Sicheng Zhang, Pengyun Wang, Zitang Zhou, Binzhu Xie, Ziyue Wang, Bei Ouyang, Zhengyu Lin, Marco Cominelli, Zhongang Cai, Yuanhan Zhang, Peiyuan Zhang, Fangzhou Hong, Joerg Widmer, Francesco Gringoli, Lei Yang, Bo Li, Ziwei Liu Title: EgoLife: Towards Egocentric Life Assistant Arxiv: http://arxiv.org/abs/2503.03803v1 Abstract: We introduce EgoLife, a project to develop an egocentric life assistant that accompanies and enhances personal efficiency through AI-powered wearable glasses. To lay the foundation for thi...
2025-03-08
22 min
Daily Paper Cast
Phi-4-Mini Technical Report: Compact yet Powerful Multimodal Language Models via Mixture-of-LoRAs
Upvotes: 42 | cs.CL, cs.AI, cs.LG Authors: Abdelrahman Abouelenin, Atabak Ashfaq, Adam Atkinson, Hany Awadalla, Nguyen Bach, Jianmin Bao, Alon Benhaim, Martin Cai, Vishrav Chaudhary, Congcong Chen, Dong Chen, Dongdong Chen, Junkun Chen, Weizhu Chen, Yen-Chun Chen, Yi-ling Chen, Qi Dai, Xiyang Dai, Ruchao Fan, Mei Gao, Min Gao, Amit Garg, Abhishek Goswami, Junheng Hao, Amr Hendy, Yuxuan Hu, Xin Jin, Mahmoud Khademi, Dongwoo Kim, Young Jin Kim, Gina Lee, Jinyu Li, Yunsheng Li, Chen Liang, Xihui Lin, Zeqi Lin, Mengchen Liu, Yang Liu, Gilsinia Lopez, Chong Luo, Piyush Madan, Vadim Mazalov, Ali Mousavi, Anh Ngu...
2025-03-05
25 min
Daily Paper Cast
SuperGPQA: Scaling LLM Evaluation across 285 Graduate Disciplines
Upvotes: 81 | cs.CL Authors: M-A-P Team, Xinrun Du, Yifan Yao, Kaijing Ma, Bingli Wang, Tianyu Zheng, Kang Zhu, Minghao Liu, Yiming Liang, Xiaolong Jin, Zhenlin Wei, Chujie Zheng, Kaixing Deng, Shuyue Guo, Shian Jia, Sichao Jiang, Yiyan Liao, Rui Li, Qinrui Li, Sirun Li, Yizhi Li, Yunwen Li, Dehua Ma, Yuansheng Ni, Haoran Que, Qiyao Wang, Zhoufutu Wen, Siwei Wu, Tianshun Xing, Ming Xu, Zhenzhu Yang, Zekun Moore Wang, Junting Zhou, Yuelin Bai, Xingyuan Bu, Chenglin Cai, Liang Chen, Yifan Chen, Chengtuo Cheng, Tianhao Cheng, Keyi Ding, Siming Huang, Yun Huang, Yaoru Li, Yizhe Li, Zhaoqun Li...
2025-02-22
24 min
Daily Paper Cast
Qwen2.5-VL Technical Report
Upvotes: 97 | cs.CV, cs.CL Authors: Shuai Bai, Keqin Chen, Xuejing Liu, Jialin Wang, Wenbin Ge, Sibo Song, Kai Dang, Peng Wang, Shijie Wang, Jun Tang, Humen Zhong, Yuanzhi Zhu, Mingkun Yang, Zhaohai Li, Jianqiang Wan, Pengfei Wang, Wei Ding, Zheren Fu, Yiheng Xu, Jiabo Ye, Xi Zhang, Tianbao Xie, Zesen Cheng, Hang Zhang, Zhibo Yang, Haiyang Xu, Junyang Lin Title: Qwen2.5-VL Technical Report Arxiv: http://arxiv.org/abs/2502.13923v1 Abstract: We introduce Qwen2.5-VL, the latest flagship model of Qwen vision-language series, which demonstrates significant adv...
2025-02-21
21 min
Daily Paper Cast
Autellix: An Efficient Serving Engine for LLM Agents as General Programs
Upvotes: 15 | cs.LG, cs.AI, cs.DC Authors: Michael Luo, Xiaoxiang Shi, Colin Cai, Tianjun Zhang, Justin Wong, Yichuan Wang, Chi Wang, Yanping Huang, Zhifeng Chen, Joseph E. Gonzalez, Ion Stoica Title: Autellix: An Efficient Serving Engine for LLM Agents as General Programs Arxiv: http://arxiv.org/abs/2502.13965v1 Abstract: Large language model (LLM) applications are evolving beyond simple chatbots into dynamic, general-purpose agentic programs, which scale LLM calls and output tokens to help AI agents reason, explore, and solve complex tasks. However, existing LLM serving systems ign...
2025-02-21
22 min
Daily Paper Cast
Native Sparse Attention: Hardware-Aligned and Natively Trainable Sparse Attention
Upvotes: 68 | cs.CL, cs.AI, cs.LG Authors: Jingyang Yuan, Huazuo Gao, Damai Dai, Junyu Luo, Liang Zhao, Zhengyan Zhang, Zhenda Xie, Y. X. Wei, Lean Wang, Zhiping Xiao, Yuqing Wang, Chong Ruan, Ming Zhang, Wenfeng Liang, Wangding Zeng Title: Native Sparse Attention: Hardware-Aligned and Natively Trainable Sparse Attention Arxiv: http://arxiv.org/abs/2502.11089v1 Abstract: Long-context modeling is crucial for next-generation language models, yet the high computational cost of standard attention mechanisms poses significant computational challenges. Sparse attention offers a promising direction for improving efficiency while mai...
2025-02-19
23 min
Daily Paper Cast
The Danger of Overthinking: Examining the Reasoning-Action Dilemma in Agentic Tasks
Upvotes: 41 | cs.AI Authors: Alejandro Cuadron, Dacheng Li, Wenjie Ma, Xingyao Wang, Yichuan Wang, Siyuan Zhuang, Shu Liu, Luis Gaspar Schroeder, Tian Xia, Huanzhi Mao, Nicholas Thumiger, Aditya Desai, Ion Stoica, Ana Klimovic, Graham Neubig, Joseph E. Gonzalez Title: The Danger of Overthinking: Examining the Reasoning-Action Dilemma in Agentic Tasks Arxiv: http://arxiv.org/abs/2502.08235v1 Abstract: Large Reasoning Models (LRMs) represent a breakthrough in AI problem-solving capabilities, but their effectiveness in interactive environments can be limited. This paper introduces and analyzes overthinking in LRMs. A phenomenon whe...
2025-02-18
27 min
Daily Paper Cast
Step-Video-T2V Technical Report: The Practice, Challenges, and Future of Video Foundation Model
Upvotes: 38 | cs.CV, cs.CL Authors: Guoqing Ma, Haoyang Huang, Kun Yan, Liangyu Chen, Nan Duan, Shengming Yin, Changyi Wan, Ranchen Ming, Xiaoniu Song, Xing Chen, Yu Zhou, Deshan Sun, Deyu Zhou, Jian Zhou, Kaijun Tan, Kang An, Mei Chen, Wei Ji, Qiling Wu, Wen Sun, Xin Han, Yanan Wei, Zheng Ge, Aojie Li, Bin Wang, Bizhu Huang, Bo Wang, Brian Li, Changxing Miao, Chen Xu, Chenfei Wu, Chenguang Yu, Dapeng Shi, Dingyuan Hu, Enle Liu, Gang Yu, Ge Yang, Guanzhe Huang, Gulin Yan, Haiyang Feng, Hao Nie, Haonan Jia, Hanpeng Hu, Hanqi Chen, Haolong Yan, Hen...
2025-02-18
23 min
Daily Paper Cast
MM-RLHF: The Next Step Forward in Multimodal LLM Alignment
Upvotes: 22 | cs.CL, cs.CV Authors: Yi-Fan Zhang, Tao Yu, Haochen Tian, Chaoyou Fu, Peiyan Li, Jianshu Zeng, Wulin Xie, Yang Shi, Huanyu Zhang, Junkang Wu, Xue Wang, Yibo Hu, Bin Wen, Fan Yang, Zhang Zhang, Tingting Gao, Di Zhang, Liang Wang, Rong Jin, Tieniu Tan Title: MM-RLHF: The Next Step Forward in Multimodal LLM Alignment Arxiv: http://arxiv.org/abs/2502.10391v1 Abstract: Despite notable advancements in Multimodal Large Language Models (MLLMs), most state-of-the-art models have not undergone thorough alignment with human preferences. This gap exists because cur...
2025-02-18
23 min
Daily Paper Cast
EmbodiedBench: Comprehensive Benchmarking Multi-modal Large Language Models for Vision-Driven Embodied Agents
Upvotes: 20 | cs.AI, cs.CL, cs.CV Authors: Rui Yang, Hanyang Chen, Junyu Zhang, Mark Zhao, Cheng Qian, Kangrui Wang, Qineng Wang, Teja Venkat Koripella, Marziyeh Movahedi, Manling Li, Heng Ji, Huan Zhang, Tong Zhang Title: EmbodiedBench: Comprehensive Benchmarking Multi-modal Large Language Models for Vision-Driven Embodied Agents Arxiv: http://arxiv.org/abs/2502.09560v1 Abstract: Leveraging Multi-modal Large Language Models (MLLMs) to create embodied agents offers a promising avenue for tackling real-world tasks. While language-centric embodied agents have garnered substantial attention, MLLM-based embodied agents remain underexplored due to the...
2025-02-15
21 min
Daily Paper Cast
Exploring the Potential of Encoder-free Architectures in 3D LMMs
Upvotes: 17 | cs.CV, cs.AI, cs.CL Authors: Yiwen Tang, Zoey Guo, Zhuhao Wang, Ray Zhang, Qizhi Chen, Junli Liu, Delin Qu, Zhigang Wang, Dong Wang, Xuelong Li, Bin Zhao Title: Exploring the Potential of Encoder-free Architectures in 3D LMMs Arxiv: http://arxiv.org/abs/2502.09620v1 Abstract: Encoder-free architectures have been preliminarily explored in the 2D visual domain, yet it remains an open question whether they can be effectively applied to 3D understanding scenarios. In this paper, we present the first comprehensive investigation into the potential of enc...
2025-02-15
18 min
Daily Paper Cast
CoSER: Coordinating LLM-Based Persona Simulation of Established Roles
Upvotes: 16 | cs.CL, cs.AI Authors: Xintao Wang, Heng Wang, Yifei Zhang, Xinfeng Yuan, Rui Xu, Jen-tse Huang, Siyu Yuan, Haoran Guo, Jiangjie Chen, Wei Wang, Yanghua Xiao, Shuchang Zhou Title: CoSER: Coordinating LLM-Based Persona Simulation of Established Roles Arxiv: http://arxiv.org/abs/2502.09082v1 Abstract: Role-playing language agents (RPLAs) have emerged as promising applications of large language models (LLMs). However, simulating established characters presents a challenging task for RPLAs, due to the lack of authentic character datasets and nuanced evaluation methods using such data. In this pap...
2025-02-15
22 min
Daily Paper Cast
TextAtlas5M: A Large-scale Dataset for Dense Text Image Generation
Upvotes: 35 | cs.CV Authors: Alex Jinpeng Wang, Dongxing Mao, Jiawei Zhang, Weiming Han, Zhuobai Dong, Linjie Li, Yiqi Lin, Zhengyuan Yang, Libo Qin, Fuwei Zhang, Lijuan Wang, Min Li Title: TextAtlas5M: A Large-scale Dataset for Dense Text Image Generation Arxiv: http://arxiv.org/abs/2502.07870v1 Abstract: Text-conditioned image generation has gained significant attention in recent years and are processing increasingly longer and comprehensive text prompt. In everyday life, dense and intricate text appears in contexts like advertisements, infographics, and signage, where the integration of both text and...
2025-02-14
18 min
Daily Paper Cast
CineMaster: A 3D-Aware and Controllable Framework for Cinematic Text-to-Video Generation
Upvotes: 29 | cs.CV Authors: Qinghe Wang, Yawen Luo, Xiaoyu Shi, Xu Jia, Huchuan Lu, Tianfan Xue, Xintao Wang, Pengfei Wan, Di Zhang, Kun Gai Title: CineMaster: A 3D-Aware and Controllable Framework for Cinematic Text-to-Video Generation Arxiv: http://arxiv.org/abs/2502.08639v1 Abstract: In this work, we present CineMaster, a novel framework for 3D-aware and controllable text-to-video generation. Our goal is to empower users with comparable controllability as professional film directors: precise placement of objects within the scene, flexible manipulation of both objects and camera in 3D space, and...
2025-02-14
23 min
Daily Paper Cast
Enhance-A-Video: Better Generated Video for Free
Upvotes: 14 | cs.CV Authors: Yang Luo, Xuanlei Zhao, Mengzhao Chen, Kaipeng Zhang, Wenqi Shao, Kai Wang, Zhangyang Wang, Yang You Title: Enhance-A-Video: Better Generated Video for Free Arxiv: http://arxiv.org/abs/2502.07508v1 Abstract: DiT-based video generation has achieved remarkable results, but research into enhancing existing models remains relatively unexplored. In this work, we introduce a training-free approach to enhance the coherence and quality of DiT-based generated videos, named Enhance-A-Video. The core idea is enhancing the cross-frame correlations based on non-diagonal temporal attention distributions. Thanks to its sim...
2025-02-13
20 min
Daily Paper Cast
ScoreFlow: Mastering LLM Agent Workflows via Score-based Preference Optimization
🤗 Upvotes: 12 | cs.CL Authors: Yinjie Wang, Ling Yang, Guohao Li, Mengdi Wang, Bryon Aragam Title: ScoreFlow: Mastering LLM Agent Workflows via Score-based Preference Optimization Arxiv: http://arxiv.org/abs/2502.04306v1 Abstract: Recent research has leveraged large language model multi-agent systems for complex problem-solving while trying to reduce the manual effort required to build them, driving the development of automated agent workflow optimization methods. However, existing methods remain inflexible due to representational limitations, a lack of adaptability, and poor scalability when relying on discrete optimization techniques. We address the...
2025-02-08
20 min
Daily Paper Cast
Process Reinforcement through Implicit Rewards
🤗 Upvotes: 44 | cs.LG, cs.AI, cs.CL Authors: Ganqu Cui, Lifan Yuan, Zefan Wang, Hanbin Wang, Wendi Li, Bingxiang He, Yuchen Fan, Tianyu Yu, Qixin Xu, Weize Chen, Jiarui Yuan, Huayu Chen, Kaiyan Zhang, Xingtai Lv, Shuo Wang, Yuan Yao, Xu Han, Hao Peng, Yu Cheng, Zhiyuan Liu, Maosong Sun, Bowen Zhou, Ning Ding Title: Process Reinforcement through Implicit Rewards Arxiv: http://arxiv.org/abs/2502.01456v1 Abstract: Dense process rewards have proven a more effective alternative to the sparse outcome-level rewards in the inference-time scaling of large lan...
2025-02-05
21 min
Daily Paper Cast
AlignVLM: Bridging Vision and Language Latent Spaces for Multimodal Understanding
🤗 Upvotes: 25 | cs.CL Authors: Ahmed Masry, Juan A. Rodriguez, Tianyu Zhang, Suyuchen Wang, Chao Wang, Aarash Feizi, Akshay Kalkunte Suresh, Abhay Puri, Xiangru Jian, Pierre-André Noël, Sathwik Tejaswi Madhusudhan, Marco Pedersoli, Bang Liu, Nicolas Chapados, Yoshua Bengio, Enamul Hoque, Christopher Pal, Issam H. Laradji, David Vazquez, Perouz Taslakian, Spandana Gella, Sai Rajeswar Title: AlignVLM: Bridging Vision and Language Latent Spaces for Multimodal Understanding Arxiv: http://arxiv.org/abs/2502.01341v1 Abstract: Aligning visual features with language embeddings is a key challenge in vision-language models (VLMs). The performance of su...
2025-02-05
23 min
Daily Paper Cast
SafeRAG: Benchmarking Security in Retrieval-Augmented Generation of Large Language Model
🤗 Upvotes: 25 | cs.CR, cs.AI, cs.IR Authors: Xun Liang, Simin Niu, Zhiyu Li, Sensen Zhang, Hanyu Wang, Feiyu Xiong, Jason Zhaoxin Fan, Bo Tang, Shichao Song, Mengwei Wang, Jiawei Yang Title: SafeRAG: Benchmarking Security in Retrieval-Augmented Generation of Large Language Model Arxiv: http://arxiv.org/abs/2501.18636v1 Abstract: The indexing-retrieval-generation paradigm of retrieval-augmented generation (RAG) has been highly successful in solving knowledge-intensive tasks by integrating external knowledge into large language models (LLMs). However, the incorporation of external and unverified knowledge increases the vulnerability of LLMs because att...
2025-02-05
23 min
Daily Paper Cast
Thoughts Are All Over the Place: On the Underthinking of o1-Like LLMs
🤗 Upvotes: 22 | cs.CL Authors: Yue Wang, Qiuzhi Liu, Jiahao Xu, Tian Liang, Xingyu Chen, Zhiwei He, Linfeng Song, Dian Yu, Juntao Li, Zhuosheng Zhang, Rui Wang, Zhaopeng Tu, Haitao Mi, Dong Yu Title: Thoughts Are All Over the Place: On the Underthinking of o1-Like LLMs Arxiv: http://arxiv.org/abs/2501.18585v1 Abstract: Large language models (LLMs) such as OpenAI's o1 have demonstrated remarkable abilities in complex reasoning tasks by scaling test-time compute and exhibiting human-like deep thinking. However, we identify a phenomenon we term underthinking, where o1...
2025-02-01
23 min
Daily Paper Cast
Emilia: A Large-Scale, Extensive, Multilingual, and Diverse Dataset for Speech Generation
🤗 Upvotes: 11 | cs.SD, cs.CL, eess.AS Authors: Haorui He, Zengqiang Shang, Chaoren Wang, Xuyuan Li, Yicheng Gu, Hua Hua, Liwei Liu, Chen Yang, Jiaqi Li, Peiyang Shi, Yuancheng Wang, Kai Chen, Pengyuan Zhang, Zhizheng Wu Title: Emilia: A Large-Scale, Extensive, Multilingual, and Diverse Dataset for Speech Generation Arxiv: http://arxiv.org/abs/2501.15907v1 Abstract: Recent advancements in speech generation have been driven by the large-scale training datasets. However, current models fall short of capturing the spontaneity and variability inherent in real-world human speech, due to their rel...
2025-01-29
22 min
Daily Paper Cast
Humanity's Last Exam
🤗 Upvotes: 33 | cs.LG, cs.AI, cs.CL Authors: Long Phan, Alice Gatti, Ziwen Han, Nathaniel Li, Josephina Hu, Hugh Zhang, Sean Shi, Michael Choi, Anish Agrawal, Arnav Chopra, Adam Khoja, Ryan Kim, Jason Hausenloy, Oliver Zhang, Mantas Mazeika, Daron Anderson, Tung Nguyen, Mobeen Mahmood, Fiona Feng, Steven Y. Feng, Haoran Zhao, Michael Yu, Varun Gangal, Chelsea Zou, Zihan Wang, Jessica P. Wang, Pawan Kumar, Oleksandr Pokutnyi, Robert Gerbicz, Serguei Popov, John-Clark Levin, Mstyslav Kazakov, Johannes Schmitt, Geoff Galgon, Alvaro Sanchez, Yongki Lee, Will Yeadon, Scott Sauers, Marc Roth, Chidozie Agu, Søren Riis, Fabian Giska, Saiteja Utpa...
2025-01-28
22 min
Daily Paper Cast
Improving Video Generation with Human Feedback
🤗 Upvotes: 30 | cs.CV, cs.AI, cs.GR, cs.LG Authors: Jie Liu, Gongye Liu, Jiajun Liang, Ziyang Yuan, Xiaokun Liu, Mingwu Zheng, Xiele Wu, Qiulin Wang, Wenyu Qin, Menghan Xia, Xintao Wang, Xiaohong Liu, Fei Yang, Pengfei Wan, Di Zhang, Kun Gai, Yujiu Yang, Wanli Ouyang Title: Improving Video Generation with Human Feedback Arxiv: http://arxiv.org/abs/2501.13918v1 Abstract: Video generation has achieved significant advances through rectified flow techniques, but issues like unsmooth motion and misalignment between videos and prompts persist. In this work, we develop a s...
2025-01-25
24 min
Daily Paper Cast
Temporal Preference Optimization for Long-Form Video Understanding
🤗 Upvotes: 15 | cs.CV, cs.AI, cs.CL, cs.LG, cs.RO Authors: Rui Li, Xiaohan Wang, Yuhui Zhang, Zeyu Wang, Serena Yeung-Levy Title: Temporal Preference Optimization for Long-Form Video Understanding Arxiv: http://arxiv.org/abs/2501.13919v1 Abstract: Despite significant advancements in video large multimodal models (video-LMMs), achieving effective temporal grounding in long-form videos remains a challenge for existing models. To address this limitation, we propose Temporal Preference Optimization (TPO), a novel post-training framework designed to enhance the temporal grounding capabilities of video-LMMs through preference learning. TPO adopts a s...
2025-01-25
24 min
Daily Paper Cast
One-Prompt-One-Story: Free-Lunch Consistent Text-to-Image Generation Using a Single Prompt
🤗 Upvotes: 5 | cs.CV, cs.AI, cs.LG Authors: Tao Liu, Kai Wang, Senmao Li, Joost van de Weijer, Fahad Shahbaz Khan, Shiqi Yang, Yaxing Wang, Jian Yang, Ming-Ming Cheng Title: One-Prompt-One-Story: Free-Lunch Consistent Text-to-Image Generation Using a Single Prompt Arxiv: http://arxiv.org/abs/2501.13554v1 Abstract: Text-to-image generation models can create high-quality images from input prompts. However, they struggle to support the consistent generation of identity-preserving requirements for storytelling. Existing approaches to this problem typically require extensive training in large datasets or additional modifications to the original mod...
2025-01-25
22 min
Daily Paper Cast
DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning
🤗 Upvotes: 109 | cs.CL, cs.AI, cs.LG Authors: DeepSeek-AI, Daya Guo, Dejian Yang, Haowei Zhang, Junxiao Song, Ruoyu Zhang, Runxin Xu, Qihao Zhu, Shirong Ma, Peiyi Wang, Xiao Bi, Xiaokang Zhang, Xingkai Yu, Yu Wu, Z. F. Wu, Zhibin Gou, Zhihong Shao, Zhuoshu Li, Ziyi Gao, Aixin Liu, Bing Xue, Bingxuan Wang, Bochao Wu, Bei Feng, Chengda Lu, Chenggang Zhao, Chengqi Deng, Chenyu Zhang, Chong Ruan, Damai Dai, Deli Chen, Dongjie Ji, Erhang Li, Fangyun Lin, Fucong Dai, Fuli Luo, Guangbo Hao, Guanting Chen, Guowei Li, H. Zhang, Han Bao, Hanwei Xu, Haocheng Wang, Honghui Ding, Hua...
2025-01-24
21 min
Daily Paper Cast
FilmAgent: A Multi-Agent Framework for End-to-End Film Automation in Virtual 3D Spaces
🤗 Upvotes: 43 | cs.CL, cs.GR, cs.MA Authors: Zhenran Xu, Longyue Wang, Jifang Wang, Zhouyi Li, Senbao Shi, Xue Yang, Yiyu Wang, Baotian Hu, Jun Yu, Min Zhang Title: FilmAgent: A Multi-Agent Framework for End-to-End Film Automation in Virtual 3D Spaces Arxiv: http://arxiv.org/abs/2501.12909v1 Abstract: Virtual film production requires intricate decision-making processes, including scriptwriting, virtual cinematography, and precise actor positioning and actions. Motivated by recent advances in automated decision-making with language agent-based societies, this paper introduces FilmAgent, a novel LLM-based multi-agent collaborative framework for end...
2025-01-24
24 min
Daily Paper Cast
Kimi k1.5: Scaling Reinforcement Learning with LLMs
🤗 Upvotes: 39 | cs.AI, cs.LG Authors: Kimi Team, Angang Du, Bofei Gao, Bowei Xing, Changjiu Jiang, Cheng Chen, Cheng Li, Chenjun Xiao, Chenzhuang Du, Chonghua Liao, Chuning Tang, Congcong Wang, Dehao Zhang, Enming Yuan, Enzhe Lu, Fengxiang Tang, Flood Sung, Guangda Wei, Guokun Lai, Haiqing Guo, Han Zhu, Hao Ding, Hao Hu, Hao Yang, Hao Zhang, Haotian Yao, Haotian Zhao, Haoyu Lu, Haoze Li, Haozhen Yu, Hongcheng Gao, Huabin Zheng, Huan Yuan, Jia Chen, Jianhang Guo, Jianlin Su, Jianzhou Wang, Jie Zhao, Jin Zhang, Jingyuan Liu, Junjie Yan, Junyan Wu, Lidong Shi, Ling Ye, Longhui Yu, Men...
2025-01-24
18 min
Daily Paper Cast
UI-TARS: Pioneering Automated GUI Interaction with Native Agents
🤗 Upvotes: 31 | cs.AI, cs.CL, cs.CV, cs.HC Authors: Yujia Qin, Yining Ye, Junjie Fang, Haoming Wang, Shihao Liang, Shizuo Tian, Junda Zhang, Jiahao Li, Yunxin Li, Shijue Huang, Wanjun Zhong, Kuanye Li, Jiale Yang, Yu Miao, Woyu Lin, Longxiang Liu, Xu Jiang, Qianli Ma, Jingyu Li, Xiaojun Xiao, Kai Cai, Chuang Li, Yaowei Zheng, Chaolin Jin, Chen Li, Xiao Zhou, Minchao Wang, Haoli Chen, Zhaojian Li, Haihua Yang, Haifeng Liu, Feng Lin, Tao Peng, Xin Liu, Guang Shi Title: UI-TARS: Pioneering Automated GUI Interaction with Native Agents Arxiv: htt...
2025-01-23
20 min
Daily Paper Cast
Mobile-Agent-E: Self-Evolving Mobile Assistant for Complex Tasks
🤗 Upvotes: 20 | cs.CL, cs.CV Authors: Zhenhailong Wang, Haiyang Xu, Junyang Wang, Xi Zhang, Ming Yan, Ji Zhang, Fei Huang, Heng Ji Title: Mobile-Agent-E: Self-Evolving Mobile Assistant for Complex Tasks Arxiv: http://arxiv.org/abs/2501.11733v1 Abstract: Smartphones have become indispensable in modern life, yet navigating complex tasks on mobile devices often remains frustrating. Recent advancements in large multimodal model (LMM)-based mobile agents have demonstrated the ability to perceive and act in mobile environments. However, current approaches face significant limitations: they fall short in addressing real-world hum...
2025-01-23
23 min
Daily Paper Cast
Hunyuan3D 2.0: Scaling Diffusion Models for High Resolution Textured 3D Assets Generation
🤗 Upvotes: 16 | cs.CV Authors: Zibo Zhao, Zeqiang Lai, Qingxiang Lin, Yunfei Zhao, Haolin Liu, Shuhui Yang, Yifei Feng, Mingxin Yang, Sheng Zhang, Xianghui Yang, Huiwen Shi, Sicong Liu, Junta Wu, Yihang Lian, Fan Yang, Ruining Tang, Zebin He, Xinzhou Wang, Jian Liu, Xuhui Zuo, Zhuo Chen, Biwen Lei, Haohan Weng, Jing Xu, Yiling Zhu, Xinhai Liu, Lixin Xu, Changrong Hu, Tianyu Huang, Lifu Wang, Jihong Zhang, Meng Chen, Liang Dong, Yiwen Jia, Yulin Cai, Jiaao Yu, Yixuan Tang, Hao Zhang, Zheng Ye, Peng He, Runzhou Wu, Chao Zhang, Yonghao Tan, Jie Xiao, Yangyu Tao, Jianchen Zhu, Jin...
2025-01-23
20 min
Daily Paper Cast
MinMo: A Multimodal Large Language Model for Seamless Voice Interaction
🤗 Upvotes: 21 | cs.CL, cs.AI, cs.HC, cs.SD, eess.AS Authors: Qian Chen, Yafeng Chen, Yanni Chen, Mengzhe Chen, Yingda Chen, Chong Deng, Zhihao Du, Ruize Gao, Changfeng Gao, Zhifu Gao, Yabin Li, Xiang Lv, Jiaqing Liu, Haoneng Luo, Bin Ma, Chongjia Ni, Xian Shi, Jialong Tang, Hui Wang, Hao Wang, Wen Wang, Yuxuan Wang, Yunlan Xu, Fan Yu, Zhijie Yan, Yexin Yang, Baosong Yang, Xian Yang, Guanrou Yang, Tianyu Zhao, Qinglin Zhang, Shiliang Zhang, Nan Zhao, Pei Zhang, Chong Zhang, Jinren Zhou Title: MinMo: A Multimodal Large Language Model for Seamless Voi...
2025-01-15
23 min
Daily Paper Cast
ConceptMaster: Multi-Concept Video Customization on Diffusion Transformer Models Without Test-Time Tuning
🤗 Upvotes: 10 | cs.CV Authors: Yuzhou Huang, Ziyang Yuan, Quande Liu, Qiulin Wang, Xintao Wang, Ruimao Zhang, Pengfei Wan, Di Zhang, Kun Gai Title: ConceptMaster: Multi-Concept Video Customization on Diffusion Transformer Models Without Test-Time Tuning Arxiv: http://arxiv.org/abs/2501.04698v1 Abstract: Text-to-video generation has made remarkable advancements through diffusion models. However, Multi-Concept Video Customization (MCVC) remains a significant challenge. We identify two key challenges in this task: 1) the identity decoupling problem, where directly adopting existing customization methods inevitably mix attributes when handling multiple concepts simultaneously, and 2) the sca...
2025-01-14
23 min
Daily Paper Cast
MotionBench: Benchmarking and Improving Fine-grained Video Motion Understanding for Vision Language Models
🤗 Upvotes: 32 | cs.CV Authors: Wenyi Hong, Yean Cheng, Zhuoyi Yang, Weihan Wang, Lefan Wang, Xiaotao Gu, Shiyu Huang, Yuxiao Dong, Jie Tang Title: MotionBench: Benchmarking and Improving Fine-grained Video Motion Understanding for Vision Language Models Arxiv: http://arxiv.org/abs/2501.02955v1 Abstract: In recent years, vision language models (VLMs) have made significant advancements in video understanding. However, a crucial capability - fine-grained motion comprehension - remains under-explored in current benchmarks. To address this gap, we propose MotionBench, a comprehensive evaluation benchmark designed to assess the fine-grained motion com...
2025-01-09
22 min
Daily Paper Cast
Cosmos World Foundation Model Platform for Physical AI
🤗 Upvotes: 31 | cs.CV, cs.AI, cs.LG, cs.RO Authors: NVIDIA, Niket Agarwal, Arslan Ali, Maciej Bala, Yogesh Balaji, Erik Barker, Tiffany Cai, Prithvijit Chattopadhyay, Yongxin Chen, Yin Cui, Yifan Ding, Daniel Dworakowski, Jiaojiao Fan, Michele Fenzi, Francesco Ferroni, Sanja Fidler, Dieter Fox, Songwei Ge, Yunhao Ge, Jinwei Gu, Siddharth Gururani, Ethan He, Jiahui Huang, Jacob Huffman, Pooya Jannaty, Jingyi Jin, Seung Wook Kim, Gergely Klár, Grace Lam, Shiyi Lan, Laura Leal-Taixe, Anqi Li, Zhaoshuo Li, Chen-Hsuan Lin, Tsung-Yi Lin, Huan Ling, Ming-Yu Liu, Xian Liu, Alice Luo, Qianli Ma, Hanzi Mao, Kaichun Mo, Arsalan Mous...
2025-01-09
25 min
Daily Paper Cast
TransPixar: Advancing Text-to-Video Generation with Transparency
🤗 Upvotes: 9 | cs.CV Authors: Luozhou Wang, Yijun Li, Zhifei Chen, Jui-Hsien Wang, Zhifei Zhang, He Zhang, Zhe Lin, Yingcong Chen Title: TransPixar: Advancing Text-to-Video Generation with Transparency Arxiv: http://arxiv.org/abs/2501.03006v1 Abstract: Text-to-video generative models have made significant strides, enabling diverse applications in entertainment, advertising, and education. However, generating RGBA video, which includes alpha channels for transparency, remains a challenge due to limited datasets and the difficulty of adapting existing models. Alpha channels are crucial for visual effects (VFX), allowing transparent elements like smoke and ref...
2025-01-08
22 min
Daily Paper Cast
Virgo: A Preliminary Exploration on Reproducing o1-like MLLM
🤗 Upvotes: 12 | cs.CV, cs.AI Authors: Yifan Du, Zikang Liu, Yifan Li, Wayne Xin Zhao, Yuqi Huo, Bingning Wang, Weipeng Chen, Zheng Liu, Zhongyuan Wang, Ji-Rong Wen Title: Virgo: A Preliminary Exploration on Reproducing o1-like MLLM Arxiv: http://arxiv.org/abs/2501.01904v1 Abstract: Recently, slow-thinking reasoning systems, built upon large language models (LLMs), have garnered widespread attention by scaling the thinking time during inference. There is also growing interest in adapting this capability to multimodal large language models (MLLMs). Given that MLLMs handle more complex data sem...
2025-01-07
22 min
Daily Paper Cast
On the Compositional Generalization of Multimodal LLMs for Medical Imaging
🤗 Upvotes: 29 | cs.CV, cs.AI, cs.CL, cs.LG Authors: Zhenyang Cai, Junying Chen, Rongsheng Wang, Weihong Wang, Yonglin Deng, Dingjie Song, Yize Chen, Zixu Zhang, Benyou Wang Title: On the Compositional Generalization of Multimodal LLMs for Medical Imaging Arxiv: http://arxiv.org/abs/2412.20070v1 Abstract: Multimodal large language models (MLLMs) hold significant potential in the medical field, but their capabilities are often limited by insufficient data in certain medical domains, highlighting the need for understanding what kinds of images can be used by MLLMs for generalization. Cur...
2025-01-01
22 min
Daily Paper Cast
HuatuoGPT-o1, Towards Medical Complex Reasoning with LLMs
🤗 Upvotes: 53 | cs.CL, cs.AI, cs.LG Authors: Junying Chen, Zhenyang Cai, Ke Ji, Xidong Wang, Wanlong Liu, Rongsheng Wang, Jianye Hou, Benyou Wang Title: HuatuoGPT-o1, Towards Medical Complex Reasoning with LLMs Arxiv: http://arxiv.org/abs/2412.18925v1 Abstract: The breakthrough of OpenAI o1 highlights the potential of enhancing reasoning to improve LLM. Yet, most research in reasoning has focused on mathematical tasks, leaving domains like medicine underexplored. The medical domain, though distinct from mathematics, also demands robust reasoning to provide reliable answers, given the high standards of...
2024-12-31
23 min
Daily Paper Cast
Task Preference Optimization: Improving Multimodal Large Language Models with Vision Task Alignment
🤗 Upvotes: 11 | cs.CV Authors: Ziang Yan, Zhilin Li, Yinan He, Chenting Wang, Kunchang Li, Xinhao Li, Xiangyu Zeng, Zilei Wang, Yali Wang, Yu Qiao, Limin Wang, Yi Wang Title: Task Preference Optimization: Improving Multimodal Large Language Models with Vision Task Alignment Arxiv: http://arxiv.org/abs/2412.19326v1 Abstract: Current multimodal large language models (MLLMs) struggle with fine-grained or precise understanding of visuals though they give comprehensive perception and reasoning in a spectrum of vision applications. Recent studies either develop tool-using or unify specific visual tasks into the aut...
2024-12-31
25 min
Daily Paper Cast
DepthLab: From Partial to Complete
🤗 Upvotes: 21 | cs.CV Authors: Zhiheng Liu, Ka Leong Cheng, Qiuyu Wang, Shuzhe Wang, Hao Ouyang, Bin Tan, Kai Zhu, Yujun Shen, Qifeng Chen, Ping Luo Title: DepthLab: From Partial to Complete Arxiv: http://arxiv.org/abs/2412.18153v1 Abstract: Missing values remain a common challenge for depth data across its wide range of applications, stemming from various causes like incomplete data acquisition and perspective alteration. This work bridges this gap with DepthLab, a foundation depth inpainting model powered by image diffusion priors. Our model features two notable strengths: (1) it...
2024-12-26
22 min
Daily Paper Cast
MotiF: Making Text Count in Image Animation with Motion Focal Loss
🤗 Upvotes: 3 | cs.CV, cs.AI Authors: Shijie Wang, Samaneh Azadi, Rohit Girdhar, Saketh Rambhatla, Chen Sun, Xi Yin Title: MotiF: Making Text Count in Image Animation with Motion Focal Loss Arxiv: http://arxiv.org/abs/2412.16153v1 Abstract: Text-Image-to-Video (TI2V) generation aims to generate a video from an image following a text description, which is also referred to as text-guided image animation. Most existing methods struggle to generate videos that align well with the text prompts, particularly when motion is specified. To overcome this limitation, we introduce Mot...
2024-12-26
22 min
Daily Paper Cast
OpenAI o1 System Card
🤗 Upvotes: 12 | cs.AI Authors: OpenAI, Aaron Jaech, Adam Kalai, Adam Lerer, Adam Richardson, Ahmed El-Kishky, Aiden Low, Alec Helyar, Aleksander Madry, Alex Beutel, Alex Carney, Alex Iftimie, Alex Karpenko, Alex Tachard Passos, Alexander Neitz, Alexander Prokofiev, Alexander Wei, Allison Tam, Ally Bennett, Ananya Kumar, Andre Saraiva, Andrea Vallone, Andrew Duberstein, Andrew Kondrich, Andrey Mishchenko, Andy Applebaum, Angela Jiang, Ashvin Nair, Barret Zoph, Behrooz Ghorbani, Ben Rossen, Benjamin Sokolowsky, Boaz Barak, Bob McGrew, Borys Minaiev, Botao Hao, Bowen Baker, Brandon Houghton, Brandon McKinzie, Brydon Eastman, Camillo Lugaresi, Cary Bassin, Cary Hudson, Chak Ming Li, Charles de Bourcy, Che...
2024-12-25
25 min
Daily Paper Cast
Outcome-Refining Process Supervision for Code Generation
🤗 Upvotes: 11 | cs.CL, cs.AI, cs.LG, cs.SE Authors: Zhuohao Yu, Weizheng Gu, Yidong Wang, Zhengran Zeng, Jindong Wang, Wei Ye, Shikun Zhang Title: Outcome-Refining Process Supervision for Code Generation Arxiv: http://arxiv.org/abs/2412.15118v1 Abstract: Large Language Models have demonstrated remarkable capabilities in code generation, yet they often struggle with complex programming tasks that require deep algorithmic reasoning. While process supervision through learned reward models shows promise in guiding reasoning steps, it requires expensive training data and suffers from unreliable evaluation. We propose Outcome-Refining Pro...
2024-12-25
21 min
Daily Paper Cast
LearnLM: Improving Gemini for Learning
🤗 Upvotes: 9 | cs.CY, cs.AI, cs.LG Authors: LearnLM Team, Abhinit Modi, Aditya Srikanth Veerubhotla, Aliya Rysbek, Andrea Huber, Brett Wiltshire, Brian Veprek, Daniel Gillick, Daniel Kasenberg, Derek Ahmed, Irina Jurenka, James Cohan, Jennifer She, Julia Wilkowski, Kaiz Alarakyia, Kevin McKee, Lisa Wang, Markus Kunesch, Mike Schaekermann, Miruna Pîslar, Nikhil Joshi, Parsa Mahmoudieh, Paul Jhun, Sara Wiltberger, Shakir Mohamed, Shashank Agarwal, Shubham Milind Phal, Sun Jae Lee, Theofilos Strinopoulos, Wei-Jen Ko, Amy Wang, Ankit Anand, Avishkar Bhoopchand, Dan Wild, Divya Pandya, Filip Bar, Garth Graham, Holger Winnemoeller, Mahvish Nagda, Prateek Kolhar, Renee Schneider, Shaojian Zhu, Step...
2024-12-25
27 min
Daily Paper Cast
LeviTor: 3D Trajectory Oriented Image-to-Video Synthesis
🤗 Upvotes: 12 | cs.CV Authors: Hanlin Wang, Hao Ouyang, Qiuyu Wang, Wen Wang, Ka Leong Cheng, Qifeng Chen, Yujun Shen, Limin Wang Title: LeviTor: 3D Trajectory Oriented Image-to-Video Synthesis Arxiv: http://arxiv.org/abs/2412.15214v1 Abstract: The intuitive nature of drag-based interaction has led to its growing adoption for controlling object trajectories in image-to-video synthesis. Still, existing methods that perform dragging in the 2D space usually face ambiguity when handling out-of-plane movements. In this work, we augment the interaction with a new dimension, i.e., the depth dimension, suc...
2024-12-21
21 min
Daily Paper Cast
AniDoc: Animation Creation Made Easier
🤗 Upvotes: 29 | cs.CV Authors: Yihao Meng, Hao Ouyang, Hanlin Wang, Qiuyu Wang, Wen Wang, Ka Leong Cheng, Zhiheng Liu, Yujun Shen, Huamin Qu Title: AniDoc: Animation Creation Made Easier Arxiv: http://arxiv.org/abs/2412.14173v1 Abstract: The production of 2D animation follows an industry-standard workflow, encompassing four essential stages: character design, keyframe animation, in-betweening, and coloring. Our research focuses on reducing the labor costs in the above process by harnessing the potential of increasingly powerful generative AI. Using video diffusion models as the foundation, AniDoc emerges as a v...
2024-12-20
22 min
Daily Paper Cast
SPaR: Self-Play with Tree-Search Refinement to Improve Instruction-Following in Large Language Models
🤗 Upvotes: 11 | cs.CL, cs.AI, cs.LG Authors: Jiale Cheng, Xiao Liu, Cunxiang Wang, Xiaotao Gu, Yida Lu, Dan Zhang, Yuxiao Dong, Jie Tang, Hongning Wang, Minlie Huang Title: SPaR: Self-Play with Tree-Search Refinement to Improve Instruction-Following in Large Language Models Arxiv: http://arxiv.org/abs/2412.11605v1 Abstract: Instruction-following is a fundamental capability of language models, requiring the model to recognize even the most subtle requirements in the instructions and accurately reflect them in its output. Such an ability is well-suited for and often optimized by preference lea...
2024-12-18
23 min
Daily Paper Cast
Apollo: An Exploration of Video Understanding in Large Multimodal Models
🤗 Upvotes: 91 | cs.CV, cs.AI Authors: Orr Zohar, Xiaohan Wang, Yann Dubois, Nikhil Mehta, Tong Xiao, Philippe Hansen-Estruch, Licheng Yu, Xiaofang Wang, Felix Juefei-Xu, Ning Zhang, Serena Yeung-Levy, Xide Xia Title: Apollo: An Exploration of Video Understanding in Large Multimodal Models Arxiv: http://arxiv.org/abs/2412.10360v1 Abstract: Despite the rapid integration of video perception capabilities into Large Multimodal Models (LMMs), the underlying mechanisms driving their video understanding remain poorly understood. Consequently, many design decisions in this domain are made without proper justification or analysis. The high com...
2024-12-17
24 min
Daily Paper Cast
SynerGen-VL: Towards Synergistic Image Understanding and Generation with Vision Experts and Token Folding
🤗 Upvotes: 29 | cs.CV Authors: Hao Li, Changyao Tian, Jie Shao, Xizhou Zhu, Zhaokai Wang, Jinguo Zhu, Wenhan Dou, Xiaogang Wang, Hongsheng Li, Lewei Lu, Jifeng Dai Title: SynerGen-VL: Towards Synergistic Image Understanding and Generation with Vision Experts and Token Folding Arxiv: http://arxiv.org/abs/2412.09604v1 Abstract: The remarkable success of Large Language Models (LLMs) has extended to the multimodal domain, achieving outstanding performance in image understanding and generation. Recent efforts to develop unified Multimodal Large Language Models (MLLMs) that integrate these capabilities have shown promising results. How...
2024-12-17
25 min
Daily Paper Cast
Multimodal Latent Language Modeling with Next-Token Diffusion
🤗 Upvotes: 21 | cs.CL, cs.CV, cs.LG Authors: Yutao Sun, Hangbo Bao, Wenhui Wang, Zhiliang Peng, Li Dong, Shaohan Huang, Jianyong Wang, Furu Wei Title: Multimodal Latent Language Modeling with Next-Token Diffusion Arxiv: http://arxiv.org/abs/2412.08635v1 Abstract: Multimodal generative models require a unified approach to handle both discrete data (e.g., text and code) and continuous data (e.g., image, audio, video). In this work, we propose Latent Language Modeling (LatentLM), which seamlessly integrates continuous and discrete data using causal Transformers. Specifically, we employ a var...
2024-12-14
22 min
Daily Paper Cast
AgentTrek: Agent Trajectory Synthesis via Guiding Replay with Web Tutorials
🤗 Upvotes: 16 | cs.CL Authors: Yiheng Xu, Dunjie Lu, Zhennan Shen, Junli Wang, Zekun Wang, Yuchen Mao, Caiming Xiong, Tao Yu Title: AgentTrek: Agent Trajectory Synthesis via Guiding Replay with Web Tutorials Arxiv: http://arxiv.org/abs/2412.09605v1 Abstract: Graphical User Interface (GUI) agents hold great potential for automating complex tasks across diverse digital environments, from web applications to desktop software. However, the development of such agents is hindered by the lack of high-quality, multi-step trajectory data required for effective training. Existing approaches rely on expensive and labor-intensive hum...
2024-12-14
18 min
Daily Paper Cast
LiFT: Leveraging Human Feedback for Text-to-Video Model Alignment
🤗 Upvotes: 33 | cs.CV Authors: Yibin Wang, Zhiyu Tan, Junyan Wang, Xiaomeng Yang, Cheng Jin, Hao Li Title: LiFT: Leveraging Human Feedback for Text-to-Video Model Alignment Arxiv: http://arxiv.org/abs/2412.04814v1 Abstract: Recent advancements in text-to-video (T2V) generative models have shown impressive capabilities. However, these models are still inadequate in aligning synthesized videos with human preferences (e.g., accurately reflecting text descriptions), which is particularly difficult to address, as human preferences are inherently subjective and challenging to formalize as objective functions. Therefore, this paper proposes LiFT, a n...
2024-12-10
20 min
Daily Paper Cast
Aguvis: Unified Pure Vision Agents for Autonomous GUI Interaction
🤗 Upvotes: 32 | cs.CL Authors: Yiheng Xu, Zekun Wang, Junli Wang, Dunjie Lu, Tianbao Xie, Amrita Saha, Doyen Sahoo, Tao Yu, Caiming Xiong Title: Aguvis: Unified Pure Vision Agents for Autonomous GUI Interaction Arxiv: http://arxiv.org/abs/2412.04454v1 Abstract: Graphical User Interfaces (GUIs) are critical to human-computer interaction, yet automating GUI tasks remains challenging due to the complexity and variability of visual environments. Existing approaches often rely on textual representations of GUIs, which introduce limitations in generalization, efficiency, and scalability. In this paper, we introduce Aguvis, a uni...
2024-12-08
20 min
Daily Paper Cast
Code-as-Monitor: Constraint-aware Visual Programming for Reactive and Proactive Robotic Failure Detection
🤗 Upvotes: 32 | cs.RO, cs.AI, cs.CV, cs.LG Authors: Enshen Zhou, Qi Su, Cheng Chi, Zhizheng Zhang, Zhongyuan Wang, Tiejun Huang, Lu Sheng, He Wang Title: Code-as-Monitor: Constraint-aware Visual Programming for Reactive and Proactive Robotic Failure Detection Arxiv: http://arxiv.org/abs/2412.04455v1 Abstract: Automatic detection and prevention of open-set failures are crucial in closed-loop robotic systems. Recent studies often struggle to simultaneously identify unexpected failures reactively after they occur and prevent foreseeable ones proactively. To this end, we propose Code-as-Monitor (CaM), a novel paradigm leveraging the...
2024-12-08
22 min
Daily Paper Cast
Material Anything: Generating Materials for Any 3D Object via Diffusion
🤗 Paper Upvotes: 33 | cs.CV, cs.GR Authors: Xin Huang, Tengfei Wang, Ziwei Liu, Qing Wang Title: Material Anything: Generating Materials for Any 3D Object via Diffusion Arxiv: http://arxiv.org/abs/2411.15138v1 Abstract: We present Material Anything, a fully-automated, unified diffusion framework designed to generate physically-based materials for 3D objects. Unlike existing methods that rely on complex pipelines or case-specific optimizations, Material Anything offers a robust, end-to-end solution adaptable to objects under diverse lighting conditions. Our approach leverages a pre-trained image diffusion model, enhanced with a triple-head arc...
2024-11-27
21 min
Daily Paper Cast
Enhancing the Reasoning Ability of Multimodal Large Language Models via Mixed Preference Optimization
🤗 Paper Upvotes: 42 | cs.CL, cs.CV Authors: Weiyun Wang, Zhe Chen, Wenhai Wang, Yue Cao, Yangzhou Liu, Zhangwei Gao, Jinguo Zhu, Xizhou Zhu, Lewei Lu, Yu Qiao, Jifeng Dai Title: Enhancing the Reasoning Ability of Multimodal Large Language Models via Mixed Preference Optimization Arxiv: http://arxiv.org/abs/2411.10442v1 Abstract: Existing open-source multimodal large language models (MLLMs) generally follow a training process involving pre-training and supervised fine-tuning. However, these models suffer from distribution shifts, which limit their multimodal reasoning, particularly in the Chain-of-Thought (CoT) performance. To address thi...
2024-11-23
19 min
Daily Paper Cast
Marco-o1: Towards Open Reasoning Models for Open-Ended Solutions
🤗 Paper Upvotes: 23 | cs.CL Authors: Yu Zhao, Huifeng Yin, Bo Zeng, Hao Wang, Tianqi Shi, Chenyang Lyu, Longyue Wang, Weihua Luo, Kaifu Zhang Title: Marco-o1: Towards Open Reasoning Models for Open-Ended Solutions Arxiv: http://arxiv.org/abs/2411.14405v1 Abstract: Currently OpenAI o1 has sparked a surge of interest in the study of large reasoning models (LRM). Building on this momentum, Marco-o1 not only focuses on disciplines with standard answers, such as mathematics, physics, and coding -- which are well-suited for reinforcement learning (RL) -- but also places gre...
2024-11-23
19 min
Daily Paper Cast
VBench++: Comprehensive and Versatile Benchmark Suite for Video Generative Models
🤗 Paper Upvotes: 23 | cs.CV Authors: Ziqi Huang, Fan Zhang, Xiaojie Xu, Yinan He, Jiashuo Yu, Ziyue Dong, Qianli Ma, Nattapol Chanpaisit, Chenyang Si, Yuming Jiang, Yaohui Wang, Xinyuan Chen, Ying-Cong Chen, Limin Wang, Dahua Lin, Yu Qiao, Ziwei Liu Title: VBench++: Comprehensive and Versatile Benchmark Suite for Video Generative Models Arxiv: http://arxiv.org/abs/2411.13503v1 Abstract: Video generation has witnessed significant advancements, yet evaluating these models remains a challenge. A comprehensive evaluation benchmark for video generation is indispensable for two reasons: 1) Existing metrics do not fully ali...
2024-11-22
24 min
Daily Paper Cast
SEAGULL: No-reference Image Quality Assessment for Regions of Interest via Vision-Language Instruction Tuning
🤗 Paper Upvotes: 5 | cs.CV Authors: Zewen Chen, Juan Wang, Wen Wang, Sunhan Xu, Hang Xiong, Yun Zeng, Jian Guo, Shuxun Wang, Chunfeng Yuan, Bing Li, Weiming Hu Title: SEAGULL: No-reference Image Quality Assessment for Regions of Interest via Vision-Language Instruction Tuning Arxiv: http://arxiv.org/abs/2411.10161v1 Abstract: Existing Image Quality Assessment (IQA) methods achieve remarkable success in analyzing quality for overall image, but few works explore quality analysis for Regions of Interest (ROIs). The quality analysis of ROIs can provide fine-grained guidance for image quality improvement and...
2024-11-21
20 min
Daily Paper Cast
AnimateAnything: Consistent and Controllable Animation for Video Generation
🤗 Paper Upvotes: 12 | cs.CV Authors: Guojun Lei, Chi Wang, Hong Li, Rong Zhang, Yikai Wang, Weiwei Xu Title: AnimateAnything: Consistent and Controllable Animation for Video Generation Arxiv: http://arxiv.org/abs/2411.10836v1 Abstract: We present a unified controllable video generation approach AnimateAnything that facilitates precise and consistent video manipulation across various conditions, including camera trajectories, text prompts, and user motion annotations. Specifically, we carefully design a multi-scale control feature fusion network to construct a common motion representation for different conditions. It explicitly converts all control information into fra...
2024-11-20
22 min
Daily Paper Cast
LLaMA-Mesh: Unifying 3D Mesh Generation with Language Models
🤗 Paper Upvotes: 32 | cs.LG, cs.AI, cs.CL, cs.CV, 68T05, I.3.5; I.2.10; I.2.6 Authors: Zhengyi Wang, Jonathan Lorraine, Yikai Wang, Hang Su, Jun Zhu, Sanja Fidler, Xiaohui Zeng Title: LLaMA-Mesh: Unifying 3D Mesh Generation with Language Models Arxiv: http://arxiv.org/abs/2411.09595v1 Abstract: This work explores expanding the capabilities of large language models (LLMs) pretrained on text to generate 3D meshes within a unified model. This offers key advantages of (1) leveraging spatial knowledge already embedded in LLMs, derived from textual sources like 3D tutorials, and (2) ena...
2024-11-16
23 min
Daily Paper Cast
MagicQuill: An Intelligent Interactive Image Editing System
🤗 Paper Upvotes: 31 | cs.CV Authors: Zichen Liu, Yue Yu, Hao Ouyang, Qiuyu Wang, Ka Leong Cheng, Wen Wang, Zhiheng Liu, Qifeng Chen, Yujun Shen Title: MagicQuill: An Intelligent Interactive Image Editing System Arxiv: http://arxiv.org/abs/2411.09703v1 Abstract: Image editing involves a variety of complex tasks and requires efficient and precise manipulation techniques. In this paper, we present MagicQuill, an integrated image editing system that enables swift actualization of creative ideas. Our system features a streamlined yet functionally robust interface, allowing for the articulation of editing ope...
2024-11-16
20 min
Daily Paper Cast
HtmlRAG: HTML is Better Than Plain Text for Modeling Retrieved Knowledge in RAG Systems
🤗 Paper Upvotes: 34 | cs.IR Authors: Jiejun Tan, Zhicheng Dou, Wen Wang, Mang Wang, Weipeng Chen, Ji-Rong Wen Title: HtmlRAG: HTML is Better Than Plain Text for Modeling Retrieved Knowledge in RAG Systems Arxiv: http://arxiv.org/abs/2411.02959v1 Abstract: Retrieval-Augmented Generation (RAG) has been shown to improve knowledge capabilities and alleviate the hallucination problem of LLMs. The Web is a major source of external knowledge used in RAG systems, and many commercial systems such as ChatGPT and Perplexity have used Web search engines as their major retrieval sys...
2024-11-07
21 min
Daily Paper Cast
DeeR-VLA: Dynamic Inference of Multimodal Large Language Models for Efficient Robot Execution
🤗 Paper Upvotes: 10 | cs.RO, cs.AI, cs.LG Authors: Yang Yue, Yulin Wang, Bingyi Kang, Yizeng Han, Shenzhi Wang, Shiji Song, Jiashi Feng, Gao Huang Title: DeeR-VLA: Dynamic Inference of Multimodal Large Language Models for Efficient Robot Execution Arxiv: http://arxiv.org/abs/2411.02359v1 Abstract: MLLMs have demonstrated remarkable comprehension and reasoning capabilities with complex language and visual data. These advances have spurred the vision of establishing a generalist robotic MLLM proficient in understanding complex human instructions and accomplishing various embodied tasks. However, developing MLLMs for real-world rob...
2024-11-07
19 min
Daily Paper Cast
Training-free Regional Prompting for Diffusion Transformers
🤗 Paper Upvotes: 19 | cs.CV Authors: Anthony Chen, Jianjin Xu, Wenzhao Zheng, Gaole Dai, Yida Wang, Renrui Zhang, Haofan Wang, Shanghang Zhang Title: Training-free Regional Prompting for Diffusion Transformers Arxiv: http://arxiv.org/abs/2411.02395v1 Abstract: Diffusion models have demonstrated excellent capabilities in text-to-image generation. Their semantic understanding (i.e., prompt following) ability has also been greatly improved with large language models (e.g., T5, Llama). However, existing models cannot perfectly handle long and complex text prompts, especially when the text prompts contain various objects with numerous attributes and...
2024-11-06
17 min
Daily Paper Cast
Hunyuan-Large: An Open-Source MoE Model with 52 Billion Activated Parameters by Tencent
🤗 Paper Upvotes: 16 | cs.CL, cs.AI Authors: Xingwu Sun, Yanfeng Chen, Yiqing Huang, Ruobing Xie, Jiaqi Zhu, Kai Zhang, Shuaipeng Li, Zhen Yang, Jonny Han, Xiaobo Shu, Jiahao Bu, Zhongzhi Chen, Xuemeng Huang, Fengzong Lian, Saiyong Yang, Jianfeng Yan, Yuyuan Zeng, Xiaoqin Ren, Chao Yu, Lulu Wu, Yue Mao, Jun Xia, Tao Yang, Suncong Zheng, Kan Wu, Dian Jiao, Jinbao Xue, Xipeng Zhang, Decheng Wu, Kai Liu, Dengpeng Wu, Guanghui Xu, Shaohua Chen, Shuang Chen, Xiao Feng, Yigeng Hong, Junqiang Zheng, Chengcheng Xu, Zongwei Li, Xiong Kuang, Jianglu Hu, Yiqi Chen, Yuchi Deng, Guiyang Li, Ao Liu...
2024-11-06
18 min
Daily Paper Cast
GenXD: Generating Any 3D and 4D Scenes
🤗 Paper Upvotes: 13 | cs.CV, cs.AI Authors: Yuyang Zhao, Chung-Ching Lin, Kevin Lin, Zhiwen Yan, Linjie Li, Zhengyuan Yang, Jianfeng Wang, Gim Hee Lee, Lijuan Wang Title: GenXD: Generating Any 3D and 4D Scenes Arxiv: http://arxiv.org/abs/2411.02319v2 Abstract: Recent developments in 2D visual generation have been remarkably successful. However, 3D and 4D generation remain challenging in real-world applications due to the lack of large-scale 4D data and effective model design. In this paper, we propose to jointly investigate general 3D and 4D generation by lev...
2024-11-06
22 min
Daily Paper Cast
Personalization of Large Language Models: A Survey
🤗 Paper Upvotes: 14 | cs.CL Authors: Zhehao Zhang, Ryan A. Rossi, Branislav Kveton, Yijia Shao, Diyi Yang, Hamed Zamani, Franck Dernoncourt, Joe Barrow, Tong Yu, Sungchul Kim, Ruiyi Zhang, Jiuxiang Gu, Tyler Derr, Hongjie Chen, Junda Wu, Xiang Chen, Zichao Wang, Subrata Mitra, Nedim Lipka, Nesreen Ahmed, Yu Wang Title: Personalization of Large Language Models: A Survey Arxiv: http://arxiv.org/abs/2411.00027v1 Abstract: Personalization of Large Language Models (LLMs) has recently become increasingly important with a wide range of applications. Despite the importance and recent progress, most exi...
2024-11-05
25 min
Daily Paper Cast (Test)
BitStack: Fine-Grained Size Control for Compressed Large Language Models in Variable Memory Environments
🤗 Daily Paper Upvotes: 11 Authors: Xinghao Wang, Pengyu Wang, Bo Wang, Dong Zhang, Yunhua Zhou, Xipeng Qiu Categories: cs.CL, cs.AI, cs.CV, cs.LG Arxiv: http://arxiv.org/abs/2410.23918v1 Title: BitStack: Fine-Grained Size Control for Compressed Large Language Models in Variable Memory Environments Abstract: Large language models (LLMs) have revolutionized numerous applications, yet their deployment remains challenged by memory constraints on local devices. While scaling laws have enhanced LLM capabilities, the primary bottleneck has shifted from capability to availability, emphasizing the need for efficient memory management. Traditional compression methods, such as quantization, often require predefined compression rat...
2024-11-03
17 min