Showing episodes and shows of Gengyu Wang
Shows
Daily Paper Cast
Perception, Reason, Think, and Plan: A Survey on Large Multimodal Reasoning Models
🤗 Upvotes: 79 | cs.CV, cs.CL Authors: Yunxin Li, Zhenyu Liu, Zitao Li, Xuanyu Zhang, Zhenran Xu, Xinyu Chen, Haoyuan Shi, Shenyuan Jiang, Xintong Wang, Jifang Wang, Shouzheng Huang, Xinping Zhao, Borui Jiang, Lanqing Hong, Longyue Wang, Zhuotao Tian, Baoxing Huai, Wenhan Luo, Weihua Luo, Zheng Zhang, Baotian Hu, Min Zhang Title: Perception, Reason, Think, and Plan: A Survey on Large Multimodal Reasoning Models Arxiv: http://arxiv.org/abs/2505.04921v1 Abstract: Reasoning lies at the heart of intelligence, shaping the ability to make decisions, draw conclusions, and generalize acr...
2025-05-10
23 min
Daily Paper Cast
Unified Multimodal Chain-of-Thought Reward Model through Reinforcement Fine-Tuning
🤗 Upvotes: 67 | cs.CV Authors: Yibin Wang, Zhimin Li, Yuhang Zang, Chunyu Wang, Qinglin Lu, Cheng Jin, Jiaqi Wang Title: Unified Multimodal Chain-of-Thought Reward Model through Reinforcement Fine-Tuning Arxiv: http://arxiv.org/abs/2505.03318v1 Abstract: Recent advances in multimodal Reward Models (RMs) have shown significant promise in delivering reward signals to align vision models with human preferences. However, current RMs are generally restricted to providing direct responses or engaging in shallow reasoning processes with limited depth, often leading to inaccurate reward signals. We posit that incorporating explicit long cha...
2025-05-08
22 min
Daily Paper Cast
RM-R1: Reward Modeling as Reasoning
🤗 Upvotes: 48 | cs.CL, cs.AI, cs.LG Authors: Xiusi Chen, Gaotang Li, Ziqi Wang, Bowen Jin, Cheng Qian, Yu Wang, Hongru Wang, Yu Zhang, Denghui Zhang, Tong Zhang, Hanghang Tong, Heng Ji Title: RM-R1: Reward Modeling as Reasoning Arxiv: http://arxiv.org/abs/2505.02387v1 Abstract: Reward modeling is essential for aligning large language models (LLMs) with human preferences, especially through reinforcement learning from human feedback (RLHF). To provide accurate reward signals, a reward model (RM) should stimulate deep thinking and conduct interpretable reasoning before assigning a score or...
2025-05-07
23 min
Daily Paper Cast
Reinforcement Learning for Reasoning in Large Language Models with One Training Example
🤗 Upvotes: 49 | cs.LG, cs.AI, cs.CL Authors: Yiping Wang, Qing Yang, Zhiyuan Zeng, Liliang Ren, Lucas Liu, Baolin Peng, Hao Cheng, Xuehai He, Kuan Wang, Jianfeng Gao, Weizhu Chen, Shuohang Wang, Simon Shaolei Du, Yelong Shen Title: Reinforcement Learning for Reasoning in Large Language Models with One Training Example Arxiv: http://arxiv.org/abs/2504.20571v1 Abstract: We show that reinforcement learning with verifiable reward using one training example (1-shot RLVR) is effective in incentivizing the math reasoning capabilities of large language models (LLMs). Applying RLVR to the...
2025-05-01
22 min
Daily Paper Cast
RepText: Rendering Visual Text via Replicating
🤗 Upvotes: 22 | cs.CV Authors: Haofan Wang, Yujia Xu, Yimeng Li, Junchen Li, Chaowei Zhang, Jing Wang, Kejia Yang, Zhibo Chen Title: RepText: Rendering Visual Text via Replicating Arxiv: http://arxiv.org/abs/2504.19724v1 Abstract: Although contemporary text-to-image generation models have achieved remarkable breakthroughs in producing visually appealing images, their capacity to generate precise and flexible typographic elements, especially non-Latin alphabets, remains constrained. To address these limitations, we start from a naive assumption that text understanding is only a sufficient condition for text rendering, but not a necessary con...
2025-04-30
21 min
Daily Paper Cast
Step1X-Edit: A Practical Framework for General Image Editing
🤗 Upvotes: 55 | cs.CV Authors: Shiyu Liu, Yucheng Han, Peng Xing, Fukun Yin, Rui Wang, Wei Cheng, Jiaqi Liao, Yingming Wang, Honghao Fu, Chunrui Han, Guopeng Li, Yuang Peng, Quan Sun, Jingwei Wu, Yan Cai, Zheng Ge, Ranchen Ming, Lei Xia, Xianfang Zeng, Yibo Zhu, Binxing Jiao, Xiangyu Zhang, Gang Yu, Daxin Jiang Title: Step1X-Edit: A Practical Framework for General Image Editing Arxiv: http://arxiv.org/abs/2504.17761v1 Abstract: In recent years, image editing models have witnessed remarkable and rapid development. The recent unveiling of cutting-edge multimodal mod...
2025-04-26
20 min
Daily Paper Cast
The Bitter Lesson Learned from 2,000+ Multilingual Benchmarks
🤗 Upvotes: 51 | cs.CL Authors: Minghao Wu, Weixuan Wang, Sinuo Liu, Huifeng Yin, Xintong Wang, Yu Zhao, Chenyang Lyu, Longyue Wang, Weihua Luo, Kaifu Zhang Title: The Bitter Lesson Learned from 2,000+ Multilingual Benchmarks Arxiv: http://arxiv.org/abs/2504.15521v1 Abstract: As large language models (LLMs) continue to advance in linguistic capabilities, robust multilingual evaluation has become essential for promoting equitable technological progress. This position paper examines over 2,000 multilingual (non-English) benchmarks from 148 countries, published between 2021 and 2024, to evaluate past, present, and future practices in multilingual benchmarking. Our findings reveal tha...
2025-04-24
22 min
Daily Paper Cast
Eagle 2.5: Boosting Long-Context Post-Training for Frontier Vision-Language Models
🤗 Upvotes: 50 | cs.CV Authors: Guo Chen, Zhiqi Li, Shihao Wang, Jindong Jiang, Yicheng Liu, Lidong Lu, De-An Huang, Wonmin Byeon, Matthieu Le, Tuomas Rintamaki, Tyler Poon, Max Ehrlich, Tong Lu, Limin Wang, Bryan Catanzaro, Jan Kautz, Andrew Tao, Zhiding Yu, Guilin Liu Title: Eagle 2.5: Boosting Long-Context Post-Training for Frontier Vision-Language Models Arxiv: http://arxiv.org/abs/2504.15271v1 Abstract: We introduce Eagle 2.5, a family of frontier vision-language models (VLMs) for long-context multimodal learning. Our work addresses the challenges in long video comprehension and high-resolution ima...
2025-04-23
20 min
Daily Paper Cast
InternVL3: Exploring Advanced Training and Test-Time Recipes for Open-Source Multimodal Models
🤗 Upvotes: 172 | cs.CV Authors: Jinguo Zhu, Weiyun Wang, Zhe Chen, Zhaoyang Liu, Shenglong Ye, Lixin Gu, Yuchen Duan, Hao Tian, Weijie Su, Jie Shao, Zhangwei Gao, Erfei Cui, Yue Cao, Yangzhou Liu, Xingguang Wei, Hongjie Zhang, Haomin Wang, Weiye Xu, Hao Li, Jiahao Wang, Dengnian Chen, Songze Li, Yinan He, Tan Jiang, Jiapeng Luo, Yi Wang, Conghui He, Botian Shi, Xingcheng Zhang, Wenqi Shao, Junjun He, Yingtong Xiong, Wenwen Qu, Peng Sun, Penglong Jiao, Han Lv, Lijun Wu, Kaipeng Zhang, Huipeng Deng, Jiaye Ge, Kai Chen, Limin Wang, Min Dou, Lewei Lu, Xizhou Zhu, Tong Lu, Dah...
2025-04-16
22 min
Daily Paper Cast
Seaweed-7B: Cost-Effective Training of Video Generation Foundation Model
🤗 Upvotes: 83 | cs.CV, cs.AI Authors: Team Seaweed, Ceyuan Yang, Zhijie Lin, Yang Zhao, Shanchuan Lin, Zhibei Ma, Haoyuan Guo, Hao Chen, Lu Qi, Sen Wang, Feng Cheng, Feilong Zuo, Xuejiao Zeng, Ziyan Yang, Fangyuan Kong, Zhiwu Qing, Fei Xiao, Meng Wei, Tuyen Hoang, Siyu Zhang, Peihao Zhu, Qi Zhao, Jiangqiao Yan, Liangke Gui, Sheng Bi, Jiashi Li, Yuxi Ren, Rui Wang, Huixia Li, Xuefeng Xiao, Shu Liu, Feng Ling, Heng Zhang, Houmin Wei, Huafeng Kuang, Jerry Duncan, Junda Zhang, Junru Zheng, Li Sun, Manlin Zhang, Renfei Sun, Xiaobin Zhuang, Xiaojie Li, Xin Xia, Xuyan Chi, Yan...
2025-04-15
22 min
Daily Paper Cast
Kimi-VL Technical Report
🤗 Upvotes: 71 | cs.CV Authors: Kimi Team, Angang Du, Bohong Yin, Bowei Xing, Bowen Qu, Bowen Wang, Cheng Chen, Chenlin Zhang, Chenzhuang Du, Chu Wei, Congcong Wang, Dehao Zhang, Dikang Du, Dongliang Wang, Enming Yuan, Enzhe Lu, Fang Li, Flood Sung, Guangda Wei, Guokun Lai, Han Zhu, Hao Ding, Hao Hu, Hao Yang, Hao Zhang, Haoning Wu, Haotian Yao, Haoyu Lu, Heng Wang, Hongcheng Gao, Huabin Zheng, Jiaming Li, Jianlin Su, Jianzhou Wang, Jiaqi Deng, Jiezhong Qiu, Jin Xie, Jinhong Wang, Jingyuan Liu, Junjie Yan, Kun Ouyang, Liang Chen, Lin Sui, Longhui Yu, Mengfan Dong, Mengnan Dong, Nuo...
2025-04-12
23 min
Daily Paper Cast
DDT: Decoupled Diffusion Transformer
🤗 Upvotes: 51 | cs.CV, cs.AI Authors: Shuai Wang, Zhi Tian, Weilin Huang, Limin Wang Title: DDT: Decoupled Diffusion Transformer Arxiv: http://arxiv.org/abs/2504.05741v2 Abstract: Diffusion transformers have demonstrated remarkable generation quality, albeit requiring longer training iterations and numerous inference steps. In each denoising step, diffusion transformers encode the noisy inputs to extract the lower-frequency semantic component and then decode the higher frequency with identical modules. This scheme creates an inherent optimization dilemma: encoding low-frequency semantics necessitates reducing high-frequency components, creating tension between semantic encoding and hig...
2025-04-11
19 min
Daily Paper Cast
A Unified Agentic Framework for Evaluating Conditional Image Generation
🤗 Upvotes: 25 | cs.CV, cs.CL Authors: Jifang Wang, Xue Yang, Longyue Wang, Zhenran Xu, Yiyu Wang, Yaowei Wang, Weihua Luo, Kaifu Zhang, Baotian Hu, Min Zhang Title: A Unified Agentic Framework for Evaluating Conditional Image Generation Arxiv: http://arxiv.org/abs/2504.07046v1 Abstract: Conditional image generation has gained significant attention for its ability to personalize content. However, the field faces challenges in developing task-agnostic, reliable, and explainable evaluation metrics. This paper introduces CIGEval, a unified agentic framework for comprehensive evaluation of conditional image generation tasks. CIGEval utilizes lar...
2025-04-11
21 min
Daily Paper Cast
COIG-P: A High-Quality and Large-Scale Chinese Preference Dataset for Alignment with Human Values
🤗 Upvotes: 36 | cs.CL Authors: M-A-P Team, Siwei Wu, Jincheng Ren, Xinrun Du, Shuyue Guo, Xingwei Qu, Yiming Liang, Jie Liu, Yunwen Li, Tianyu Zheng, Boyu Feng, Huaqing Yuan, Zenith Wang, Jiaheng Liu, Wenhao Huang, Chenglin Cai, Haoran Que, Jian Yang, Yuelin Bai, Zekun Moore Wang, Zhouliang Yu, Qunshu Lin, Ding Pan, Yuchen Jiang, Tiannan Wang, Wangchunshu Zhou, Shenzhi Wang, Xingyuan Bu, Minghao Liu, Guoyin Wang, Ge Zhang, Chenghua Lin Title: COIG-P: A High-Quality and Large-Scale Chinese Preference Dataset for Alignment with Human Values Arxiv: http://arxiv.org/abs/2504.05535v1 A...
2025-04-10
21 min
Daily Paper Cast
Advances and Challenges in Foundation Agents: From Brain-Inspired Intelligence to Evolutionary, Collaborative, and Safe Systems
🤗 Upvotes: 98 | cs.AI Authors: Bang Liu, Xinfeng Li, Jiayi Zhang, Jinlin Wang, Tanjin He, Sirui Hong, Hongzhang Liu, Shaokun Zhang, Kaitao Song, Kunlun Zhu, Yuheng Cheng, Suyuchen Wang, Xiaoqiang Wang, Yuyu Luo, Haibo Jin, Peiyan Zhang, Ollie Liu, Jiaqi Chen, Huan Zhang, Zhaoyang Yu, Haochen Shi, Boyan Li, Dekun Wu, Fengwei Teng, Xiaojun Jia, Jiawei Xu, Jinyu Xiang, Yizhang Lin, Tianming Liu, Tongliang Liu, Yu Su, Huan Sun, Glen Berseth, Jianyun Nie, Ian Foster, Logan Ward, Qingyun Wu, Yu Gu, Mingchen Zhuge, Xiangru Tang, Haohan Wang, Jiaxuan You, Chi Wang, Jian Pei, Qiang Yang, Xiaoliang Qi, Che...
2025-04-05
20 min
Daily Paper Cast
MergeVQ: A Unified Framework for Visual Generation and Representation with Disentangled Token Merging and Quantization
🤗 Upvotes: 57 | cs.CV, cs.AI Authors: Siyuan Li, Luyuan Zhang, Zedong Wang, Juanxi Tian, Cheng Tan, Zicheng Liu, Chang Yu, Qingsong Xie, Haonan Lu, Haoqian Wang, Zhen Lei Title: MergeVQ: A Unified Framework for Visual Generation and Representation with Disentangled Token Merging and Quantization Arxiv: http://arxiv.org/abs/2504.00999v1 Abstract: Masked Image Modeling (MIM) with Vector Quantization (VQ) has achieved great success in both self-supervised pre-training and image generation. However, most existing methods struggle to address the trade-off in shared latent space for generation quality vs. rep...
2025-04-04
19 min
Daily Paper Cast
EgoLife: Towards Egocentric Life Assistant
🤗 Upvotes: 21 | cs.CV Authors: Jingkang Yang, Shuai Liu, Hongming Guo, Yuhao Dong, Xiamengwei Zhang, Sicheng Zhang, Pengyun Wang, Zitang Zhou, Binzhu Xie, Ziyue Wang, Bei Ouyang, Zhengyu Lin, Marco Cominelli, Zhongang Cai, Yuanhan Zhang, Peiyuan Zhang, Fangzhou Hong, Joerg Widmer, Francesco Gringoli, Lei Yang, Bo Li, Ziwei Liu Title: EgoLife: Towards Egocentric Life Assistant Arxiv: http://arxiv.org/abs/2503.03803v1 Abstract: We introduce EgoLife, a project to develop an egocentric life assistant that accompanies and enhances personal efficiency through AI-powered wearable glasses. To lay the foundation for thi...
2025-03-08
22 min
Daily Paper Cast
Phi-4-Mini Technical Report: Compact yet Powerful Multimodal Language Models via Mixture-of-LoRAs
🤗 Upvotes: 42 | cs.CL, cs.AI, cs.LG Authors: Abdelrahman Abouelenin, Atabak Ashfaq, Adam Atkinson, Hany Awadalla, Nguyen Bach, Jianmin Bao, Alon Benhaim, Martin Cai, Vishrav Chaudhary, Congcong Chen, Dong Chen, Dongdong Chen, Junkun Chen, Weizhu Chen, Yen-Chun Chen, Yi-ling Chen, Qi Dai, Xiyang Dai, Ruchao Fan, Mei Gao, Min Gao, Amit Garg, Abhishek Goswami, Junheng Hao, Amr Hendy, Yuxuan Hu, Xin Jin, Mahmoud Khademi, Dongwoo Kim, Young Jin Kim, Gina Lee, Jinyu Li, Yunsheng Li, Chen Liang, Xihui Lin, Zeqi Lin, Mengchen Liu, Yang Liu, Gilsinia Lopez, Chong Luo, Piyush Madan, Vadim Mazalov, Ali Mousavi, Anh Ngu...
2025-03-05
25 min
Daily Paper Cast
SuperGPQA: Scaling LLM Evaluation across 285 Graduate Disciplines
🤗 Upvotes: 81 | cs.CL Authors: M-A-P Team, Xinrun Du, Yifan Yao, Kaijing Ma, Bingli Wang, Tianyu Zheng, Kang Zhu, Minghao Liu, Yiming Liang, Xiaolong Jin, Zhenlin Wei, Chujie Zheng, Kaixing Deng, Shuyue Guo, Shian Jia, Sichao Jiang, Yiyan Liao, Rui Li, Qinrui Li, Sirun Li, Yizhi Li, Yunwen Li, Dehua Ma, Yuansheng Ni, Haoran Que, Qiyao Wang, Zhoufutu Wen, Siwei Wu, Tianshun Xing, Ming Xu, Zhenzhu Yang, Zekun Moore Wang, Junting Zhou, Yuelin Bai, Xingyuan Bu, Chenglin Cai, Liang Chen, Yifan Chen, Chengtuo Cheng, Tianhao Cheng, Keyi Ding, Siming Huang, Yun Huang, Yaoru Li, Yizhe Li, Zhaoqun Li...
2025-02-22
24 min
Daily Paper Cast
Qwen2.5-VL Technical Report
🤗 Upvotes: 97 | cs.CV, cs.CL Authors: Shuai Bai, Keqin Chen, Xuejing Liu, Jialin Wang, Wenbin Ge, Sibo Song, Kai Dang, Peng Wang, Shijie Wang, Jun Tang, Humen Zhong, Yuanzhi Zhu, Mingkun Yang, Zhaohai Li, Jianqiang Wan, Pengfei Wang, Wei Ding, Zheren Fu, Yiheng Xu, Jiabo Ye, Xi Zhang, Tianbao Xie, Zesen Cheng, Hang Zhang, Zhibo Yang, Haiyang Xu, Junyang Lin Title: Qwen2.5-VL Technical Report Arxiv: http://arxiv.org/abs/2502.13923v1 Abstract: We introduce Qwen2.5-VL, the latest flagship model of Qwen vision-language series, which demonstrates significant adv...
2025-02-21
21 min
Daily Paper Cast
Autellix: An Efficient Serving Engine for LLM Agents as General Programs
🤗 Upvotes: 15 | cs.LG, cs.AI, cs.DC Authors: Michael Luo, Xiaoxiang Shi, Colin Cai, Tianjun Zhang, Justin Wong, Yichuan Wang, Chi Wang, Yanping Huang, Zhifeng Chen, Joseph E. Gonzalez, Ion Stoica Title: Autellix: An Efficient Serving Engine for LLM Agents as General Programs Arxiv: http://arxiv.org/abs/2502.13965v1 Abstract: Large language model (LLM) applications are evolving beyond simple chatbots into dynamic, general-purpose agentic programs, which scale LLM calls and output tokens to help AI agents reason, explore, and solve complex tasks. However, existing LLM serving systems ign...
2025-02-21
22 min
Daily Paper Cast
Native Sparse Attention: Hardware-Aligned and Natively Trainable Sparse Attention
🤗 Upvotes: 68 | cs.CL, cs.AI, cs.LG Authors: Jingyang Yuan, Huazuo Gao, Damai Dai, Junyu Luo, Liang Zhao, Zhengyan Zhang, Zhenda Xie, Y. X. Wei, Lean Wang, Zhiping Xiao, Yuqing Wang, Chong Ruan, Ming Zhang, Wenfeng Liang, Wangding Zeng Title: Native Sparse Attention: Hardware-Aligned and Natively Trainable Sparse Attention Arxiv: http://arxiv.org/abs/2502.11089v1 Abstract: Long-context modeling is crucial for next-generation language models, yet the high computational cost of standard attention mechanisms poses significant computational challenges. Sparse attention offers a promising direction for improving efficiency while mai...
2025-02-19
23 min
Daily Paper Cast
The Danger of Overthinking: Examining the Reasoning-Action Dilemma in Agentic Tasks
🤗 Upvotes: 41 | cs.AI Authors: Alejandro Cuadron, Dacheng Li, Wenjie Ma, Xingyao Wang, Yichuan Wang, Siyuan Zhuang, Shu Liu, Luis Gaspar Schroeder, Tian Xia, Huanzhi Mao, Nicholas Thumiger, Aditya Desai, Ion Stoica, Ana Klimovic, Graham Neubig, Joseph E. Gonzalez Title: The Danger of Overthinking: Examining the Reasoning-Action Dilemma in Agentic Tasks Arxiv: http://arxiv.org/abs/2502.08235v1 Abstract: Large Reasoning Models (LRMs) represent a breakthrough in AI problem-solving capabilities, but their effectiveness in interactive environments can be limited. This paper introduces and analyzes overthinking in LRMs, a phenomenon whe...
2025-02-18
27 min
Daily Paper Cast
Step-Video-T2V Technical Report: The Practice, Challenges, and Future of Video Foundation Model
🤗 Upvotes: 38 | cs.CV, cs.CL Authors: Guoqing Ma, Haoyang Huang, Kun Yan, Liangyu Chen, Nan Duan, Shengming Yin, Changyi Wan, Ranchen Ming, Xiaoniu Song, Xing Chen, Yu Zhou, Deshan Sun, Deyu Zhou, Jian Zhou, Kaijun Tan, Kang An, Mei Chen, Wei Ji, Qiling Wu, Wen Sun, Xin Han, Yanan Wei, Zheng Ge, Aojie Li, Bin Wang, Bizhu Huang, Bo Wang, Brian Li, Changxing Miao, Chen Xu, Chenfei Wu, Chenguang Yu, Dapeng Shi, Dingyuan Hu, Enle Liu, Gang Yu, Ge Yang, Guanzhe Huang, Gulin Yan, Haiyang Feng, Hao Nie, Haonan Jia, Hanpeng Hu, Hanqi Chen, Haolong Yan, Hen...
2025-02-18
23 min
Daily Paper Cast
MM-RLHF: The Next Step Forward in Multimodal LLM Alignment
🤗 Upvotes: 22 | cs.CL, cs.CV Authors: Yi-Fan Zhang, Tao Yu, Haochen Tian, Chaoyou Fu, Peiyan Li, Jianshu Zeng, Wulin Xie, Yang Shi, Huanyu Zhang, Junkang Wu, Xue Wang, Yibo Hu, Bin Wen, Fan Yang, Zhang Zhang, Tingting Gao, Di Zhang, Liang Wang, Rong Jin, Tieniu Tan Title: MM-RLHF: The Next Step Forward in Multimodal LLM Alignment Arxiv: http://arxiv.org/abs/2502.10391v1 Abstract: Despite notable advancements in Multimodal Large Language Models (MLLMs), most state-of-the-art models have not undergone thorough alignment with human preferences. This gap exists because cur...
2025-02-18
23 min
Daily Paper Cast
EmbodiedBench: Comprehensive Benchmarking Multi-modal Large Language Models for Vision-Driven Embodied Agents
🤗 Upvotes: 20 | cs.AI, cs.CL, cs.CV Authors: Rui Yang, Hanyang Chen, Junyu Zhang, Mark Zhao, Cheng Qian, Kangrui Wang, Qineng Wang, Teja Venkat Koripella, Marziyeh Movahedi, Manling Li, Heng Ji, Huan Zhang, Tong Zhang Title: EmbodiedBench: Comprehensive Benchmarking Multi-modal Large Language Models for Vision-Driven Embodied Agents Arxiv: http://arxiv.org/abs/2502.09560v1 Abstract: Leveraging Multi-modal Large Language Models (MLLMs) to create embodied agents offers a promising avenue for tackling real-world tasks. While language-centric embodied agents have garnered substantial attention, MLLM-based embodied agents remain underexplored due to the...
2025-02-15
21 min
Daily Paper Cast
Exploring the Potential of Encoder-free Architectures in 3D LMMs
🤗 Upvotes: 17 | cs.CV, cs.AI, cs.CL Authors: Yiwen Tang, Zoey Guo, Zhuhao Wang, Ray Zhang, Qizhi Chen, Junli Liu, Delin Qu, Zhigang Wang, Dong Wang, Xuelong Li, Bin Zhao Title: Exploring the Potential of Encoder-free Architectures in 3D LMMs Arxiv: http://arxiv.org/abs/2502.09620v1 Abstract: Encoder-free architectures have been preliminarily explored in the 2D visual domain, yet it remains an open question whether they can be effectively applied to 3D understanding scenarios. In this paper, we present the first comprehensive investigation into the potential of enc...
2025-02-15
18 min
Daily Paper Cast
CoSER: Coordinating LLM-Based Persona Simulation of Established Roles
🤗 Upvotes: 16 | cs.CL, cs.AI Authors: Xintao Wang, Heng Wang, Yifei Zhang, Xinfeng Yuan, Rui Xu, Jen-tse Huang, Siyu Yuan, Haoran Guo, Jiangjie Chen, Wei Wang, Yanghua Xiao, Shuchang Zhou Title: CoSER: Coordinating LLM-Based Persona Simulation of Established Roles Arxiv: http://arxiv.org/abs/2502.09082v1 Abstract: Role-playing language agents (RPLAs) have emerged as promising applications of large language models (LLMs). However, simulating established characters presents a challenging task for RPLAs, due to the lack of authentic character datasets and nuanced evaluation methods using such data. In this pap...
2025-02-15
22 min
Daily Paper Cast
TextAtlas5M: A Large-scale Dataset for Dense Text Image Generation
🤗 Upvotes: 35 | cs.CV Authors: Alex Jinpeng Wang, Dongxing Mao, Jiawei Zhang, Weiming Han, Zhuobai Dong, Linjie Li, Yiqi Lin, Zhengyuan Yang, Libo Qin, Fuwei Zhang, Lijuan Wang, Min Li Title: TextAtlas5M: A Large-scale Dataset for Dense Text Image Generation Arxiv: http://arxiv.org/abs/2502.07870v1 Abstract: Text-conditioned image generation has gained significant attention in recent years and is processing increasingly long and comprehensive text prompts. In everyday life, dense and intricate text appears in contexts like advertisements, infographics, and signage, where the integration of both text and...
2025-02-14
18 min
Daily Paper Cast
CineMaster: A 3D-Aware and Controllable Framework for Cinematic Text-to-Video Generation
🤗 Upvotes: 29 | cs.CV Authors: Qinghe Wang, Yawen Luo, Xiaoyu Shi, Xu Jia, Huchuan Lu, Tianfan Xue, Xintao Wang, Pengfei Wan, Di Zhang, Kun Gai Title: CineMaster: A 3D-Aware and Controllable Framework for Cinematic Text-to-Video Generation Arxiv: http://arxiv.org/abs/2502.08639v1 Abstract: In this work, we present CineMaster, a novel framework for 3D-aware and controllable text-to-video generation. Our goal is to empower users with controllability comparable to that of professional film directors: precise placement of objects within the scene, flexible manipulation of both objects and camera in 3D space, and...
2025-02-14
23 min
Daily Paper Cast
Enhance-A-Video: Better Generated Video for Free
🤗 Upvotes: 14 | cs.CV Authors: Yang Luo, Xuanlei Zhao, Mengzhao Chen, Kaipeng Zhang, Wenqi Shao, Kai Wang, Zhangyang Wang, Yang You Title: Enhance-A-Video: Better Generated Video for Free Arxiv: http://arxiv.org/abs/2502.07508v1 Abstract: DiT-based video generation has achieved remarkable results, but research into enhancing existing models remains relatively unexplored. In this work, we introduce a training-free approach to enhance the coherence and quality of DiT-based generated videos, named Enhance-A-Video. The core idea is enhancing the cross-frame correlations based on non-diagonal temporal attention distributions. Thanks to its sim...
2025-02-13
20 min
Daily Paper Cast
ScoreFlow: Mastering LLM Agent Workflows via Score-based Preference Optimization
🤗 Upvotes: 12 | cs.CL Authors: Yinjie Wang, Ling Yang, Guohao Li, Mengdi Wang, Bryon Aragam Title: ScoreFlow: Mastering LLM Agent Workflows via Score-based Preference Optimization Arxiv: http://arxiv.org/abs/2502.04306v1 Abstract: Recent research has leveraged large language model multi-agent systems for complex problem-solving while trying to reduce the manual effort required to build them, driving the development of automated agent workflow optimization methods. However, existing methods remain inflexible due to representational limitations, a lack of adaptability, and poor scalability when relying on discrete optimization techniques. We address the...
2025-02-08
20 min
Daily Paper Cast
Process Reinforcement through Implicit Rewards
🤗 Upvotes: 44 | cs.LG, cs.AI, cs.CL Authors: Ganqu Cui, Lifan Yuan, Zefan Wang, Hanbin Wang, Wendi Li, Bingxiang He, Yuchen Fan, Tianyu Yu, Qixin Xu, Weize Chen, Jiarui Yuan, Huayu Chen, Kaiyan Zhang, Xingtai Lv, Shuo Wang, Yuan Yao, Xu Han, Hao Peng, Yu Cheng, Zhiyuan Liu, Maosong Sun, Bowen Zhou, Ning Ding Title: Process Reinforcement through Implicit Rewards Arxiv: http://arxiv.org/abs/2502.01456v1 Abstract: Dense process rewards have proven a more effective alternative to the sparse outcome-level rewards in the inference-time scaling of large lan...
2025-02-05
21 min
Daily Paper Cast
AlignVLM: Bridging Vision and Language Latent Spaces for Multimodal Understanding
🤗 Upvotes: 25 | cs.CL Authors: Ahmed Masry, Juan A. Rodriguez, Tianyu Zhang, Suyuchen Wang, Chao Wang, Aarash Feizi, Akshay Kalkunte Suresh, Abhay Puri, Xiangru Jian, Pierre-André Noël, Sathwik Tejaswi Madhusudhan, Marco Pedersoli, Bang Liu, Nicolas Chapados, Yoshua Bengio, Enamul Hoque, Christopher Pal, Issam H. Laradji, David Vazquez, Perouz Taslakian, Spandana Gella, Sai Rajeswar Title: AlignVLM: Bridging Vision and Language Latent Spaces for Multimodal Understanding Arxiv: http://arxiv.org/abs/2502.01341v1 Abstract: Aligning visual features with language embeddings is a key challenge in vision-language models (VLMs). The performance of su...
2025-02-05
23 min
Daily Paper Cast
SafeRAG: Benchmarking Security in Retrieval-Augmented Generation of Large Language Model
🤗 Upvotes: 25 | cs.CR, cs.AI, cs.IR Authors: Xun Liang, Simin Niu, Zhiyu Li, Sensen Zhang, Hanyu Wang, Feiyu Xiong, Jason Zhaoxin Fan, Bo Tang, Shichao Song, Mengwei Wang, Jiawei Yang Title: SafeRAG: Benchmarking Security in Retrieval-Augmented Generation of Large Language Model Arxiv: http://arxiv.org/abs/2501.18636v1 Abstract: The indexing-retrieval-generation paradigm of retrieval-augmented generation (RAG) has been highly successful in solving knowledge-intensive tasks by integrating external knowledge into large language models (LLMs). However, the incorporation of external and unverified knowledge increases the vulnerability of LLMs because att...
2025-02-05
23 min
Daily Paper Cast
Thoughts Are All Over the Place: On the Underthinking of o1-Like LLMs
🤗 Upvotes: 22 | cs.CL Authors: Yue Wang, Qiuzhi Liu, Jiahao Xu, Tian Liang, Xingyu Chen, Zhiwei He, Linfeng Song, Dian Yu, Juntao Li, Zhuosheng Zhang, Rui Wang, Zhaopeng Tu, Haitao Mi, Dong Yu Title: Thoughts Are All Over the Place: On the Underthinking of o1-Like LLMs Arxiv: http://arxiv.org/abs/2501.18585v1 Abstract: Large language models (LLMs) such as OpenAI's o1 have demonstrated remarkable abilities in complex reasoning tasks by scaling test-time compute and exhibiting human-like deep thinking. However, we identify a phenomenon we term underthinking, where o1...
2025-02-01
23 min
Daily Paper Cast
Emilia: A Large-Scale, Extensive, Multilingual, and Diverse Dataset for Speech Generation
🤗 Upvotes: 11 | cs.SD, cs.CL, eess.AS Authors: Haorui He, Zengqiang Shang, Chaoren Wang, Xuyuan Li, Yicheng Gu, Hua Hua, Liwei Liu, Chen Yang, Jiaqi Li, Peiyang Shi, Yuancheng Wang, Kai Chen, Pengyuan Zhang, Zhizheng Wu Title: Emilia: A Large-Scale, Extensive, Multilingual, and Diverse Dataset for Speech Generation Arxiv: http://arxiv.org/abs/2501.15907v1 Abstract: Recent advancements in speech generation have been driven by the large-scale training datasets. However, current models fall short of capturing the spontaneity and variability inherent in real-world human speech, due to their rel...
2025-01-29
22 min
Daily Paper Cast
Humanity's Last Exam
🤗 Upvotes: 33 | cs.LG, cs.AI, cs.CL Authors: Long Phan, Alice Gatti, Ziwen Han, Nathaniel Li, Josephina Hu, Hugh Zhang, Sean Shi, Michael Choi, Anish Agrawal, Arnav Chopra, Adam Khoja, Ryan Kim, Jason Hausenloy, Oliver Zhang, Mantas Mazeika, Daron Anderson, Tung Nguyen, Mobeen Mahmood, Fiona Feng, Steven Y. Feng, Haoran Zhao, Michael Yu, Varun Gangal, Chelsea Zou, Zihan Wang, Jessica P. Wang, Pawan Kumar, Oleksandr Pokutnyi, Robert Gerbicz, Serguei Popov, John-Clark Levin, Mstyslav Kazakov, Johannes Schmitt, Geoff Galgon, Alvaro Sanchez, Yongki Lee, Will Yeadon, Scott Sauers, Marc Roth, Chidozie Agu, Søren Riis, Fabian Giska, Saiteja Utpa...
2025-01-28
22 min
Daily Paper Cast
Improving Video Generation with Human Feedback
🤗 Upvotes: 30 | cs.CV, cs.AI, cs.GR, cs.LG Authors: Jie Liu, Gongye Liu, Jiajun Liang, Ziyang Yuan, Xiaokun Liu, Mingwu Zheng, Xiele Wu, Qiulin Wang, Wenyu Qin, Menghan Xia, Xintao Wang, Xiaohong Liu, Fei Yang, Pengfei Wan, Di Zhang, Kun Gai, Yujiu Yang, Wanli Ouyang Title: Improving Video Generation with Human Feedback Arxiv: http://arxiv.org/abs/2501.13918v1 Abstract: Video generation has achieved significant advances through rectified flow techniques, but issues like unsmooth motion and misalignment between videos and prompts persist. In this work, we develop a s...
2025-01-25
24 min
Daily Paper Cast
Temporal Preference Optimization for Long-Form Video Understanding
🤗 Upvotes: 15 | cs.CV, cs.AI, cs.CL, cs.LG, cs.RO Authors: Rui Li, Xiaohan Wang, Yuhui Zhang, Zeyu Wang, Serena Yeung-Levy Title: Temporal Preference Optimization for Long-Form Video Understanding Arxiv: http://arxiv.org/abs/2501.13919v1 Abstract: Despite significant advancements in video large multimodal models (video-LMMs), achieving effective temporal grounding in long-form videos remains a challenge for existing models. To address this limitation, we propose Temporal Preference Optimization (TPO), a novel post-training framework designed to enhance the temporal grounding capabilities of video-LMMs through preference learning. TPO adopts a s...
2025-01-25
24 min
Daily Paper Cast
One-Prompt-One-Story: Free-Lunch Consistent Text-to-Image Generation Using a Single Prompt
🤗 Upvotes: 5 | cs.CV, cs.AI, cs.LG Authors: Tao Liu, Kai Wang, Senmao Li, Joost van de Weijer, Fahad Shahbaz Khan, Shiqi Yang, Yaxing Wang, Jian Yang, Ming-Ming Cheng Title: One-Prompt-One-Story: Free-Lunch Consistent Text-to-Image Generation Using a Single Prompt Arxiv: http://arxiv.org/abs/2501.13554v1 Abstract: Text-to-image generation models can create high-quality images from input prompts. However, they struggle to support the consistent generation of identity-preserving requirements for storytelling. Existing approaches to this problem typically require extensive training on large datasets or additional modifications to the original mod...
2025-01-25
22 min
Daily Paper Cast
DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning
🤗 Upvotes: 109 | cs.CL, cs.AI, cs.LG Authors: DeepSeek-AI, Daya Guo, Dejian Yang, Haowei Zhang, Junxiao Song, Ruoyu Zhang, Runxin Xu, Qihao Zhu, Shirong Ma, Peiyi Wang, Xiao Bi, Xiaokang Zhang, Xingkai Yu, Yu Wu, Z. F. Wu, Zhibin Gou, Zhihong Shao, Zhuoshu Li, Ziyi Gao, Aixin Liu, Bing Xue, Bingxuan Wang, Bochao Wu, Bei Feng, Chengda Lu, Chenggang Zhao, Chengqi Deng, Chenyu Zhang, Chong Ruan, Damai Dai, Deli Chen, Dongjie Ji, Erhang Li, Fangyun Lin, Fucong Dai, Fuli Luo, Guangbo Hao, Guanting Chen, Guowei Li, H. Zhang, Han Bao, Hanwei Xu, Haocheng Wang, Honghui Ding, Hua...
2025-01-24
21 min
Daily Paper Cast
FilmAgent: A Multi-Agent Framework for End-to-End Film Automation in Virtual 3D Spaces
🤗 Upvotes: 43 | cs.CL, cs.GR, cs.MA Authors: Zhenran Xu, Longyue Wang, Jifang Wang, Zhouyi Li, Senbao Shi, Xue Yang, Yiyu Wang, Baotian Hu, Jun Yu, Min Zhang Title: FilmAgent: A Multi-Agent Framework for End-to-End Film Automation in Virtual 3D Spaces Arxiv: http://arxiv.org/abs/2501.12909v1 Abstract: Virtual film production requires intricate decision-making processes, including scriptwriting, virtual cinematography, and precise actor positioning and actions. Motivated by recent advances in automated decision-making with language agent-based societies, this paper introduces FilmAgent, a novel LLM-based multi-agent collaborative framework for end...
2025-01-24
24 min
Daily Paper Cast
Kimi k1.5: Scaling Reinforcement Learning with LLMs
🤗 Upvotes: 39 | cs.AI, cs.LG Authors: Kimi Team, Angang Du, Bofei Gao, Bowei Xing, Changjiu Jiang, Cheng Chen, Cheng Li, Chenjun Xiao, Chenzhuang Du, Chonghua Liao, Chuning Tang, Congcong Wang, Dehao Zhang, Enming Yuan, Enzhe Lu, Fengxiang Tang, Flood Sung, Guangda Wei, Guokun Lai, Haiqing Guo, Han Zhu, Hao Ding, Hao Hu, Hao Yang, Hao Zhang, Haotian Yao, Haotian Zhao, Haoyu Lu, Haoze Li, Haozhen Yu, Hongcheng Gao, Huabin Zheng, Huan Yuan, Jia Chen, Jianhang Guo, Jianlin Su, Jianzhou Wang, Jie Zhao, Jin Zhang, Jingyuan Liu, Junjie Yan, Junyan Wu, Lidong Shi, Ling Ye, Longhui Yu, Men...
2025-01-24
18 min
Daily Paper Cast
UI-TARS: Pioneering Automated GUI Interaction with Native Agents
🤗 Upvotes: 31 | cs.AI, cs.CL, cs.CV, cs.HC Authors: Yujia Qin, Yining Ye, Junjie Fang, Haoming Wang, Shihao Liang, Shizuo Tian, Junda Zhang, Jiahao Li, Yunxin Li, Shijue Huang, Wanjun Zhong, Kuanye Li, Jiale Yang, Yu Miao, Woyu Lin, Longxiang Liu, Xu Jiang, Qianli Ma, Jingyu Li, Xiaojun Xiao, Kai Cai, Chuang Li, Yaowei Zheng, Chaolin Jin, Chen Li, Xiao Zhou, Minchao Wang, Haoli Chen, Zhaojian Li, Haihua Yang, Haifeng Liu, Feng Lin, Tao Peng, Xin Liu, Guang Shi Title: UI-TARS: Pioneering Automated GUI Interaction with Native Agents Arxiv: htt...
2025-01-23
20 min
Daily Paper Cast
Mobile-Agent-E: Self-Evolving Mobile Assistant for Complex Tasks
🤗 Upvotes: 20 | cs.CL, cs.CV Authors: Zhenhailong Wang, Haiyang Xu, Junyang Wang, Xi Zhang, Ming Yan, Ji Zhang, Fei Huang, Heng Ji Title: Mobile-Agent-E: Self-Evolving Mobile Assistant for Complex Tasks Arxiv: http://arxiv.org/abs/2501.11733v1 Abstract: Smartphones have become indispensable in modern life, yet navigating complex tasks on mobile devices often remains frustrating. Recent advancements in large multimodal model (LMM)-based mobile agents have demonstrated the ability to perceive and act in mobile environments. However, current approaches face significant limitations: they fall short in addressing real-world hum...
2025-01-23
23 min
Daily Paper Cast
Hunyuan3D 2.0: Scaling Diffusion Models for High Resolution Textured 3D Assets Generation
🤗 Upvotes: 16 | cs.CV Authors: Zibo Zhao, Zeqiang Lai, Qingxiang Lin, Yunfei Zhao, Haolin Liu, Shuhui Yang, Yifei Feng, Mingxin Yang, Sheng Zhang, Xianghui Yang, Huiwen Shi, Sicong Liu, Junta Wu, Yihang Lian, Fan Yang, Ruining Tang, Zebin He, Xinzhou Wang, Jian Liu, Xuhui Zuo, Zhuo Chen, Biwen Lei, Haohan Weng, Jing Xu, Yiling Zhu, Xinhai Liu, Lixin Xu, Changrong Hu, Tianyu Huang, Lifu Wang, Jihong Zhang, Meng Chen, Liang Dong, Yiwen Jia, Yulin Cai, Jiaao Yu, Yixuan Tang, Hao Zhang, Zheng Ye, Peng He, Runzhou Wu, Chao Zhang, Yonghao Tan, Jie Xiao, Yangyu Tao, Jianchen Zhu, Jin...
2025-01-23
20 min
Daily Paper Cast
MinMo: A Multimodal Large Language Model for Seamless Voice Interaction
🤗 Upvotes: 21 | cs.CL, cs.AI, cs.HC, cs.SD, eess.AS Authors: Qian Chen, Yafeng Chen, Yanni Chen, Mengzhe Chen, Yingda Chen, Chong Deng, Zhihao Du, Ruize Gao, Changfeng Gao, Zhifu Gao, Yabin Li, Xiang Lv, Jiaqing Liu, Haoneng Luo, Bin Ma, Chongjia Ni, Xian Shi, Jialong Tang, Hui Wang, Hao Wang, Wen Wang, Yuxuan Wang, Yunlan Xu, Fan Yu, Zhijie Yan, Yexin Yang, Baosong Yang, Xian Yang, Guanrou Yang, Tianyu Zhao, Qinglin Zhang, Shiliang Zhang, Nan Zhao, Pei Zhang, Chong Zhang, Jinren Zhou Title: MinMo: A Multimodal Large Language Model for Seamless Voi...
2025-01-15
23 min
Daily Paper Cast
ConceptMaster: Multi-Concept Video Customization on Diffusion Transformer Models Without Test-Time Tuning
🤗 Upvotes: 10 | cs.CV Authors: Yuzhou Huang, Ziyang Yuan, Quande Liu, Qiulin Wang, Xintao Wang, Ruimao Zhang, Pengfei Wan, Di Zhang, Kun Gai Title: ConceptMaster: Multi-Concept Video Customization on Diffusion Transformer Models Without Test-Time Tuning Arxiv: http://arxiv.org/abs/2501.04698v1 Abstract: Text-to-video generation has made remarkable advancements through diffusion models. However, Multi-Concept Video Customization (MCVC) remains a significant challenge. We identify two key challenges in this task: 1) the identity decoupling problem, where directly adopting existing customization methods inevitably mix attributes when handling multiple concepts simultaneously, and 2) the sca...
2025-01-14
23 min
Daily Paper Cast
MotionBench: Benchmarking and Improving Fine-grained Video Motion Understanding for Vision Language Models
🤗 Upvotes: 32 | cs.CV Authors: Wenyi Hong, Yean Cheng, Zhuoyi Yang, Weihan Wang, Lefan Wang, Xiaotao Gu, Shiyu Huang, Yuxiao Dong, Jie Tang Title: MotionBench: Benchmarking and Improving Fine-grained Video Motion Understanding for Vision Language Models Arxiv: http://arxiv.org/abs/2501.02955v1 Abstract: In recent years, vision language models (VLMs) have made significant advancements in video understanding. However, a crucial capability - fine-grained motion comprehension - remains under-explored in current benchmarks. To address this gap, we propose MotionBench, a comprehensive evaluation benchmark designed to assess the fine-grained motion com...
2025-01-09
22 min
Daily Paper Cast
Cosmos World Foundation Model Platform for Physical AI
🤗 Upvotes: 31 | cs.CV, cs.AI, cs.LG, cs.RO Authors: NVIDIA, Niket Agarwal, Arslan Ali, Maciej Bala, Yogesh Balaji, Erik Barker, Tiffany Cai, Prithvijit Chattopadhyay, Yongxin Chen, Yin Cui, Yifan Ding, Daniel Dworakowski, Jiaojiao Fan, Michele Fenzi, Francesco Ferroni, Sanja Fidler, Dieter Fox, Songwei Ge, Yunhao Ge, Jinwei Gu, Siddharth Gururani, Ethan He, Jiahui Huang, Jacob Huffman, Pooya Jannaty, Jingyi Jin, Seung Wook Kim, Gergely Klár, Grace Lam, Shiyi Lan, Laura Leal-Taixe, Anqi Li, Zhaoshuo Li, Chen-Hsuan Lin, Tsung-Yi Lin, Huan Ling, Ming-Yu Liu, Xian Liu, Alice Luo, Qianli Ma, Hanzi Mao, Kaichun Mo, Arsalan Mous...
2025-01-09
25 min
Daily Paper Cast
TransPixar: Advancing Text-to-Video Generation with Transparency
🤗 Upvotes: 9 | cs.CV Authors: Luozhou Wang, Yijun Li, Zhifei Chen, Jui-Hsien Wang, Zhifei Zhang, He Zhang, Zhe Lin, Yingcong Chen Title: TransPixar: Advancing Text-to-Video Generation with Transparency Arxiv: http://arxiv.org/abs/2501.03006v1 Abstract: Text-to-video generative models have made significant strides, enabling diverse applications in entertainment, advertising, and education. However, generating RGBA video, which includes alpha channels for transparency, remains a challenge due to limited datasets and the difficulty of adapting existing models. Alpha channels are crucial for visual effects (VFX), allowing transparent elements like smoke and ref...
2025-01-08
22 min
Daily Paper Cast
Virgo: A Preliminary Exploration on Reproducing o1-like MLLM
🤗 Upvotes: 12 | cs.CV, cs.AI Authors: Yifan Du, Zikang Liu, Yifan Li, Wayne Xin Zhao, Yuqi Huo, Bingning Wang, Weipeng Chen, Zheng Liu, Zhongyuan Wang, Ji-Rong Wen Title: Virgo: A Preliminary Exploration on Reproducing o1-like MLLM Arxiv: http://arxiv.org/abs/2501.01904v1 Abstract: Recently, slow-thinking reasoning systems, built upon large language models (LLMs), have garnered widespread attention by scaling the thinking time during inference. There is also growing interest in adapting this capability to multimodal large language models (MLLMs). Given that MLLMs handle more complex data sem...
2025-01-07
22 min
Daily Paper Cast
On the Compositional Generalization of Multimodal LLMs for Medical Imaging
🤗 Upvotes: 29 | cs.CV, cs.AI, cs.CL, cs.LG Authors: Zhenyang Cai, Junying Chen, Rongsheng Wang, Weihong Wang, Yonglin Deng, Dingjie Song, Yize Chen, Zixu Zhang, Benyou Wang Title: On the Compositional Generalization of Multimodal LLMs for Medical Imaging Arxiv: http://arxiv.org/abs/2412.20070v1 Abstract: Multimodal large language models (MLLMs) hold significant potential in the medical field, but their capabilities are often limited by insufficient data in certain medical domains, highlighting the need for understanding what kinds of images can be used by MLLMs for generalization. Cur...
2025-01-01
22 min
Daily Paper Cast
HuatuoGPT-o1, Towards Medical Complex Reasoning with LLMs
🤗 Upvotes: 53 | cs.CL, cs.AI, cs.LG Authors: Junying Chen, Zhenyang Cai, Ke Ji, Xidong Wang, Wanlong Liu, Rongsheng Wang, Jianye Hou, Benyou Wang Title: HuatuoGPT-o1, Towards Medical Complex Reasoning with LLMs Arxiv: http://arxiv.org/abs/2412.18925v1 Abstract: The breakthrough of OpenAI o1 highlights the potential of enhancing reasoning to improve LLM. Yet, most research in reasoning has focused on mathematical tasks, leaving domains like medicine underexplored. The medical domain, though distinct from mathematics, also demands robust reasoning to provide reliable answers, given the high standards of...
2024-12-31
23 min
Daily Paper Cast
Task Preference Optimization: Improving Multimodal Large Language Models with Vision Task Alignment
🤗 Upvotes: 11 | cs.CV Authors: Ziang Yan, Zhilin Li, Yinan He, Chenting Wang, Kunchang Li, Xinhao Li, Xiangyu Zeng, Zilei Wang, Yali Wang, Yu Qiao, Limin Wang, Yi Wang Title: Task Preference Optimization: Improving Multimodal Large Language Models with Vision Task Alignment Arxiv: http://arxiv.org/abs/2412.19326v1 Abstract: Current multimodal large language models (MLLMs) struggle with fine-grained or precise understanding of visuals though they give comprehensive perception and reasoning in a spectrum of vision applications. Recent studies either develop tool-using or unify specific visual tasks into the aut...
2024-12-31
25 min
Daily Paper Cast
DepthLab: From Partial to Complete
🤗 Upvotes: 21 | cs.CV Authors: Zhiheng Liu, Ka Leong Cheng, Qiuyu Wang, Shuzhe Wang, Hao Ouyang, Bin Tan, Kai Zhu, Yujun Shen, Qifeng Chen, Ping Luo Title: DepthLab: From Partial to Complete Arxiv: http://arxiv.org/abs/2412.18153v1 Abstract: Missing values remain a common challenge for depth data across its wide range of applications, stemming from various causes like incomplete data acquisition and perspective alteration. This work bridges this gap with DepthLab, a foundation depth inpainting model powered by image diffusion priors. Our model features two notable strengths: (1) it...
2024-12-26
22 min
Daily Paper Cast
MotiF: Making Text Count in Image Animation with Motion Focal Loss
🤗 Upvotes: 3 | cs.CV, cs.AI Authors: Shijie Wang, Samaneh Azadi, Rohit Girdhar, Saketh Rambhatla, Chen Sun, Xi Yin Title: MotiF: Making Text Count in Image Animation with Motion Focal Loss Arxiv: http://arxiv.org/abs/2412.16153v1 Abstract: Text-Image-to-Video (TI2V) generation aims to generate a video from an image following a text description, which is also referred to as text-guided image animation. Most existing methods struggle to generate videos that align well with the text prompts, particularly when motion is specified. To overcome this limitation, we introduce Mot...
2024-12-26
22 min
Daily Paper Cast
OpenAI o1 System Card
🤗 Upvotes: 12 | cs.AI Authors: OpenAI, Aaron Jaech, Adam Kalai, Adam Lerer, Adam Richardson, Ahmed El-Kishky, Aiden Low, Alec Helyar, Aleksander Madry, Alex Beutel, Alex Carney, Alex Iftimie, Alex Karpenko, Alex Tachard Passos, Alexander Neitz, Alexander Prokofiev, Alexander Wei, Allison Tam, Ally Bennett, Ananya Kumar, Andre Saraiva, Andrea Vallone, Andrew Duberstein, Andrew Kondrich, Andrey Mishchenko, Andy Applebaum, Angela Jiang, Ashvin Nair, Barret Zoph, Behrooz Ghorbani, Ben Rossen, Benjamin Sokolowsky, Boaz Barak, Bob McGrew, Borys Minaiev, Botao Hao, Bowen Baker, Brandon Houghton, Brandon McKinzie, Brydon Eastman, Camillo Lugaresi, Cary Bassin, Cary Hudson, Chak Ming Li, Charles de Bourcy, Che...
2024-12-25
25 min
Daily Paper Cast
Outcome-Refining Process Supervision for Code Generation
🤗 Upvotes: 11 | cs.CL, cs.AI, cs.LG, cs.SE Authors: Zhuohao Yu, Weizheng Gu, Yidong Wang, Zhengran Zeng, Jindong Wang, Wei Ye, Shikun Zhang Title: Outcome-Refining Process Supervision for Code Generation Arxiv: http://arxiv.org/abs/2412.15118v1 Abstract: Large Language Models have demonstrated remarkable capabilities in code generation, yet they often struggle with complex programming tasks that require deep algorithmic reasoning. While process supervision through learned reward models shows promise in guiding reasoning steps, it requires expensive training data and suffers from unreliable evaluation. We propose Outcome-Refining Pro...
2024-12-25
21 min
Daily Paper Cast
LearnLM: Improving Gemini for Learning
🤗 Upvotes: 9 | cs.CY, cs.AI, cs.LG Authors: LearnLM Team, Abhinit Modi, Aditya Srikanth Veerubhotla, Aliya Rysbek, Andrea Huber, Brett Wiltshire, Brian Veprek, Daniel Gillick, Daniel Kasenberg, Derek Ahmed, Irina Jurenka, James Cohan, Jennifer She, Julia Wilkowski, Kaiz Alarakyia, Kevin McKee, Lisa Wang, Markus Kunesch, Mike Schaekermann, Miruna Pîslar, Nikhil Joshi, Parsa Mahmoudieh, Paul Jhun, Sara Wiltberger, Shakir Mohamed, Shashank Agarwal, Shubham Milind Phal, Sun Jae Lee, Theofilos Strinopoulos, Wei-Jen Ko, Amy Wang, Ankit Anand, Avishkar Bhoopchand, Dan Wild, Divya Pandya, Filip Bar, Garth Graham, Holger Winnemoeller, Mahvish Nagda, Prateek Kolhar, Renee Schneider, Shaojian Zhu, Step...
2024-12-25
27 min
Daily Paper Cast
LeviTor: 3D Trajectory Oriented Image-to-Video Synthesis
🤗 Upvotes: 12 | cs.CV Authors: Hanlin Wang, Hao Ouyang, Qiuyu Wang, Wen Wang, Ka Leong Cheng, Qifeng Chen, Yujun Shen, Limin Wang Title: LeviTor: 3D Trajectory Oriented Image-to-Video Synthesis Arxiv: http://arxiv.org/abs/2412.15214v1 Abstract: The intuitive nature of drag-based interaction has led to its growing adoption for controlling object trajectories in image-to-video synthesis. Still, existing methods that perform dragging in the 2D space usually face ambiguity when handling out-of-plane movements. In this work, we augment the interaction with a new dimension, i.e., the depth dimension, suc...
2024-12-21
21 min
Daily Paper Cast
AniDoc: Animation Creation Made Easier
🤗 Upvotes: 29 | cs.CV Authors: Yihao Meng, Hao Ouyang, Hanlin Wang, Qiuyu Wang, Wen Wang, Ka Leong Cheng, Zhiheng Liu, Yujun Shen, Huamin Qu Title: AniDoc: Animation Creation Made Easier Arxiv: http://arxiv.org/abs/2412.14173v1 Abstract: The production of 2D animation follows an industry-standard workflow, encompassing four essential stages: character design, keyframe animation, in-betweening, and coloring. Our research focuses on reducing the labor costs in the above process by harnessing the potential of increasingly powerful generative AI. Using video diffusion models as the foundation, AniDoc emerges as a v...
2024-12-20
22 min
Daily Paper Cast
SPaR: Self-Play with Tree-Search Refinement to Improve Instruction-Following in Large Language Models
🤗 Upvotes: 11 | cs.CL, cs.AI, cs.LG Authors: Jiale Cheng, Xiao Liu, Cunxiang Wang, Xiaotao Gu, Yida Lu, Dan Zhang, Yuxiao Dong, Jie Tang, Hongning Wang, Minlie Huang Title: SPaR: Self-Play with Tree-Search Refinement to Improve Instruction-Following in Large Language Models Arxiv: http://arxiv.org/abs/2412.11605v1 Abstract: Instruction-following is a fundamental capability of language models, requiring the model to recognize even the most subtle requirements in the instructions and accurately reflect them in its output. Such an ability is well-suited for and often optimized by preference lea...
2024-12-18
23 min
Daily Paper Cast
Apollo: An Exploration of Video Understanding in Large Multimodal Models
🤗 Upvotes: 91 | cs.CV, cs.AI Authors: Orr Zohar, Xiaohan Wang, Yann Dubois, Nikhil Mehta, Tong Xiao, Philippe Hansen-Estruch, Licheng Yu, Xiaofang Wang, Felix Juefei-Xu, Ning Zhang, Serena Yeung-Levy, Xide Xia Title: Apollo: An Exploration of Video Understanding in Large Multimodal Models Arxiv: http://arxiv.org/abs/2412.10360v1 Abstract: Despite the rapid integration of video perception capabilities into Large Multimodal Models (LMMs), the underlying mechanisms driving their video understanding remain poorly understood. Consequently, many design decisions in this domain are made without proper justification or analysis. The high com...
2024-12-17
24 min
Daily Paper Cast
SynerGen-VL: Towards Synergistic Image Understanding and Generation with Vision Experts and Token Folding
🤗 Upvotes: 29 | cs.CV Authors: Hao Li, Changyao Tian, Jie Shao, Xizhou Zhu, Zhaokai Wang, Jinguo Zhu, Wenhan Dou, Xiaogang Wang, Hongsheng Li, Lewei Lu, Jifeng Dai Title: SynerGen-VL: Towards Synergistic Image Understanding and Generation with Vision Experts and Token Folding Arxiv: http://arxiv.org/abs/2412.09604v1 Abstract: The remarkable success of Large Language Models (LLMs) has extended to the multimodal domain, achieving outstanding performance in image understanding and generation. Recent efforts to develop unified Multimodal Large Language Models (MLLMs) that integrate these capabilities have shown promising results. How...
2024-12-17
25 min
Daily Paper Cast
Multimodal Latent Language Modeling with Next-Token Diffusion
🤗 Upvotes: 21 | cs.CL, cs.CV, cs.LG Authors: Yutao Sun, Hangbo Bao, Wenhui Wang, Zhiliang Peng, Li Dong, Shaohan Huang, Jianyong Wang, Furu Wei Title: Multimodal Latent Language Modeling with Next-Token Diffusion Arxiv: http://arxiv.org/abs/2412.08635v1 Abstract: Multimodal generative models require a unified approach to handle both discrete data (e.g., text and code) and continuous data (e.g., image, audio, video). In this work, we propose Latent Language Modeling (LatentLM), which seamlessly integrates continuous and discrete data using causal Transformers. Specifically, we employ a var...
2024-12-14
22 min
Daily Paper Cast
AgentTrek: Agent Trajectory Synthesis via Guiding Replay with Web Tutorials
🤗 Upvotes: 16 | cs.CL Authors: Yiheng Xu, Dunjie Lu, Zhennan Shen, Junli Wang, Zekun Wang, Yuchen Mao, Caiming Xiong, Tao Yu Title: AgentTrek: Agent Trajectory Synthesis via Guiding Replay with Web Tutorials Arxiv: http://arxiv.org/abs/2412.09605v1 Abstract: Graphical User Interface (GUI) agents hold great potential for automating complex tasks across diverse digital environments, from web applications to desktop software. However, the development of such agents is hindered by the lack of high-quality, multi-step trajectory data required for effective training. Existing approaches rely on expensive and labor-intensive hum...
2024-12-14
18 min
Daily Paper Cast
LiFT: Leveraging Human Feedback for Text-to-Video Model Alignment
🤗 Upvotes: 33 | cs.CV Authors: Yibin Wang, Zhiyu Tan, Junyan Wang, Xiaomeng Yang, Cheng Jin, Hao Li Title: LiFT: Leveraging Human Feedback for Text-to-Video Model Alignment Arxiv: http://arxiv.org/abs/2412.04814v1 Abstract: Recent advancements in text-to-video (T2V) generative models have shown impressive capabilities. However, these models are still inadequate in aligning synthesized videos with human preferences (e.g., accurately reflecting text descriptions), which is particularly difficult to address, as human preferences are inherently subjective and challenging to formalize as objective functions. Therefore, this paper proposes LiFT, a n...
2024-12-10
20 min
Daily Paper Cast
Aguvis: Unified Pure Vision Agents for Autonomous GUI Interaction
🤗 Upvotes: 32 | cs.CL Authors: Yiheng Xu, Zekun Wang, Junli Wang, Dunjie Lu, Tianbao Xie, Amrita Saha, Doyen Sahoo, Tao Yu, Caiming Xiong Title: Aguvis: Unified Pure Vision Agents for Autonomous GUI Interaction Arxiv: http://arxiv.org/abs/2412.04454v1 Abstract: Graphical User Interfaces (GUIs) are critical to human-computer interaction, yet automating GUI tasks remains challenging due to the complexity and variability of visual environments. Existing approaches often rely on textual representations of GUIs, which introduce limitations in generalization, efficiency, and scalability. In this paper, we introduce Aguvis, a uni...
2024-12-08
20 min
Daily Paper Cast
Code-as-Monitor: Constraint-aware Visual Programming for Reactive and Proactive Robotic Failure Detection
🤗 Upvotes: 32 | cs.RO, cs.AI, cs.CV, cs.LG Authors: Enshen Zhou, Qi Su, Cheng Chi, Zhizheng Zhang, Zhongyuan Wang, Tiejun Huang, Lu Sheng, He Wang Title: Code-as-Monitor: Constraint-aware Visual Programming for Reactive and Proactive Robotic Failure Detection Arxiv: http://arxiv.org/abs/2412.04455v1 Abstract: Automatic detection and prevention of open-set failures are crucial in closed-loop robotic systems. Recent studies often struggle to simultaneously identify unexpected failures reactively after they occur and prevent foreseeable ones proactively. To this end, we propose Code-as-Monitor (CaM), a novel paradigm leveraging the...
2024-12-08
22 min
Daily Paper Cast
Material Anything: Generating Materials for Any 3D Object via Diffusion
🤗 Paper Upvotes: 33 | cs.CV, cs.GR Authors: Xin Huang, Tengfei Wang, Ziwei Liu, Qing Wang Title: Material Anything: Generating Materials for Any 3D Object via Diffusion Arxiv: http://arxiv.org/abs/2411.15138v1 Abstract: We present Material Anything, a fully-automated, unified diffusion framework designed to generate physically-based materials for 3D objects. Unlike existing methods that rely on complex pipelines or case-specific optimizations, Material Anything offers a robust, end-to-end solution adaptable to objects under diverse lighting conditions. Our approach leverages a pre-trained image diffusion model, enhanced with a triple-head arc...
2024-11-27
21 min
Daily Paper Cast
Enhancing the Reasoning Ability of Multimodal Large Language Models via Mixed Preference Optimization
🤗 Paper Upvotes: 42 | cs.CL, cs.CV Authors: Weiyun Wang, Zhe Chen, Wenhai Wang, Yue Cao, Yangzhou Liu, Zhangwei Gao, Jinguo Zhu, Xizhou Zhu, Lewei Lu, Yu Qiao, Jifeng Dai Title: Enhancing the Reasoning Ability of Multimodal Large Language Models via Mixed Preference Optimization Arxiv: http://arxiv.org/abs/2411.10442v1 Abstract: Existing open-source multimodal large language models (MLLMs) generally follow a training process involving pre-training and supervised fine-tuning. However, these models suffer from distribution shifts, which limit their multimodal reasoning, particularly in the Chain-of-Thought (CoT) performance. To address thi...
2024-11-23
19 min
Daily Paper Cast
Marco-o1: Towards Open Reasoning Models for Open-Ended Solutions
🤗 Paper Upvotes: 23 | cs.CL Authors: Yu Zhao, Huifeng Yin, Bo Zeng, Hao Wang, Tianqi Shi, Chenyang Lyu, Longyue Wang, Weihua Luo, Kaifu Zhang Title: Marco-o1: Towards Open Reasoning Models for Open-Ended Solutions Arxiv: http://arxiv.org/abs/2411.14405v1 Abstract: Currently OpenAI o1 has sparked a surge of interest in the study of large reasoning models (LRM). Building on this momentum, Marco-o1 not only focuses on disciplines with standard answers, such as mathematics, physics, and coding -- which are well-suited for reinforcement learning (RL) -- but also places gre...
2024-11-23
19 min
Daily Paper Cast
VBench++: Comprehensive and Versatile Benchmark Suite for Video Generative Models
🤗 Paper Upvotes: 23 | cs.CV Authors: Ziqi Huang, Fan Zhang, Xiaojie Xu, Yinan He, Jiashuo Yu, Ziyue Dong, Qianli Ma, Nattapol Chanpaisit, Chenyang Si, Yuming Jiang, Yaohui Wang, Xinyuan Chen, Ying-Cong Chen, Limin Wang, Dahua Lin, Yu Qiao, Ziwei Liu Title: VBench++: Comprehensive and Versatile Benchmark Suite for Video Generative Models Arxiv: http://arxiv.org/abs/2411.13503v1 Abstract: Video generation has witnessed significant advancements, yet evaluating these models remains a challenge. A comprehensive evaluation benchmark for video generation is indispensable for two reasons: 1) Existing metrics do not fully ali...
2024-11-22
24 min
Daily Paper Cast
SEAGULL: No-reference Image Quality Assessment for Regions of Interest via Vision-Language Instruction Tuning
🤗 Paper Upvotes: 5 | cs.CV Authors: Zewen Chen, Juan Wang, Wen Wang, Sunhan Xu, Hang Xiong, Yun Zeng, Jian Guo, Shuxun Wang, Chunfeng Yuan, Bing Li, Weiming Hu Title: SEAGULL: No-reference Image Quality Assessment for Regions of Interest via Vision-Language Instruction Tuning Arxiv: http://arxiv.org/abs/2411.10161v1 Abstract: Existing Image Quality Assessment (IQA) methods achieve remarkable success in analyzing quality for overall image, but few works explore quality analysis for Regions of Interest (ROIs). The quality analysis of ROIs can provide fine-grained guidance for image quality improvement and...
2024-11-21
20 min
Daily Paper Cast
AnimateAnything: Consistent and Controllable Animation for Video Generation
🤗 Paper Upvotes: 12 | cs.CV Authors: Guojun Lei, Chi Wang, Hong Li, Rong Zhang, Yikai Wang, Weiwei Xu Title: AnimateAnything: Consistent and Controllable Animation for Video Generation Arxiv: http://arxiv.org/abs/2411.10836v1 Abstract: We present a unified controllable video generation approach AnimateAnything that facilitates precise and consistent video manipulation across various conditions, including camera trajectories, text prompts, and user motion annotations. Specifically, we carefully design a multi-scale control feature fusion network to construct a common motion representation for different conditions. It explicitly converts all control information into fra...
2024-11-20
22 min
Daily Paper Cast
LLaMA-Mesh: Unifying 3D Mesh Generation with Language Models
🤗 Paper Upvotes: 32 | cs.LG, cs.AI, cs.CL, cs.CV, 68T05, I.3.5; I.2.10; I.2.6 Authors: Zhengyi Wang, Jonathan Lorraine, Yikai Wang, Hang Su, Jun Zhu, Sanja Fidler, Xiaohui Zeng Title: LLaMA-Mesh: Unifying 3D Mesh Generation with Language Models Arxiv: http://arxiv.org/abs/2411.09595v1 Abstract: This work explores expanding the capabilities of large language models (LLMs) pretrained on text to generate 3D meshes within a unified model. This offers key advantages of (1) leveraging spatial knowledge already embedded in LLMs, derived from textual sources like 3D tutorials, and (2) ena...
2024-11-16
23 min
Daily Paper Cast
MagicQuill: An Intelligent Interactive Image Editing System
🤗 Paper Upvotes: 31 | cs.CV Authors: Zichen Liu, Yue Yu, Hao Ouyang, Qiuyu Wang, Ka Leong Cheng, Wen Wang, Zhiheng Liu, Qifeng Chen, Yujun Shen Title: MagicQuill: An Intelligent Interactive Image Editing System Arxiv: http://arxiv.org/abs/2411.09703v1 Abstract: Image editing involves a variety of complex tasks and requires efficient and precise manipulation techniques. In this paper, we present MagicQuill, an integrated image editing system that enables swift actualization of creative ideas. Our system features a streamlined yet functionally robust interface, allowing for the articulation of editing ope...
2024-11-16
20 min
Daily Paper Cast
TIP-I2V: A Million-Scale Real Text and Image Prompt Dataset for Image-to-Video Generation
🤗 Paper Upvotes: 20 | cs.CV Authors: Wenhao Wang, Yi Yang Title: TIP-I2V: A Million-Scale Real Text and Image Prompt Dataset for Image-to-Video Generation Arxiv: http://arxiv.org/abs/2411.04709v1 Abstract: Video generation models are revolutionizing content creation, with image-to-video models drawing increasing attention due to their enhanced controllability, visual consistency, and practical applications. However, despite their popularity, these models rely on user-provided text and image prompts, and there is currently no dedicated dataset for studying these prompts. In this paper, we introduce TIP-I2V, the first large-scale dat...
2024-11-09
24 min
Daily Paper Cast
Both Text and Images Leaked! A Systematic Analysis of Multimodal LLM Data Contamination
🤗 Paper Upvotes: 33 | cs.CV, cs.AI, cs.CL, cs.MM Authors: Dingjie Song, Sicheng Lai, Shunian Chen, Lichao Sun, Benyou Wang Title: Both Text and Images Leaked! A Systematic Analysis of Multimodal LLM Data Contamination Arxiv: http://arxiv.org/abs/2411.03823v1 Abstract: The rapid progression of multimodal large language models (MLLMs) has demonstrated superior performance on various multimodal benchmarks. However, the issue of data contamination during training creates challenges in performance evaluation and comparison. While numerous methods exist for detecting dataset contamination in large language models (LLMs), the...
2024-11-08
23 min
Daily Paper Cast
Large Language Models Orchestrating Structured Reasoning Achieve Kaggle Grandmaster Level
🤗 Paper Upvotes: 26 | cs.LG, cs.AI Authors: Antoine Grosnit, Alexandre Maraval, James Doran, Giuseppe Paolo, Albert Thomas, Refinath Shahul Hameed Nabeezath Beevi, Jonas Gonzalez, Khyati Khandelwal, Ignacio Iacobacci, Abdelhakim Benechehab, Hamza Cherkaoui, Youssef Attia El-Hili, Kun Shao, Jianye Hao, Jun Yao, Balazs Kegl, Haitham Bou-Ammar, Jun Wang Title: Large Language Models Orchestrating Structured Reasoning Achieve Kaggle Grandmaster Level Arxiv: http://arxiv.org/abs/2411.03562v1 Abstract: We introduce Agent K v1.0, an end-to-end autonomous data science agent designed to automate, optimise, and generalise across diverse data science tasks. Ful...
2024-11-08
20 min
Daily Paper Cast
Polynomial Composition Activations: Unleashing the Dynamics of Large Language Models
🤗 Paper Upvotes: 10 | cs.CL, cs.AI, cs.LG Authors: Zhijian Zhuo, Ya Wang, Yutao Zeng, Xiaoqing Li, Xun Zhou, Jinwen Ma Title: Polynomial Composition Activations: Unleashing the Dynamics of Large Language Models Arxiv: http://arxiv.org/abs/2411.03884v1 Abstract: Transformers have found extensive applications across various domains due to the powerful fitting capabilities. This success can be partially attributed to their inherent nonlinearity. Thus, in addition to the ReLU function employed in the original transformer architecture, researchers have explored alternative modules such as GeLU and SwishGLU to enh...
2024-11-08
23 min
Daily Paper Cast
HtmlRAG: HTML is Better Than Plain Text for Modeling Retrieved Knowledge in RAG Systems
🤗 Paper Upvotes: 34 | cs.IR Authors: Jiejun Tan, Zhicheng Dou, Wen Wang, Mang Wang, Weipeng Chen, Ji-Rong Wen Title: HtmlRAG: HTML is Better Than Plain Text for Modeling Retrieved Knowledge in RAG Systems Arxiv: http://arxiv.org/abs/2411.02959v1 Abstract: Retrieval-Augmented Generation (RAG) has been shown to improve knowledge capabilities and alleviate the hallucination problem of LLMs. The Web is a major source of external knowledge used in RAG systems, and many commercial systems such as ChatGPT and Perplexity have used Web search engines as their major retrieval sys...
2024-11-07
21 min
Daily Paper Cast
DeeR-VLA: Dynamic Inference of Multimodal Large Language Models for Efficient Robot Execution
🤗 Paper Upvotes: 10 | cs.RO, cs.AI, cs.LG Authors: Yang Yue, Yulin Wang, Bingyi Kang, Yizeng Han, Shenzhi Wang, Shiji Song, Jiashi Feng, Gao Huang Title: DeeR-VLA: Dynamic Inference of Multimodal Large Language Models for Efficient Robot Execution Arxiv: http://arxiv.org/abs/2411.02359v1 Abstract: MLLMs have demonstrated remarkable comprehension and reasoning capabilities with complex language and visual data. These advances have spurred the vision of establishing a generalist robotic MLLM proficient in understanding complex human instructions and accomplishing various embodied tasks. However, developing MLLMs for real-world rob...
2024-11-07
19 min
Daily Paper Cast
MVPaint: Synchronized Multi-View Diffusion for Painting Anything 3D
🤗 Paper Upvotes: 20 | cs.CV Authors: Wei Cheng, Juncheng Mu, Xianfang Zeng, Xin Chen, Anqi Pang, Chi Zhang, Zhibin Wang, Bin Fu, Gang Yu, Ziwei Liu, Liang Pan Title: MVPaint: Synchronized Multi-View Diffusion for Painting Anything 3D Arxiv: http://arxiv.org/abs/2411.02336v1 Abstract: Texturing is a crucial step in the 3D asset production workflow, which enhances the visual appeal and diversity of 3D assets. Despite recent advancements in Text-to-Texture (T2T) generation, existing methods often yield subpar results, primarily due to local discontinuities, inconsistencies across multiple views, and...
2024-11-06
21 min
Daily Paper Cast
Training-free Regional Prompting for Diffusion Transformers
🤗 Paper Upvotes: 19 | cs.CV Authors: Anthony Chen, Jianjin Xu, Wenzhao Zheng, Gaole Dai, Yida Wang, Renrui Zhang, Haofan Wang, Shanghang Zhang Title: Training-free Regional Prompting for Diffusion Transformers Arxiv: http://arxiv.org/abs/2411.02395v1 Abstract: Diffusion models have demonstrated excellent capabilities in text-to-image generation. Their semantic understanding (i.e., prompt following) ability has also been greatly improved with large language models (e.g., T5, Llama). However, existing models cannot perfectly handle long and complex text prompts, especially when the text prompts contain various objects with numerous attributes and...
2024-11-06
17 min
Daily Paper Cast
How Far is Video Generation from World Model: A Physical Law Perspective
🤗 Paper Upvotes: 19 | cs.CV, cs.AI Authors: Bingyi Kang, Yang Yue, Rui Lu, Zhijie Lin, Yang Zhao, Kaixin Wang, Gao Huang, Jiashi Feng Title: How Far is Video Generation from World Model: A Physical Law Perspective Arxiv: http://arxiv.org/abs/2411.02385v1 Abstract: OpenAI's Sora highlights the potential of video generation for developing world models that adhere to fundamental physical laws. However, the ability of video generation models to discover such laws purely from visual data without human priors can be questioned. A world model learning the tru...
2024-11-06
23 min
Daily Paper Cast
Hunyuan-Large: An Open-Source MoE Model with 52 Billion Activated Parameters by Tencent
🤗 Paper Upvotes: 16 | cs.CL, cs.AI Authors: Xingwu Sun, Yanfeng Chen, Yiqing Huang, Ruobing Xie, Jiaqi Zhu, Kai Zhang, Shuaipeng Li, Zhen Yang, Jonny Han, Xiaobo Shu, Jiahao Bu, Zhongzhi Chen, Xuemeng Huang, Fengzong Lian, Saiyong Yang, Jianfeng Yan, Yuyuan Zeng, Xiaoqin Ren, Chao Yu, Lulu Wu, Yue Mao, Jun Xia, Tao Yang, Suncong Zheng, Kan Wu, Dian Jiao, Jinbao Xue, Xipeng Zhang, Decheng Wu, Kai Liu, Dengpeng Wu, Guanghui Xu, Shaohua Chen, Shuang Chen, Xiao Feng, Yigeng Hong, Junqiang Zheng, Chengcheng Xu, Zongwei Li, Xiong Kuang, Jianglu Hu, Yiqi Chen, Yuchi Deng, Guiyang Li, Ao Liu...
2024-11-06
18 min
Daily Paper Cast
GenXD: Generating Any 3D and 4D Scenes
🤗 Paper Upvotes: 13 | cs.CV, cs.AI Authors: Yuyang Zhao, Chung-Ching Lin, Kevin Lin, Zhiwen Yan, Linjie Li, Zhengyuan Yang, Jianfeng Wang, Gim Hee Lee, Lijuan Wang Title: GenXD: Generating Any 3D and 4D Scenes Arxiv: http://arxiv.org/abs/2411.02319v2 Abstract: Recent developments in 2D visual generation have been remarkably successful. However, 3D and 4D generation remain challenging in real-world applications due to the lack of large-scale 4D data and effective model design. In this paper, we propose to jointly investigate general 3D and 4D generation by lev...
2024-11-06
22 min
Daily Paper Cast
OS-ATLAS: A Foundation Action Model for Generalist GUI Agents
🤗 Paper Upvotes: 32 | cs.CL, cs.CV, cs.HC Authors: Zhiyong Wu, Zhenyu Wu, Fangzhi Xu, Yian Wang, Qiushi Sun, Chengyou Jia, Kanzhi Cheng, Zichen Ding, Liheng Chen, Paul Pu Liang, Yu Qiao Title: OS-ATLAS: A Foundation Action Model for Generalist GUI Agents Arxiv: http://arxiv.org/abs/2410.23218v1 Abstract: Existing efforts in building GUI agents heavily rely on the availability of robust commercial Vision-Language Models (VLMs) such as GPT-4o and GeminiProVision. Practitioners are often reluctant to use open-source VLMs due to their significant performance lag compared to...
2024-11-05
20 min
Daily Paper Cast
Personalization of Large Language Models: A Survey
🤗 Paper Upvotes: 14 | cs.CL Authors: Zhehao Zhang, Ryan A. Rossi, Branislav Kveton, Yijia Shao, Diyi Yang, Hamed Zamani, Franck Dernoncourt, Joe Barrow, Tong Yu, Sungchul Kim, Ruiyi Zhang, Jiuxiang Gu, Tyler Derr, Hongjie Chen, Junda Wu, Xiang Chen, Zichao Wang, Subrata Mitra, Nedim Lipka, Nesreen Ahmed, Yu Wang Title: Personalization of Large Language Models: A Survey Arxiv: http://arxiv.org/abs/2411.00027v1 Abstract: Personalization of Large Language Models (LLMs) has recently become increasingly important with a wide range of applications. Despite the importance and recent progress, most exi...
2024-11-05
25 min
Daily Paper Cast
In-Context LoRA for Diffusion Transformers
🤗 Paper Upvotes: 7 | cs.CV, cs.GR Authors: Lianghua Huang, Wei Wang, Zhi-Fan Wu, Yupeng Shi, Huanzhang Dou, Chen Liang, Yutong Feng, Yu Liu, Jingren Zhou Title: In-Context LoRA for Diffusion Transformers Arxiv: http://arxiv.org/abs/2410.23775v2 Abstract: Recent research arXiv:2410.15027 has explored the use of diffusion transformers (DiTs) for task-agnostic image generation by simply concatenating attention tokens across images. However, despite substantial computational resources, the fidelity of the generated images remains suboptimal. In this study, we reevaluate and streamline this framework by hypothesizing that text-to-image DiTs inh...
2024-11-05
20 min
Daily Paper Cast (Test)
AutoKaggle: A Multi-Agent Framework for Autonomous Data Science Competitions
🤗 Paper Upvotes: 31 | cs.AI, cs.CL Authors: Ziming Li, Qianbo Zang, David Ma, Jiawei Guo, Tuney Zheng, Minghao Liu, Xinyao Niu, Yue Wang, Jian Yang, Jiaheng Liu, Wanjun Zhong, Wangchunshu Zhou, Wenhao Huang, Ge Zhang Title: AutoKaggle: A Multi-Agent Framework for Autonomous Data Science Competitions Arxiv: http://arxiv.org/abs/2410.20424v2 Abstract: Data science tasks involving tabular data present complex challenges that require sophisticated problem-solving approaches. We propose AutoKaggle, a powerful and user-centric framework that assists data scientists in completing daily data pipelines through a collaborative multi-agent sys...
2024-11-04
04 min
Daily Paper Cast
Constraint Back-translation Improves Complex Instruction Following of Large Language Models
🤗 Daily Paper Upvotes: 12 Authors: Yunjia Qi, Hao Peng, Xiaozhi Wang, Bin Xu, Lei Hou, Juanzi Li Categories: cs.CL, cs.AI Arxiv: http://arxiv.org/abs/2410.24175v1 Title: Constraint Back-translation Improves Complex Instruction Following of Large Language Models Abstract: Large language models (LLMs) struggle to follow instructions with complex constraints in format, length, etc. Following the conventional instruction-tuning practice, previous works conduct post-training on complex instruction-response pairs generated by feeding complex instructions to advanced LLMs. However, even advanced LLMs cannot follow complex instructions well, thus limiting the quality of generated data. In this work, we find that existing datasets inherently con...
2024-11-03
19 min
Daily Paper Cast
BitStack: Fine-Grained Size Control for Compressed Large Language Models in Variable Memory Environments
🤗 Daily Paper Upvotes: 11 Authors: Xinghao Wang, Pengyu Wang, Bo Wang, Dong Zhang, Yunhua Zhou, Xipeng Qiu Categories: cs.CL, cs.AI, cs.CV, cs.LG Arxiv: http://arxiv.org/abs/2410.23918v1 Title: BitStack: Fine-Grained Size Control for Compressed Large Language Models in Variable Memory Environments Abstract: Large language models (LLMs) have revolutionized numerous applications, yet their deployment remains challenged by memory constraints on local devices. While scaling laws have enhanced LLM capabilities, the primary bottleneck has shifted from capability to availability, emphasizing the need for efficient memory management. Traditional compression methods, such as quantization, often require predefined compression rat...
2024-11-03
17 min
Daily Paper Cast (Test)
BitStack: Fine-Grained Size Control for Compressed Large Language Models in Variable Memory Environments
🤗 Daily Paper Upvotes: 11 Authors: Xinghao Wang, Pengyu Wang, Bo Wang, Dong Zhang, Yunhua Zhou, Xipeng Qiu Categories: cs.CL, cs.AI, cs.CV, cs.LG Arxiv: http://arxiv.org/abs/2410.23918v1 Title: BitStack: Fine-Grained Size Control for Compressed Large Language Models in Variable Memory Environments Abstract: Large language models (LLMs) have revolutionized numerous applications, yet their deployment remains challenged by memory constraints on local devices. While scaling laws have enhanced LLM capabilities, the primary bottleneck has shifted from capability to availability, emphasizing the need for efficient memory management. Traditional compression methods, such as quantization, often require predefined compression rat...
2024-11-03
17 min
Daily Paper Cast (Test)
AAAR-1.0: Assessing AI's Potential to Assist Research
🤗 Daily Paper Upvotes: 10 Authors: Renze Lou, Hanzi Xu, Sijia Wang, Jiangshu Du, Ryo Kamoi, Xiaoxin Lu, Jian Xie, Yuxuan Sun, Yusen Zhang, Jihyun Janice Ahn, Hongchao Fang, Zhuoyang Zou, Wenchao Ma, Xi Li, Kai Zhang, Congying Xia, Lifu Huang, Wenpeng Yin Categories: cs.CL Arxiv: http://arxiv.org/abs/2410.22394v1 Title: AAAR-1.0: Assessing AI's Potential to Assist Research Abstract: Numerous studies have assessed the proficiency of AI systems, particularly large language models (LLMs), in facilitating everyday tasks such as email writing, question answering, and creative content generation. However, researchers face unique challenges and opportunities in leveraging LLMs for their own wor...
2024-11-03
22 min
Daily Paper Cast
AAAR-1.0: Assessing AI's Potential to Assist Research
🤗 Daily Paper Upvotes: 10 Authors: Renze Lou, Hanzi Xu, Sijia Wang, Jiangshu Du, Ryo Kamoi, Xiaoxin Lu, Jian Xie, Yuxuan Sun, Yusen Zhang, Jihyun Janice Ahn, Hongchao Fang, Zhuoyang Zou, Wenchao Ma, Xi Li, Kai Zhang, Congying Xia, Lifu Huang, Wenpeng Yin Categories: cs.CL Arxiv: http://arxiv.org/abs/2410.22394v1 Title: AAAR-1.0: Assessing AI's Potential to Assist Research Abstract: Numerous studies have assessed the proficiency of AI systems, particularly large language models (LLMs), in facilitating everyday tasks such as email writing, question answering, and creative content generation. However, researchers face unique challenges and opportunities in leveraging LLMs for their own wor...
2024-11-03
22 min