Showing episodes and shows of Gengyu Wang

Shows

Daily Paper Cast
Step-Audio 2 Technical Report 🤗 Upvotes: 42 | cs.CL, cs.SD, eess.AS Authors: Boyong Wu, Chao Yan, Chen Hu, Cheng Yi, Chengli Feng, Fei Tian, Feiyu Shen, Gang Yu, Haoyang Zhang, Jingbei Li, Mingrui Chen, Peng Liu, Wang You, Xiangyu Tony Zhang, Xingyuan Li, Xuerui Yang, Yayue Deng, Yechang Huang, Yuxin Li, Yuxin Zhang, Zhao You, Brian Li, Changyi Wan, Hanpeng Hu, Jiangjie Zhen, Siyu Chen, Song Yuan, Xuelin Zhang, Yimin Jiang, Yu Zhou, Yuxiang Yang, Bingxin Li, Buyun Ma, Changhe Song, Dongqing Pang, Guoqiang Hu, Haiyang Sun, Kang An, Na Wang, Shuli Gao, Wei Ji, Wen Li, Wen Sun, Xuan Wen...
2025-07-24, 22 min

Daily Paper Cast
AnyCap Project: A Unified Framework, Dataset, and Benchmark for Controllable Omni-modal Captioning 🤗 Upvotes: 30 | cs.CV Authors: Yiming Ren, Zhiqiang Lin, Yu Li, Gao Meng, Weiyun Wang, Junjie Wang, Zicheng Lin, Jifeng Dai, Yujiu Yang, Wenhai Wang, Ruihang Chu Title: AnyCap Project: A Unified Framework, Dataset, and Benchmark for Controllable Omni-modal Captioning Arxiv: http://arxiv.org/abs/2507.12841v1 Abstract: Controllable captioning is essential for precise multimodal alignment and instruction following, yet existing models often lack fine-grained control and reliable evaluation protocols. To address this gap, we present the AnyCap Project, an integrated solution spanning model, dataset, and evaluation. We introduce Any...
2025-07-19, 22 min

Daily Paper Cast
Test-Time Scaling with Reflective Generative Model 🤗 Upvotes: 68 | cs.LG, cs.CL Authors: Zixiao Wang, Yuxin Wang, Xiaorui Wang, Mengting Xing, Jie Gao, Jianjun Xu, Guangcan Liu, Chenhui Jin, Zhuo Wang, Shengzhuo Zhang, Hongtao Xie Title: Test-Time Scaling with Reflective Generative Model Arxiv: http://arxiv.org/abs/2507.01951v2 Abstract: We introduce our first reflective generative model MetaStone-S1, which obtains OpenAI o3-mini's performance via the new Reflective Generative Form.
The new form focuses on high-quality reasoning trajectory selection and contains two novelties: 1) A unified interface for policy and process reward model: we share the bac...
2025-07-15, 21 min

Daily Paper Cast
Gemini 2.5: Pushing the Frontier with Advanced Reasoning, Multimodality, Long Context, and Next Generation Agentic Capabilities 🤗 Upvotes: 24 | cs.CL, cs.AI Authors: Gheorghe Comanici, Eric Bieber, Mike Schaekermann, Ice Pasupat, Noveen Sachdeva, Inderjit Dhillon, Marcel Blistein, Ori Ram, Dan Zhang, Evan Rosen, Luke Marris, Sam Petulla, Colin Gaffney, Asaf Aharoni, Nathan Lintz, Tiago Cardal Pais, Henrik Jacobsson, Idan Szpektor, Nan-Jiang Jiang, Krishna Haridasan, Ahmed Omran, Nikunj Saunshi, Dara Bahri, Gaurav Mishra, Eric Chu, Toby Boyd, Brad Hekman, Aaron Parisi, Chaoyi Zhang, Kornraphop Kawintiranon, Tania Bedrax-Weiss, Oliver Wang, Ya Xu, Ollie Purkiss, Uri Mendlovic, Ilaï Deutel, Nam Nguyen, Adam Langley, Flip Korn, Lucia Rossazza, Alexandre Ramé, Sagar Waghmare, Helen Miller, Vaishakh Keshava, Ying Jian...
2025-07-15, 20 min

Daily Paper Cast
Traceable Evidence Enhanced Visual Grounded Reasoning: Evaluation and Methodology 🤗 Upvotes: 37 | cs.CV, cs.AI, cs.CL Authors: Haochen Wang, Xiangtai Li, Zilong Huang, Anran Wang, Jiacong Wang, Tao Zhang, Jiani Zheng, Sule Bai, Zijian Kang, Jiashi Feng, Zhuochen Wang, Zhaoxiang Zhang Title: Traceable Evidence Enhanced Visual Grounded Reasoning: Evaluation and Methodology Arxiv: http://arxiv.org/abs/2507.07999v1 Abstract: Models like OpenAI-o3 pioneer visual grounded reasoning by dynamically referencing visual regions, just like human "thinking with images". However, no benchmark exists to evaluate these capabilities holistically.
To bridge this gap, we propose TreeBench (Traceable Evidence Evaluation Benchmark), a d...
2025-07-12, 20 min

Daily Paper Cast
4KAgent: Agentic Any Image to 4K Super-Resolution 🤗 Upvotes: 56 | cs.CV, eess.IV Authors: Yushen Zuo, Qi Zheng, Mingyang Wu, Xinrui Jiang, Renjie Li, Jian Wang, Yide Zhang, Gengchen Mai, Lihong V. Wang, James Zou, Xiaoyu Wang, Ming-Hsuan Yang, Zhengzhong Tu Title: 4KAgent: Agentic Any Image to 4K Super-Resolution Arxiv: http://arxiv.org/abs/2507.07105v1 Abstract: We present 4KAgent, a unified agentic super-resolution generalist system designed to universally upscale any image to 4K resolution (and even higher, if applied iteratively). Our system can transform images from extremely low resolutions with severe degradations, for example, highly dis...
2025-07-11, 26 min

Daily Paper Cast
MemOS: A Memory OS for AI System 🤗 Upvotes: 83 | cs.CL Authors: Zhiyu Li, Shichao Song, Chenyang Xi, Hanyu Wang, Chen Tang, Simin Niu, Ding Chen, Jiawei Yang, Chunyu Li, Qingchen Yu, Jihao Zhao, Yezhaohui Wang, Peng Liu, Zehao Lin, Pengyuan Wang, Jiahao Huo, Tianyi Chen, Kai Chen, Kehang Li, Zhen Tao, Junpeng Ren, Huayi Lai, Hao Wu, Bo Tang, Zhenren Wang, Zhaoxin Fan, Ningyu Zhang, Linfeng Zhang, Junchi Yan, Mingchuan Yang, Tong Xu, Wei Xu, Huajun Chen, Haofeng Wang, Hongkang Yang, Wentao Zhang, Zhi-Qin John Xu, Siheng Chen, Feiyu Xiong Title: MemOS: A Memory OS for AI System Arx...
2025-07-09, 21 min

Daily Paper Cast
Kwai Keye-VL Technical Report 🤗 Upvotes: 97 | cs.CV Authors: Kwai Keye Team, Biao Yang, Bin Wen, Changyi Liu, Chenglong Chu, Chengru Song, Chongling Rao, Chuan Yi, Da Li, Dunju Zang, Fan Yang, Guorui Zhou, Hao Peng, Haojie Ding, Jiaming Huang, Jiangxia Cao, Jiankang Chen, Jingyun Hua, Jin Ouyang, Kaibing Chen, Kaiyu Jiang, Kaiyu Tang, Kun Gai, Shengnan Zhang, Siyang Mao, Sui Huang, Tianke Zhang, Tingting Gao, Wei Chen, Wei Yuan, Xiangyu Wu, Xiao Hu, Xingyu Lu, Yang Zhou, Yi-Fan Zhang, Yiping Yang, Yulong
Chen, Zhenhua Wu, Zhenyu Li, Zhixin Ling, Ziming Li, Dehua Ma, Di Xu, Haixuan Gao, Hang Li, Jiawei Guo...
2025-07-04, 22 min

Daily Paper Cast
GLM-4.1V-Thinking: Towards Versatile Multimodal Reasoning with Scalable Reinforcement Learning 🤗 Upvotes: 141 | cs.CV, cs.AI, cs.LG Authors: GLM-V Team, :, Wenyi Hong, Wenmeng Yu, Xiaotao Gu, Guo Wang, Guobing Gan, Haomiao Tang, Jiale Cheng, Ji Qi, Junhui Ji, Lihang Pan, Shuaiqi Duan, Weihan Wang, Yan Wang, Yean Cheng, Zehai He, Zhe Su, Zhen Yang, Ziyang Pan, Aohan Zeng, Baoxu Wang, Boyan Shi, Changyu Pang, Chenhui Zhang, Da Yin, Fan Yang, Guoqing Chen, Jiazheng Xu, Jiali Chen, Jing Chen, Jinhao Chen, Jinghao Lin, Jinjiang Wang, Junjie Chen, Leqi Lei, Letian Gong, Leyi Pan, Mingzhi Zhang, Qinkai Zheng, Sheng Yang, Shi Zhong, Shiyu Huang, Shuyuan Zhao, Siyan Xue, Sha...
2025-07-03, 24 min

Daily Paper Cast
VRBench: A Benchmark for Multi-Step Reasoning in Long Narrative Videos 🤗 Upvotes: 29 | cs.CV, cs.AI, cs.MM Authors: Jiashuo Yu, Yue Wu, Meng Chu, Zhifei Ren, Zizheng Huang, Pei Chu, Ruijie Zhang, Yinan He, Qirui Li, Songze Li, Zhenxiang Li, Zhongying Tu, Conghui He, Yu Qiao, Yali Wang, Yi Wang, Limin Wang Title: VRBench: A Benchmark for Multi-Step Reasoning in Long Narrative Videos Arxiv: http://arxiv.org/abs/2506.10857v1 Abstract: We present VRBench, the first long narrative video benchmark crafted for evaluating large models' multi-step reasoning capabilities, addressing limitations in existing evaluations that overlook temporal reasoning and pro...
2025-06-14, 22 min

Daily Paper Cast
Autoregressive Semantic Visual Reconstruction Helps VLMs Understand Better 🤗 Upvotes: 26 | cs.CV, cs.AI, cs.CL Authors: Dianyi Wang, Wei Song, Yikun Wang, Siyuan Wang, Kaicheng Yu, Zhongyu Wei, Jiaqi Wang Title: Autoregressive Semantic Visual Reconstruction Helps VLMs Understand Better Arxiv: http://arxiv.org/abs/2506.09040v1 Abstract: Typical large vision-language models (LVLMs) apply autoregressive
supervision solely to textual sequences, without fully incorporating the visual modality into the learning process. This results in three key limitations: (1) an inability to utilize images without accompanying captions, (2) the risk that captions omit critical visual details, and (3) the challenge that certain vis...
2025-06-12, 20 min

Daily Paper Cast
MiniCPM4: Ultra-Efficient LLMs on End Devices 🤗 Upvotes: 60 | cs.CL, cs.AI Authors: MiniCPM Team, Chaojun Xiao, Yuxuan Li, Xu Han, Yuzhuo Bai, Jie Cai, Haotian Chen, Wentong Chen, Xin Cong, Ganqu Cui, Ning Ding, Shengdan Fan, Yewei Fang, Zixuan Fu, Wenyu Guan, Yitong Guan, Junshao Guo, Yufeng Han, Bingxiang He, Yuxiang Huang, Cunliang Kong, Qiuzuo Li, Siyuan Li, Wenhao Li, Yanghao Li, Yishan Li, Zhen Li, Dan Liu, Biyuan Lin, Yankai Lin, Xiang Long, Quanyu Lu, Yaxi Lu, Peiyan Luo, Hongya Lyu, Litu Ou, Yinxu Pan, Zekai Qu, Qundong Shi, Zijun Song, Jiayuan Su, Zhou Su, Ao Sun, Xianghui Sun, Peijun Tang, Fan...
2025-06-11, 20 min

Daily Paper Cast
Astra: Toward General-Purpose Mobile Robots via Hierarchical Multimodal Learning 🤗 Upvotes: 25 | cs.RO, cs.AI Authors: Sheng Chen, Peiyu He, Jiaxin Hu, Ziyang Liu, Yansheng Wang, Tao Xu, Chi Zhang, Chongchong Zhang, Chao An, Shiyu Cai, Duo Cao, Kangping Chen, Shuai Chu, Tianwei Chu, Mingdi Dan, Min Du, Weiwei Fang, Pengyou Fu, Junkai Hu, Xiaowei Jiang, Zhaodi Jiang, Fuxuan Li, Jun Li, Minghui Li, Mingyao Li, Yanchang Li, Zhibin Li, Guangming Liu, Kairui Liu, Lihao Liu, Weizhi Liu, Xiaoshun Liu, Yufei Liu, Yunfei Liu, Qiang Lu, Yuanfei Luo, Xiang Lv, Hongying Ma, Sai Ma, Lingxian Mi, Sha Sa, Hongxiang Shu, Lei Tian, Chengzhi Wang, Jiayu Wang, Kai...
2025-06-11, 21 min

Daily Paper Cast
MiMo-VL Technical Report 🤗 Upvotes: 58 | cs.CL Authors: Xiaomi LLM-Core Team, :, Zihao Yue, Zhenru Lin, Yifan Song, Weikun Wang, Shuhuai Ren, Shuhao Gu, Shicheng Li, Peidian Li, Liang Zhao, Lei Li, Kainan Bao, Hao Tian, Hailin Zhang, Gang Wang, Dawei Zhu,
Cici, Chenhong He, Bowen Ye, Bowen Shen, Zihan Zhang, Zihan Jiang, Zhixian Zheng, Zhichao Song, Zhenbo Luo, Yue Yu, Yudong Wang, Yuanyuan Tian, Yu Tu, Yihan Yan, Yi Huang, Xu Wang, Xinzhe Xu, Xingchen Song, Xing Zhang, Xing Yong, Xin Zhang, Xiangwei Deng, Wenyu Yang, Wenhan Ma, Weiwei Lv, Weiji Zhuang, Wei Liu, Sirui Deng, Shuo Liu, Shimao Chen, Shi...
2025-06-06, 19 min

Daily Paper Cast
Shifting AI Efficiency From Model-Centric to Data-Centric Compression 🤗 Upvotes: 124 | cs.CL, cs.AI, cs.CV Authors: Xuyang Liu, Zichen Wen, Shaobo Wang, Junjie Chen, Zhishan Tao, Yubo Wang, Xiangqi Jin, Chang Zou, Yiyu Wang, Chenfei Liao, Xu Zheng, Honggang Chen, Weijia Li, Xuming Hu, Conghui He, Linfeng Zhang Title: Shifting AI Efficiency From Model-Centric to Data-Centric Compression Arxiv: http://arxiv.org/abs/2505.19147v1 Abstract: The rapid advancement of large language models (LLMs) and multi-modal LLMs (MLLMs) has historically relied on model-centric scaling through increasing parameter counts from millions to hundreds of billions to drive performance gai...
2025-05-28, 22 min

Daily Paper Cast
Scaling Computer-Use Grounding via User Interface Decomposition and Synthesis 🤗 Upvotes: 34 | cs.AI, cs.CL, cs.CV, cs.HC Authors: Tianbao Xie, Jiaqi Deng, Xiaochuan Li, Junlin Yang, Haoyuan Wu, Jixuan Chen, Wenjing Hu, Xinyuan Wang, Yuhui Xu, Zekun Wang, Yiheng Xu, Junli Wang, Doyen Sahoo, Tao Yu, Caiming Xiong Title: Scaling Computer-Use Grounding via User Interface Decomposition and Synthesis Arxiv: http://arxiv.org/abs/2505.13227v1 Abstract: Graphical user interface (GUI) grounding, the ability to map natural language instructions to specific actions on graphical user interfaces, remains a critical bottleneck in computer use agent development.
Current benchmarks ove...
2025-05-21, 22 min

Daily Paper Cast
Qwen3 Technical Report 🤗 Upvotes: 117 | cs.CL Authors: An Yang, Anfeng Li, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chang Gao, Chengen Huang, Chenxu Lv, Chujie Zheng, Dayiheng Liu, Fan Zhou, Fei Huang, Feng Hu, Hao Ge, Haoran Wei, Huan Lin, Jialong Tang, Jian Yang, Jianhong Tu, Jianwei Zhang, Jianxin Yang, Jiaxi Yang, Jing Zhou, Jingren Zhou, Junyang Lin, Kai Dang, Keqin Bao, Kexin Yang, Le Yu, Lianghao Deng, Mei Li, Mingfeng Xue, Mingze Li, Pei Zhang, Peng Wang, Qin Zhu, Rui Men, Ruize Gao, Shixuan Liu, Shuang Luo, Tianhao Li, Tianyi Tang, Wenbiao Yin, Xingzhang Ren, Xin...
2025-05-20, 21 min

Daily Paper Cast
Seed1.5-VL Technical Report 🤗 Upvotes: 86 | cs.CV, cs.AI Authors: Dong Guo, Faming Wu, Feida Zhu, Fuxing Leng, Guang Shi, Haobin Chen, Haoqi Fan, Jian Wang, Jianyu Jiang, Jiawei Wang, Jingji Chen, Jingjia Huang, Kang Lei, Liping Yuan, Lishu Luo, Pengfei Liu, Qinghao Ye, Rui Qian, Shen Yan, Shixiong Zhao, Shuai Peng, Shuangye Li, Sihang Yuan, Sijin Wu, Tianheng Cheng, Weiwei Liu, Wenqian Wang, Xianhan Zeng, Xiao Liu, Xiaobo Qin, Xiaohan Ding, Xiaojun Xiao, Xiaoying Zhang, Xuanwei Zhang, Xuehan Xiong, Yanghua Peng, Yangrui Chen, Yanwei Li, Yanxu Hu, Yi Lin, Yiyuan Hu, Yiyuan Zhang, Youbin Wu, Yu Li, Yudong Liu, Yue...
2025-05-14, 20 min

Daily Paper Cast
MiMo: Unlocking the Reasoning Potential of Language Model -- From Pretraining to Posttraining 🤗 Upvotes: 53 | cs.CL, cs.AI, cs.LG Authors: Xiaomi LLM-Core Team, :, Bingquan Xia, Bowen Shen, Cici, Dawei Zhu, Di Zhang, Gang Wang, Hailin Zhang, Huaqiu Liu, Jiebao Xiao, Jinhao Dong, Liang Zhao, Peidian Li, Peng Wang, Shihua Yu, Shimao Chen, Weikun Wang, Wenhan Ma, Xiangwei Deng, Yi Huang, Yifan Song, Zihan Jiang, Bowen Ye, Can Cai, Chenhong He, Dong Zhang, Duo Zhang, Guoan Wang, Hao Tian, Haochen Zhao, Heng Qu, Hongshen Xu, Jun Shi, Kainan Bao, QingKai Fang, Kang Zhou, Kangyang Zhou, Lei Li,
Menghang Zhu, Nuo Chen, Qiantong Wang, Shaohui Liu, Shicheng Li, Shuhao Gu, Shu...
2025-05-14, 21 min

Daily Paper Cast
Perception, Reason, Think, and Plan: A Survey on Large Multimodal Reasoning Models 🤗 Upvotes: 79 | cs.CV, cs.CL Authors: Yunxin Li, Zhenyu Liu, Zitao Li, Xuanyu Zhang, Zhenran Xu, Xinyu Chen, Haoyuan Shi, Shenyuan Jiang, Xintong Wang, Jifang Wang, Shouzheng Huang, Xinping Zhao, Borui Jiang, Lanqing Hong, Longyue Wang, Zhuotao Tian, Baoxing Huai, Wenhan Luo, Weihua Luo, Zheng Zhang, Baotian Hu, Min Zhang Title: Perception, Reason, Think, and Plan: A Survey on Large Multimodal Reasoning Models Arxiv: http://arxiv.org/abs/2505.04921v1 Abstract: Reasoning lies at the heart of intelligence, shaping the ability to make decisions, draw conclusions, and generalize acr...
2025-05-10, 23 min

Daily Paper Cast
Unified Multimodal Chain-of-Thought Reward Model through Reinforcement Fine-Tuning 🤗 Upvotes: 67 | cs.CV Authors: Yibin Wang, Zhimin Li, Yuhang Zang, Chunyu Wang, Qinglin Lu, Cheng Jin, Jiaqi Wang Title: Unified Multimodal Chain-of-Thought Reward Model through Reinforcement Fine-Tuning Arxiv: http://arxiv.org/abs/2505.03318v1 Abstract: Recent advances in multimodal Reward Models (RMs) have shown significant promise in delivering reward signals to align vision models with human preferences. However, current RMs are generally restricted to providing direct responses or engaging in shallow reasoning processes with limited depth, often leading to inaccurate reward signals.
We posit that incorporating explicit long cha...
2025-05-08, 22 min

Daily Paper Cast
RM-R1: Reward Modeling as Reasoning 🤗 Upvotes: 48 | cs.CL, cs.AI, cs.LG Authors: Xiusi Chen, Gaotang Li, Ziqi Wang, Bowen Jin, Cheng Qian, Yu Wang, Hongru Wang, Yu Zhang, Denghui Zhang, Tong Zhang, Hanghang Tong, Heng Ji Title: RM-R1: Reward Modeling as Reasoning Arxiv: http://arxiv.org/abs/2505.02387v1 Abstract: Reward modeling is essential for aligning large language models (LLMs) with human preferences, especially through reinforcement learning from human feedback (RLHF). To provide accurate reward signals, a reward model (RM) should stimulate deep thinking and conduct interpretable reasoning before assigning a score or...
2025-05-07, 23 min

Daily Paper Cast
Reinforcement Learning for Reasoning in Large Language Models with One Training Example 🤗 Upvotes: 49 | cs.LG, cs.AI, cs.CL Authors: Yiping Wang, Qing Yang, Zhiyuan Zeng, Liliang Ren, Lucas Liu, Baolin Peng, Hao Cheng, Xuehai He, Kuan Wang, Jianfeng Gao, Weizhu Chen, Shuohang Wang, Simon Shaolei Du, Yelong Shen Title: Reinforcement Learning for Reasoning in Large Language Models with One Training Example Arxiv: http://arxiv.org/abs/2504.20571v1 Abstract: We show that reinforcement learning with verifiable reward using one training example (1-shot RLVR) is effective in incentivizing the math reasoning capabilities of large language models (LLMs).
Applying RLVR to the...
2025-05-01, 22 min

Daily Paper Cast
The Bitter Lesson Learned from 2,000+ Multilingual Benchmarks 🤗 Upvotes: 51 | cs.CL Authors: Minghao Wu, Weixuan Wang, Sinuo Liu, Huifeng Yin, Xintong Wang, Yu Zhao, Chenyang Lyu, Longyue Wang, Weihua Luo, Kaifu Zhang Title: The Bitter Lesson Learned from 2,000+ Multilingual Benchmarks Arxiv: http://arxiv.org/abs/2504.15521v1 Abstract: As large language models (LLMs) continue to advance in linguistic capabilities, robust multilingual evaluation has become essential for promoting equitable technological progress. This position paper examines over 2,000 multilingual (non-English) benchmarks from 148 countries, published between 2021 and 2024, to evaluate past, present, and future practices in multilingual benchmarking. Our findings reveal tha...
2025-04-24, 22 min

Daily Paper Cast
InternVL3: Exploring Advanced Training and Test-Time Recipes for Open-Source Multimodal Models 🤗 Upvotes: 172 | cs.CV Authors: Jinguo Zhu, Weiyun Wang, Zhe Chen, Zhaoyang Liu, Shenglong Ye, Lixin Gu, Yuchen Duan, Hao Tian, Weijie Su, Jie Shao, Zhangwei Gao, Erfei Cui, Yue Cao, Yangzhou Liu, Xingguang Wei, Hongjie Zhang, Haomin Wang, Weiye Xu, Hao Li, Jiahao Wang, Dengnian Chen, Songze Li, Yinan He, Tan Jiang, Jiapeng Luo, Yi Wang, Conghui He, Botian Shi, Xingcheng Zhang, Wenqi Shao, Junjun He, Yingtong Xiong, Wenwen Qu, Peng Sun, Penglong Jiao, Han Lv, Lijun Wu, Kaipeng Zhang, Huipeng Deng, Jiaye Ge, Kai Chen, Limin Wang, Min Dou, Lewei Lu, Xizhou Zhu, Tong Lu, Dah...
2025-04-16, 22 min

Daily Paper Cast
Seaweed-7B: Cost-Effective Training of Video Generation Foundation Model 🤗 Upvotes: 83 | cs.CV, cs.AI Authors: Team Seawead, Ceyuan Yang, Zhijie Lin, Yang Zhao, Shanchuan Lin, Zhibei Ma, Haoyuan Guo, Hao Chen, Lu Qi, Sen Wang, Feng Cheng, Feilong Zuo Xuejiao Zeng, Ziyan Yang, Fangyuan Kong, Zhiwu Qing, Fei Xiao, Meng Wei, Tuyen Hoang, Siyu Zhang, Peihao Zhu, Qi Zhao, Jiangqiao Yan, Liangke
Gui, Sheng Bi, Jiashi Li, Yuxi Ren, Rui Wang, Huixia Li, Xuefeng Xiao, Shu Liu, Feng Ling, Heng Zhang, Houmin Wei, Huafeng Kuang, Jerry Duncan, Junda Zhang, Junru Zheng, Li Sun, Manlin Zhang, Renfei Sun, Xiaobin Zhuang, Xiaojie Li, Xin Xia, Xuyan Chi, Yan...
2025-04-15, 22 min

Daily Paper Cast
Kimi-VL Technical Report 🤗 Upvotes: 71 | cs.CV Authors: Kimi Team, Angang Du, Bohong Yin, Bowei Xing, Bowen Qu, Bowen Wang, Cheng Chen, Chenlin Zhang, Chenzhuang Du, Chu Wei, Congcong Wang, Dehao Zhang, Dikang Du, Dongliang Wang, Enming Yuan, Enzhe Lu, Fang Li, Flood Sung, Guangda Wei, Guokun Lai, Han Zhu, Hao Ding, Hao Hu, Hao Yang, Hao Zhang, Haoning Wu, Haotian Yao, Haoyu Lu, Heng Wang, Hongcheng Gao, Huabin Zheng, Jiaming Li, Jianlin Su, Jianzhou Wang, Jiaqi Deng, Jiezhong Qiu, Jin Xie, Jinhong Wang, Jingyuan Liu, Junjie Yan, Kun Ouyang, Liang Chen, Lin Sui, Longhui Yu, Mengfan Dong, Mengnan Dong, Nuo...
2025-04-12, 23 min

Daily Paper Cast
A Unified Agentic Framework for Evaluating Conditional Image Generation 🤗 Upvotes: 25 | cs.CV, cs.CL Authors: Jifang Wang, Xue Yang, Longyue Wang, Zhenran Xu, Yiyu Wang, Yaowei Wang, Weihua Luo, Kaifu Zhang, Baotian Hu, Min Zhang Title: A Unified Agentic Framework for Evaluating Conditional Image Generation Arxiv: http://arxiv.org/abs/2504.07046v1 Abstract: Conditional image generation has gained significant attention for its ability to personalize content. However, the field faces challenges in developing task-agnostic, reliable, and explainable evaluation metrics. This paper introduces CIGEval, a unified agentic framework for comprehensive evaluation of conditional image generation tasks.
CIGEval utilizes lar...
2025-04-11, 21 min

Daily Paper Cast
COIG-P: A High-Quality and Large-Scale Chinese Preference Dataset for Alignment with Human Values 🤗 Upvotes: 36 | cs.CL Authors: M-A-P Team, Siwei Wu, Jincheng Ren, Xinrun Du, Shuyue Guo, Xingwei Qu, Yiming Liang, Jie Liu, Yunwen Li, Tianyu Zheng, Boyu Feng, Huaqing Yuan, Zenith Wang, Jiaheng Liu, Wenhao Huang, Chenglin Cai, Haoran Que, Jian Yang, Yuelin Bai, Zekun Moore Wang, Zhouliang Yu, Qunshu Lin, Ding Pan, Yuchen Jiang, Tiannan Wang, Wangchunshu Zhou, Shenzhi Wang, Xingyuan Bu, Minghao Liu, Guoyin Wang, Ge Zhang, Chenghua Lin Title: COIG-P: A High-Quality and Large-Scale Chinese Preference Dataset for Alignment with Human Values Arxiv: http://arxiv.org/abs/2504.05535v1 A...
2025-04-10, 21 min

Daily Paper Cast
Advances and Challenges in Foundation Agents: From Brain-Inspired Intelligence to Evolutionary, Collaborative, and Safe Systems 🤗 Upvotes: 98 | cs.AI Authors: Bang Liu, Xinfeng Li, Jiayi Zhang, Jinlin Wang, Tanjin He, Sirui Hong, Hongzhang Liu, Shaokun Zhang, Kaitao Song, Kunlun Zhu, Yuheng Cheng, Suyuchen Wang, Xiaoqiang Wang, Yuyu Luo, Haibo Jin, Peiyan Zhang, Ollie Liu, Jiaqi Chen, Huan Zhang, Zhaoyang Yu, Haochen Shi, Boyan Li, Dekun Wu, Fengwei Teng, Xiaojun Jia, Jiawei Xu, Jinyu Xiang, Yizhang Lin, Tianming Liu, Tongliang Liu, Yu Su, Huan Sun, Glen Berseth, Jianyun Nie, Ian Foster, Logan Ward, Qingyun Wu, Yu Gu, Mingchen Zhuge, Xiangru Tang, Haohan Wang, Jiaxuan You, Chi Wang, Jian Pei, Qiang Yang, Xiaoliang Qi, Che...
2025-04-05, 20 min

Daily Paper Cast
EgoLife: Towards Egocentric Life Assistant 🤗 Upvotes: 21 | cs.CV Authors: Jingkang Yang, Shuai Liu, Hongming Guo, Yuhao Dong, Xiamengwei Zhang, Sicheng Zhang, Pengyun Wang, Zitang Zhou, Binzhu Xie, Ziyue Wang, Bei Ouyang, Zhengyu Lin, Marco Cominelli, Zhongang Cai, Yuanhan Zhang, Peiyuan Zhang, Fangzhou Hong, Joerg Widmer, Francesco Gringoli, Lei Yang, Bo Li, Ziwei Liu Title: EgoLife: Towards
Egocentric Life Assistant Arxiv: http://arxiv.org/abs/2503.03803v1 Abstract: We introduce EgoLife, a project to develop an egocentric life assistant that accompanies and enhances personal efficiency through AI-powered wearable glasses. To lay the foundation for thi...
2025-03-08, 22 min

Daily Paper Cast
Phi-4-Mini Technical Report: Compact yet Powerful Multimodal Language Models via Mixture-of-LoRAs 🤗 Upvotes: 42 | cs.CL, cs.AI, cs.LG Authors: Abdelrahman Abouelenin, Atabak Ashfaq, Adam Atkinson, Hany Awadalla, Nguyen Bach, Jianmin Bao, Alon Benhaim, Martin Cai, Vishrav Chaudhary, Congcong Chen, Dong Chen, Dongdong Chen, Junkun Chen, Weizhu Chen, Yen-Chun Chen, Yi-ling Chen, Qi Dai, Xiyang Dai, Ruchao Fan, Mei Gao, Min Gao, Amit Garg, Abhishek Goswami, Junheng Hao, Amr Hendy, Yuxuan Hu, Xin Jin, Mahmoud Khademi, Dongwoo Kim, Young Jin Kim, Gina Lee, Jinyu Li, Yunsheng Li, Chen Liang, Xihui Lin, Zeqi Lin, Mengchen Liu, Yang Liu, Gilsinia Lopez, Chong Luo, Piyush Madan, Vadim Mazalov, Ali Mousavi, Anh Ngu...
2025-03-05, 25 min

Daily Paper Cast
SuperGPQA: Scaling LLM Evaluation across 285 Graduate Disciplines 🤗 Upvotes: 81 | cs.CL Authors: M-A-P Team, Xinrun Du, Yifan Yao, Kaijing Ma, Bingli Wang, Tianyu Zheng, Kang Zhu, Minghao Liu, Yiming Liang, Xiaolong Jin, Zhenlin Wei, Chujie Zheng, Kaixing Deng, Shuyue Guo, Shian Jia, Sichao Jiang, Yiyan Liao, Rui Li, Qinrui Li, Sirun Li, Yizhi Li, Yunwen Li, Dehua Ma, Yuansheng Ni, Haoran Que, Qiyao Wang, Zhoufutu Wen, Siwei Wu, Tianshun Xing, Ming Xu, Zhenzhu Yang, Zekun Moore Wang, Junting Zhou, Yuelin Bai, Xingyuan Bu, Chenglin Cai, Liang Chen, Yifan Chen, Chengtuo Cheng, Tianhao Cheng, Keyi Ding, Siming Huang, Yun Huang, Yaoru Li, Yizhe Li, Zhaoqun Li...
2025-02-22, 24 min

Daily Paper Cast
Qwen2.5-VL Technical Report 🤗 Upvotes: 97 | cs.CV, cs.CL Authors: Shuai Bai, Keqin Chen, Xuejing Liu, Jialin Wang, Wenbin Ge, Sibo Song, Kai Dang, Peng Wang, Shijie Wang, Jun Tang, Humen Zhong,
Yuanzhi Zhu, Mingkun Yang, Zhaohai Li, Jianqiang Wan, Pengfei Wang, Wei Ding, Zheren Fu, Yiheng Xu, Jiabo Ye, Xi Zhang, Tianbao Xie, Zesen Cheng, Hang Zhang, Zhibo Yang, Haiyang Xu, Junyang Lin Title: Qwen2.5-VL Technical Report Arxiv: http://arxiv.org/abs/2502.13923v1 Abstract: We introduce Qwen2.5-VL, the latest flagship model of Qwen vision-language series, which demonstrates significant adv...
2025-02-21, 21 min

Daily Paper Cast
Autellix: An Efficient Serving Engine for LLM Agents as General Programs 🤗 Upvotes: 15 | cs.LG, cs.AI, cs.DC Authors: Michael Luo, Xiaoxiang Shi, Colin Cai, Tianjun Zhang, Justin Wong, Yichuan Wang, Chi Wang, Yanping Huang, Zhifeng Chen, Joseph E. Gonzalez, Ion Stoica Title: Autellix: An Efficient Serving Engine for LLM Agents as General Programs Arxiv: http://arxiv.org/abs/2502.13965v1 Abstract: Large language model (LLM) applications are evolving beyond simple chatbots into dynamic, general-purpose agentic programs, which scale LLM calls and output tokens to help AI agents reason, explore, and solve complex tasks. However, existing LLM serving systems ign...
2025-02-21, 22 min

Daily Paper Cast
Native Sparse Attention: Hardware-Aligned and Natively Trainable Sparse Attention 🤗 Upvotes: 68 | cs.CL, cs.AI, cs.LG Authors: Jingyang Yuan, Huazuo Gao, Damai Dai, Junyu Luo, Liang Zhao, Zhengyan Zhang, Zhenda Xie, Y. X. Wei, Lean Wang, Zhiping Xiao, Yuqing Wang, Chong Ruan, Ming Zhang, Wenfeng Liang, Wangding Zeng Title: Native Sparse Attention: Hardware-Aligned and Natively Trainable Sparse Attention Arxiv: http://arxiv.org/abs/2502.11089v1 Abstract: Long-context modeling is crucial for next-generation language models, yet the high computational cost of standard attention mechanisms poses significant computational challenges.
Sparse attention offers a promising direction for improving efficiency while mai...
2025-02-19, 23 min

Daily Paper Cast
The Danger of Overthinking: Examining the Reasoning-Action Dilemma in Agentic Tasks 🤗 Upvotes: 41 | cs.AI Authors: Alejandro Cuadron, Dacheng Li, Wenjie Ma, Xingyao Wang, Yichuan Wang, Siyuan Zhuang, Shu Liu, Luis Gaspar Schroeder, Tian Xia, Huanzhi Mao, Nicholas Thumiger, Aditya Desai, Ion Stoica, Ana Klimovic, Graham Neubig, Joseph E. Gonzalez Title: The Danger of Overthinking: Examining the Reasoning-Action Dilemma in Agentic Tasks Arxiv: http://arxiv.org/abs/2502.08235v1 Abstract: Large Reasoning Models (LRMs) represent a breakthrough in AI problem-solving capabilities, but their effectiveness in interactive environments can be limited. This paper introduces and analyzes overthinking in LRMs. A phenomenon whe...
2025-02-18, 27 min

Daily Paper Cast
Step-Video-T2V Technical Report: The Practice, Challenges, and Future of Video Foundation Model 🤗 Upvotes: 38 | cs.CV, cs.CL Authors: Guoqing Ma, Haoyang Huang, Kun Yan, Liangyu Chen, Nan Duan, Shengming Yin, Changyi Wan, Ranchen Ming, Xiaoniu Song, Xing Chen, Yu Zhou, Deshan Sun, Deyu Zhou, Jian Zhou, Kaijun Tan, Kang An, Mei Chen, Wei Ji, Qiling Wu, Wen Sun, Xin Han, Yanan Wei, Zheng Ge, Aojie Li, Bin Wang, Bizhu Huang, Bo Wang, Brian Li, Changxing Miao, Chen Xu, Chenfei Wu, Chenguang Yu, Dapeng Shi, Dingyuan Hu, Enle Liu, Gang Yu, Ge Yang, Guanzhe Huang, Gulin Yan, Haiyang Feng, Hao Nie, Haonan Jia, Hanpeng Hu, Hanqi Chen, Haolong Yan, Hen...
2025-02-18, 23 min

Daily Paper Cast
MM-RLHF: The Next Step Forward in Multimodal LLM Alignment 🤗 Upvotes: 22 | cs.CL, cs.CV Authors: Yi-Fan Zhang, Tao Yu, Haochen Tian, Chaoyou Fu, Peiyan Li, Jianshu Zeng, Wulin Xie, Yang Shi, Huanyu Zhang, Junkang Wu, Xue Wang, Yibo Hu, Bin Wen, Fan Yang, Zhang Zhang, Tingting Gao, Di Zhang, Liang Wang, Rong Jin, Tieniu Tan Title: MM-RLHF: The Next Step Forward in Multimodal LLM
Alignment Arxiv: http://arxiv.org/abs/2502.10391v1 Abstract: Despite notable advancements in Multimodal Large Language Models (MLLMs), most state-of-the-art models have not undergone thorough alignment with human preferences. This gap exists because cur...
2025-02-18, 23 min

Daily Paper Cast
EmbodiedBench: Comprehensive Benchmarking Multi-modal Large Language Models for Vision-Driven Embodied Agents 🤗 Upvotes: 20 | cs.AI, cs.CL, cs.CV Authors: Rui Yang, Hanyang Chen, Junyu Zhang, Mark Zhao, Cheng Qian, Kangrui Wang, Qineng Wang, Teja Venkat Koripella, Marziyeh Movahedi, Manling Li, Heng Ji, Huan Zhang, Tong Zhang Title: EmbodiedBench: Comprehensive Benchmarking Multi-modal Large Language Models for Vision-Driven Embodied Agents Arxiv: http://arxiv.org/abs/2502.09560v1 Abstract: Leveraging Multi-modal Large Language Models (MLLMs) to create embodied agents offers a promising avenue for tackling real-world tasks. While language-centric embodied agents have garnered substantial attention, MLLM-based embodied agents remain underexplored due to the...
2025-02-15, 21 min

Daily Paper Cast
Exploring the Potential of Encoder-free Architectures in 3D LMMs 🤗 Upvotes: 17 | cs.CV, cs.AI, cs.CL Authors: Yiwen Tang, Zoey Guo, Zhuhao Wang, Ray Zhang, Qizhi Chen, Junli Liu, Delin Qu, Zhigang Wang, Dong Wang, Xuelong Li, Bin Zhao Title: Exploring the Potential of Encoder-free Architectures in 3D LMMs Arxiv: http://arxiv.org/abs/2502.09620v1 Abstract: Encoder-free architectures have been preliminarily explored in the 2D visual domain, yet it remains an open question whether they can be effectively applied to 3D understanding scenarios.
In this paper, we present the first comprehensive investigation into the potential of enc...
2025-02-15, 18 min

Daily Paper Cast
CoSER: Coordinating LLM-Based Persona Simulation of Established Roles 🤗 Upvotes: 16 | cs.CL, cs.AI Authors: Xintao Wang, Heng Wang, Yifei Zhang, Xinfeng Yuan, Rui Xu, Jen-tse Huang, Siyu Yuan, Haoran Guo, Jiangjie Chen, Wei Wang, Yanghua Xiao, Shuchang Zhou Title: CoSER: Coordinating LLM-Based Persona Simulation of Established Roles Arxiv: http://arxiv.org/abs/2502.09082v1 Abstract: Role-playing language agents (RPLAs) have emerged as promising applications of large language models (LLMs). However, simulating established characters presents a challenging task for RPLAs, due to the lack of authentic character datasets and nuanced evaluation methods using such data. In this pap...
2025-02-15, 22 min

Daily Paper Cast
TextAtlas5M: A Large-scale Dataset for Dense Text Image Generation 🤗 Upvotes: 35 | cs.CV Authors: Alex Jinpeng Wang, Dongxing Mao, Jiawei Zhang, Weiming Han, Zhuobai Dong, Linjie Li, Yiqi Lin, Zhengyuan Yang, Libo Qin, Fuwei Zhang, Lijuan Wang, Min Li Title: TextAtlas5M: A Large-scale Dataset for Dense Text Image Generation Arxiv: http://arxiv.org/abs/2502.07870v1 Abstract: Text-conditioned image generation has gained significant attention in recent years and are processing increasingly longer and comprehensive text prompt.
In everyday life, dense and intricate text appears in contexts like advertisements, infographics, and signage, where the integration of both text and...
2025-02-14, 18 min

Daily Paper Cast
CineMaster: A 3D-Aware and Controllable Framework for Cinematic Text-to-Video Generation 🤗 Upvotes: 29 | cs.CV Authors: Qinghe Wang, Yawen Luo, Xiaoyu Shi, Xu Jia, Huchuan Lu, Tianfan Xue, Xintao Wang, Pengfei Wan, Di Zhang, Kun Gai Title: CineMaster: A 3D-Aware and Controllable Framework for Cinematic Text-to-Video Generation Arxiv: http://arxiv.org/abs/2502.08639v1 Abstract: In this work, we present CineMaster, a novel framework for 3D-aware and controllable text-to-video generation. Our goal is to empower users with comparable controllability as professional film directors: precise placement of objects within the scene, flexible manipulation of both objects and camera in 3D space, and...
2025-02-14, 23 min

Daily Paper Cast
Enhance-A-Video: Better Generated Video for Free 🤗 Upvotes: 14 | cs.CV Authors: Yang Luo, Xuanlei Zhao, Mengzhao Chen, Kaipeng Zhang, Wenqi Shao, Kai Wang, Zhangyang Wang, Yang You Title: Enhance-A-Video: Better Generated Video for Free Arxiv: http://arxiv.org/abs/2502.07508v1 Abstract: DiT-based video generation has achieved remarkable results, but research into enhancing existing models remains relatively unexplored. In this work, we introduce a training-free approach to enhance the coherence and quality of DiT-based generated videos, named Enhance-A-Video. The core idea is enhancing the cross-frame correlations based on non-diagonal temporal attention distributions.
Thanks to its sim...
2025-02-13 | 20 min

Daily Paper Cast
ScoreFlow: Mastering LLM Agent Workflows via Score-based Preference Optimization
🤗 Upvotes: 12 | cs.CL
Authors: Yinjie Wang, Ling Yang, Guohao Li, Mengdi Wang, Bryon Aragam
Title: ScoreFlow: Mastering LLM Agent Workflows via Score-based Preference Optimization
Arxiv: http://arxiv.org/abs/2502.04306v1
Abstract: Recent research has leveraged large language model multi-agent systems for complex problem-solving while trying to reduce the manual effort required to build them, driving the development of automated agent workflow optimization methods. However, existing methods remain inflexible due to representational limitations, a lack of adaptability, and poor scalability when relying on discrete optimization techniques. We address the...
2025-02-08 | 20 min

Daily Paper Cast
Process Reinforcement through Implicit Rewards
🤗 Upvotes: 44 | cs.LG, cs.AI, cs.CL
Authors: Ganqu Cui, Lifan Yuan, Zefan Wang, Hanbin Wang, Wendi Li, Bingxiang He, Yuchen Fan, Tianyu Yu, Qixin Xu, Weize Chen, Jiarui Yuan, Huayu Chen, Kaiyan Zhang, Xingtai Lv, Shuo Wang, Yuan Yao, Xu Han, Hao Peng, Yu Cheng, Zhiyuan Liu, Maosong Sun, Bowen Zhou, Ning Ding
Title: Process Reinforcement through Implicit Rewards
Arxiv: http://arxiv.org/abs/2502.01456v1
Abstract: Dense process rewards have proven a more effective alternative to the sparse outcome-level rewards in the inference-time scaling of large lan...
2025-02-05 | 21 min

Daily Paper Cast
AlignVLM: Bridging Vision and Language Latent Spaces for Multimodal Understanding
🤗 Upvotes: 25 | cs.CL
Authors: Ahmed Masry, Juan A. Rodriguez, Tianyu Zhang, Suyuchen Wang, Chao Wang, Aarash Feizi, Akshay Kalkunte Suresh, Abhay Puri, Xiangru Jian, Pierre-André Noël, Sathwik Tejaswi Madhusudhan, Marco Pedersoli, Bang Liu, Nicolas Chapados, Yoshua Bengio, Enamul Hoque, Christopher Pal, Issam H.
Laradji, David Vazquez, Perouz Taslakian, Spandana Gella, Sai Rajeswar
Title: AlignVLM: Bridging Vision and Language Latent Spaces for Multimodal Understanding
Arxiv: http://arxiv.org/abs/2502.01341v1
Abstract: Aligning visual features with language embeddings is a key challenge in vision-language models (VLMs). The performance of su...
2025-02-05 | 23 min

Daily Paper Cast
SafeRAG: Benchmarking Security in Retrieval-Augmented Generation of Large Language Model
🤗 Upvotes: 25 | cs.CR, cs.AI, cs.IR
Authors: Xun Liang, Simin Niu, Zhiyu Li, Sensen Zhang, Hanyu Wang, Feiyu Xiong, Jason Zhaoxin Fan, Bo Tang, Shichao Song, Mengwei Wang, Jiawei Yang
Title: SafeRAG: Benchmarking Security in Retrieval-Augmented Generation of Large Language Model
Arxiv: http://arxiv.org/abs/2501.18636v1
Abstract: The indexing-retrieval-generation paradigm of retrieval-augmented generation (RAG) has been highly successful in solving knowledge-intensive tasks by integrating external knowledge into large language models (LLMs). However, the incorporation of external and unverified knowledge increases the vulnerability of LLMs because att...
2025-02-05 | 23 min

Daily Paper Cast
Thoughts Are All Over the Place: On the Underthinking of o1-Like LLMs
🤗 Upvotes: 22 | cs.CL
Authors: Yue Wang, Qiuzhi Liu, Jiahao Xu, Tian Liang, Xingyu Chen, Zhiwei He, Linfeng Song, Dian Yu, Juntao Li, Zhuosheng Zhang, Rui Wang, Zhaopeng Tu, Haitao Mi, Dong Yu
Title: Thoughts Are All Over the Place: On the Underthinking of o1-Like LLMs
Arxiv: http://arxiv.org/abs/2501.18585v1
Abstract: Large language models (LLMs) such as OpenAI's o1 have demonstrated remarkable abilities in complex reasoning tasks by scaling test-time compute and exhibiting human-like deep thinking.
However, we identify a phenomenon we term underthinking, where o1...
2025-02-01 | 23 min

Daily Paper Cast
Emilia: A Large-Scale, Extensive, Multilingual, and Diverse Dataset for Speech Generation
🤗 Upvotes: 11 | cs.SD, cs.CL, eess.AS
Authors: Haorui He, Zengqiang Shang, Chaoren Wang, Xuyuan Li, Yicheng Gu, Hua Hua, Liwei Liu, Chen Yang, Jiaqi Li, Peiyang Shi, Yuancheng Wang, Kai Chen, Pengyuan Zhang, Zhizheng Wu
Title: Emilia: A Large-Scale, Extensive, Multilingual, and Diverse Dataset for Speech Generation
Arxiv: http://arxiv.org/abs/2501.15907v1
Abstract: Recent advancements in speech generation have been driven by the large-scale training datasets. However, current models fall short of capturing the spontaneity and variability inherent in real-world human speech, due to their rel...
2025-01-29 | 22 min

Daily Paper Cast
Humanity's Last Exam
🤗 Upvotes: 33 | cs.LG, cs.AI, cs.CL
Authors: Long Phan, Alice Gatti, Ziwen Han, Nathaniel Li, Josephina Hu, Hugh Zhang, Sean Shi, Michael Choi, Anish Agrawal, Arnav Chopra, Adam Khoja, Ryan Kim, Jason Hausenloy, Oliver Zhang, Mantas Mazeika, Daron Anderson, Tung Nguyen, Mobeen Mahmood, Fiona Feng, Steven Y. Feng, Haoran Zhao, Michael Yu, Varun Gangal, Chelsea Zou, Zihan Wang, Jessica P.
Wang, Pawan Kumar, Oleksandr Pokutnyi, Robert Gerbicz, Serguei Popov, John-Clark Levin, Mstyslav Kazakov, Johannes Schmitt, Geoff Galgon, Alvaro Sanchez, Yongki Lee, Will Yeadon, Scott Sauers, Marc Roth, Chidozie Agu, Søren Riis, Fabian Giska, Saiteja Utpa...
2025-01-28 | 22 min

Daily Paper Cast
Improving Video Generation with Human Feedback
🤗 Upvotes: 30 | cs.CV, cs.AI, cs.GR, cs.LG
Authors: Jie Liu, Gongye Liu, Jiajun Liang, Ziyang Yuan, Xiaokun Liu, Mingwu Zheng, Xiele Wu, Qiulin Wang, Wenyu Qin, Menghan Xia, Xintao Wang, Xiaohong Liu, Fei Yang, Pengfei Wan, Di Zhang, Kun Gai, Yujiu Yang, Wanli Ouyang
Title: Improving Video Generation with Human Feedback
Arxiv: http://arxiv.org/abs/2501.13918v1
Abstract: Video generation has achieved significant advances through rectified flow techniques, but issues like unsmooth motion and misalignment between videos and prompts persist. In this work, we develop a s...
2025-01-25 | 24 min

Daily Paper Cast
Temporal Preference Optimization for Long-Form Video Understanding
🤗 Upvotes: 15 | cs.CV, cs.AI, cs.CL, cs.LG, cs.RO
Authors: Rui Li, Xiaohan Wang, Yuhui Zhang, Zeyu Wang, Serena Yeung-Levy
Title: Temporal Preference Optimization for Long-Form Video Understanding
Arxiv: http://arxiv.org/abs/2501.13919v1
Abstract: Despite significant advancements in video large multimodal models (video-LMMs), achieving effective temporal grounding in long-form videos remains a challenge for existing models. To address this limitation, we propose Temporal Preference Optimization (TPO), a novel post-training framework designed to enhance the temporal grounding capabilities of video-LMMs through preference learning.
TPO adopts a s...
2025-01-25 | 24 min

Daily Paper Cast
One-Prompt-One-Story: Free-Lunch Consistent Text-to-Image Generation Using a Single Prompt
🤗 Upvotes: 5 | cs.CV, cs.AI, cs.LG
Authors: Tao Liu, Kai Wang, Senmao Li, Joost van de Weijer, Fahad Shahbaz Khan, Shiqi Yang, Yaxing Wang, Jian Yang, Ming-Ming Cheng
Title: One-Prompt-One-Story: Free-Lunch Consistent Text-to-Image Generation Using a Single Prompt
Arxiv: http://arxiv.org/abs/2501.13554v1
Abstract: Text-to-image generation models can create high-quality images from input prompts. However, they struggle to support the consistent generation of identity-preserving requirements for storytelling. Existing approaches to this problem typically require extensive training in large datasets or additional modifications to the original mod...
2025-01-25 | 22 min

Daily Paper Cast
DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning
🤗 Upvotes: 109 | cs.CL, cs.AI, cs.LG
Authors: DeepSeek-AI, Daya Guo, Dejian Yang, Haowei Zhang, Junxiao Song, Ruoyu Zhang, Runxin Xu, Qihao Zhu, Shirong Ma, Peiyi Wang, Xiao Bi, Xiaokang Zhang, Xingkai Yu, Yu Wu, Z. F. Wu, Zhibin Gou, Zhihong Shao, Zhuoshu Li, Ziyi Gao, Aixin Liu, Bing Xue, Bingxuan Wang, Bochao Wu, Bei Feng, Chengda Lu, Chenggang Zhao, Chengqi Deng, Chenyu Zhang, Chong Ruan, Damai Dai, Deli Chen, Dongjie Ji, Erhang Li, Fangyun Lin, Fucong Dai, Fuli Luo, Guangbo Hao, Guanting Chen, Guowei Li, H.
Zhang, Han Bao, Hanwei Xu, Haocheng Wang, Honghui Ding, Hua...
2025-01-24 | 21 min

Daily Paper Cast
FilmAgent: A Multi-Agent Framework for End-to-End Film Automation in Virtual 3D Spaces
🤗 Upvotes: 43 | cs.CL, cs.GR, cs.MA
Authors: Zhenran Xu, Longyue Wang, Jifang Wang, Zhouyi Li, Senbao Shi, Xue Yang, Yiyu Wang, Baotian Hu, Jun Yu, Min Zhang
Title: FilmAgent: A Multi-Agent Framework for End-to-End Film Automation in Virtual 3D Spaces
Arxiv: http://arxiv.org/abs/2501.12909v1
Abstract: Virtual film production requires intricate decision-making processes, including scriptwriting, virtual cinematography, and precise actor positioning and actions. Motivated by recent advances in automated decision-making with language agent-based societies, this paper introduces FilmAgent, a novel LLM-based multi-agent collaborative framework for end...
2025-01-24 | 24 min

Daily Paper Cast
Kimi k1.5: Scaling Reinforcement Learning with LLMs
🤗 Upvotes: 39 | cs.AI, cs.LG
Authors: Kimi Team, Angang Du, Bofei Gao, Bowei Xing, Changjiu Jiang, Cheng Chen, Cheng Li, Chenjun Xiao, Chenzhuang Du, Chonghua Liao, Chuning Tang, Congcong Wang, Dehao Zhang, Enming Yuan, Enzhe Lu, Fengxiang Tang, Flood Sung, Guangda Wei, Guokun Lai, Haiqing Guo, Han Zhu, Hao Ding, Hao Hu, Hao Yang, Hao Zhang, Haotian Yao, Haotian Zhao, Haoyu Lu, Haoze Li, Haozhen Yu, Hongcheng Gao, Huabin Zheng, Huan Yuan, Jia Chen, Jianhang Guo, Jianlin Su, Jianzhou Wang, Jie Zhao, Jin Zhang, Jingyuan Liu, Junjie Yan, Junyan Wu, Lidong Shi, Ling Ye, Longhui Yu, Men...
2025-01-24 | 18 min

Daily Paper Cast
UI-TARS: Pioneering Automated GUI Interaction with Native Agents
🤗 Upvotes: 31 | cs.AI, cs.CL, cs.CV, cs.HC
Authors: Yujia Qin, Yining Ye, Junjie Fang, Haoming Wang, Shihao Liang, Shizuo Tian, Junda Zhang, Jiahao Li, Yunxin Li, Shijue Huang, Wanjun Zhong, Kuanye Li, Jiale Yang, Yu Miao, Woyu Lin, Longxiang Liu, Xu Jiang, Qianli Ma, Jingyu Li, Xiaojun Xiao, Kai Cai, Chuang Li, Yaowei Zheng,
Chaolin Jin, Chen Li, Xiao Zhou, Minchao Wang, Haoli Chen, Zhaojian Li, Haihua Yang, Haifeng Liu, Feng Lin, Tao Peng, Xin Liu, Guang Shi
Title: UI-TARS: Pioneering Automated GUI Interaction with Native Agents
Arxiv: htt...
2025-01-23 | 20 min

Daily Paper Cast
Mobile-Agent-E: Self-Evolving Mobile Assistant for Complex Tasks
🤗 Upvotes: 20 | cs.CL, cs.CV
Authors: Zhenhailong Wang, Haiyang Xu, Junyang Wang, Xi Zhang, Ming Yan, Ji Zhang, Fei Huang, Heng Ji
Title: Mobile-Agent-E: Self-Evolving Mobile Assistant for Complex Tasks
Arxiv: http://arxiv.org/abs/2501.11733v1
Abstract: Smartphones have become indispensable in modern life, yet navigating complex tasks on mobile devices often remains frustrating. Recent advancements in large multimodal model (LMM)-based mobile agents have demonstrated the ability to perceive and act in mobile environments. However, current approaches face significant limitations: they fall short in addressing real-world hum...
2025-01-23 | 23 min

Daily Paper Cast
Hunyuan3D 2.0: Scaling Diffusion Models for High Resolution Textured 3D Assets Generation
🤗 Upvotes: 16 | cs.CV
Authors: Zibo Zhao, Zeqiang Lai, Qingxiang Lin, Yunfei Zhao, Haolin Liu, Shuhui Yang, Yifei Feng, Mingxin Yang, Sheng Zhang, Xianghui Yang, Huiwen Shi, Sicong Liu, Junta Wu, Yihang Lian, Fan Yang, Ruining Tang, Zebin He, Xinzhou Wang, Jian Liu, Xuhui Zuo, Zhuo Chen, Biwen Lei, Haohan Weng, Jing Xu, Yiling Zhu, Xinhai Liu, Lixin Xu, Changrong Hu, Tianyu Huang, Lifu Wang, Jihong Zhang, Meng Chen, Liang Dong, Yiwen Jia, Yulin Cai, Jiaao Yu, Yixuan Tang, Hao Zhang, Zheng Ye, Peng He, Runzhou Wu, Chao Zhang, Yonghao Tan, Jie Xiao, Yangyu Tao, Jianchen Zhu, Jin...
2025-01-23 | 20 min

Daily Paper Cast
MinMo: A Multimodal Large Language Model for Seamless Voice Interaction
🤗 Upvotes: 21 | cs.CL, cs.AI, cs.HC, cs.SD, eess.AS
Authors: Qian Chen, Yafeng Chen, Yanni Chen, Mengzhe Chen, Yingda Chen, Chong Deng, Zhihao Du, Ruize Gao, Changfeng Gao, Zhifu
Gao, Yabin Li, Xiang Lv, Jiaqing Liu, Haoneng Luo, Bin Ma, Chongjia Ni, Xian Shi, Jialong Tang, Hui Wang, Hao Wang, Wen Wang, Yuxuan Wang, Yunlan Xu, Fan Yu, Zhijie Yan, Yexin Yang, Baosong Yang, Xian Yang, Guanrou Yang, Tianyu Zhao, Qinglin Zhang, Shiliang Zhang, Nan Zhao, Pei Zhang, Chong Zhang, Jinren Zhou
Title: MinMo: A Multimodal Large Language Model for Seamless Voi...
2025-01-15 | 23 min

Daily Paper Cast
ConceptMaster: Multi-Concept Video Customization on Diffusion Transformer Models Without Test-Time Tuning
🤗 Upvotes: 10 | cs.CV
Authors: Yuzhou Huang, Ziyang Yuan, Quande Liu, Qiulin Wang, Xintao Wang, Ruimao Zhang, Pengfei Wan, Di Zhang, Kun Gai
Title: ConceptMaster: Multi-Concept Video Customization on Diffusion Transformer Models Without Test-Time Tuning
Arxiv: http://arxiv.org/abs/2501.04698v1
Abstract: Text-to-video generation has made remarkable advancements through diffusion models. However, Multi-Concept Video Customization (MCVC) remains a significant challenge. We identify two key challenges in this task: 1) the identity decoupling problem, where directly adopting existing customization methods inevitably mix attributes when handling multiple concepts simultaneously, and 2) the sca...
2025-01-14 | 23 min

Daily Paper Cast
MotionBench: Benchmarking and Improving Fine-grained Video Motion Understanding for Vision Language Models
🤗 Upvotes: 32 | cs.CV
Authors: Wenyi Hong, Yean Cheng, Zhuoyi Yang, Weihan Wang, Lefan Wang, Xiaotao Gu, Shiyu Huang, Yuxiao Dong, Jie Tang
Title: MotionBench: Benchmarking and Improving Fine-grained Video Motion Understanding for Vision Language Models
Arxiv: http://arxiv.org/abs/2501.02955v1
Abstract: In recent years, vision language models (VLMs) have made significant advancements in video understanding. However, a crucial capability - fine-grained motion comprehension - remains under-explored in current benchmarks.
To address this gap, we propose MotionBench, a comprehensive evaluation benchmark designed to assess the fine-grained motion com...
2025-01-09 | 22 min

Daily Paper Cast
Cosmos World Foundation Model Platform for Physical AI
🤗 Upvotes: 31 | cs.CV, cs.AI, cs.LG, cs.RO
Authors: NVIDIA, :, Niket Agarwal, Arslan Ali, Maciej Bala, Yogesh Balaji, Erik Barker, Tiffany Cai, Prithvijit Chattopadhyay, Yongxin Chen, Yin Cui, Yifan Ding, Daniel Dworakowski, Jiaojiao Fan, Michele Fenzi, Francesco Ferroni, Sanja Fidler, Dieter Fox, Songwei Ge, Yunhao Ge, Jinwei Gu, Siddharth Gururani, Ethan He, Jiahui Huang, Jacob Huffman, Pooya Jannaty, Jingyi Jin, Seung Wook Kim, Gergely Klár, Grace Lam, Shiyi Lan, Laura Leal-Taixe, Anqi Li, Zhaoshuo Li, Chen-Hsuan Lin, Tsung-Yi Lin, Huan Ling, Ming-Yu Liu, Xian Liu, Alice Luo, Qianli Ma, Hanzi Mao, Kaichun Mo, Arsalan Mous...
2025-01-09 | 25 min

Daily Paper Cast
TransPixar: Advancing Text-to-Video Generation with Transparency
🤗 Upvotes: 9 | cs.CV
Authors: Luozhou Wang, Yijun Li, Zhifei Chen, Jui-Hsien Wang, Zhifei Zhang, He Zhang, Zhe Lin, Yingcong Chen
Title: TransPixar: Advancing Text-to-Video Generation with Transparency
Arxiv: http://arxiv.org/abs/2501.03006v1
Abstract: Text-to-video generative models have made significant strides, enabling diverse applications in entertainment, advertising, and education. However, generating RGBA video, which includes alpha channels for transparency, remains a challenge due to limited datasets and the difficulty of adapting existing models.
Alpha channels are crucial for visual effects (VFX), allowing transparent elements like smoke and ref...
2025-01-08 | 22 min

Daily Paper Cast
Virgo: A Preliminary Exploration on Reproducing o1-like MLLM
🤗 Upvotes: 12 | cs.CV, cs.AI
Authors: Yifan Du, Zikang Liu, Yifan Li, Wayne Xin Zhao, Yuqi Huo, Bingning Wang, Weipeng Chen, Zheng Liu, Zhongyuan Wang, Ji-Rong Wen
Title: Virgo: A Preliminary Exploration on Reproducing o1-like MLLM
Arxiv: http://arxiv.org/abs/2501.01904v1
Abstract: Recently, slow-thinking reasoning systems, built upon large language models (LLMs), have garnered widespread attention by scaling the thinking time during inference. There is also growing interest in adapting this capability to multimodal large language models (MLLMs). Given that MLLMs handle more complex data sem...
2025-01-07 | 22 min

Daily Paper Cast
On the Compositional Generalization of Multimodal LLMs for Medical Imaging
🤗 Upvotes: 29 | cs.CV, cs.AI, cs.CL, cs.LG
Authors: Zhenyang Cai, Junying Chen, Rongsheng Wang, Weihong Wang, Yonglin Deng, Dingjie Song, Yize Chen, Zixu Zhang, Benyou Wang
Title: On the Compositional Generalization of Multimodal LLMs for Medical Imaging
Arxiv: http://arxiv.org/abs/2412.20070v1
Abstract: Multimodal large language models (MLLMs) hold significant potential in the medical field, but their capabilities are often limited by insufficient data in certain medical domains, highlighting the need for understanding what kinds of images can be used by MLLMs for generalization.
Cur...
2025-01-01 | 22 min

Daily Paper Cast
HuatuoGPT-o1, Towards Medical Complex Reasoning with LLMs
🤗 Upvotes: 53 | cs.CL, cs.AI, cs.LG
Authors: Junying Chen, Zhenyang Cai, Ke Ji, Xidong Wang, Wanlong Liu, Rongsheng Wang, Jianye Hou, Benyou Wang
Title: HuatuoGPT-o1, Towards Medical Complex Reasoning with LLMs
Arxiv: http://arxiv.org/abs/2412.18925v1
Abstract: The breakthrough of OpenAI o1 highlights the potential of enhancing reasoning to improve LLM. Yet, most research in reasoning has focused on mathematical tasks, leaving domains like medicine underexplored. The medical domain, though distinct from mathematics, also demands robust reasoning to provide reliable answers, given the high standards of...
2024-12-31 | 23 min

Daily Paper Cast
Task Preference Optimization: Improving Multimodal Large Language Models with Vision Task Alignment
🤗 Upvotes: 11 | cs.CV
Authors: Ziang Yan, Zhilin Li, Yinan He, Chenting Wang, Kunchang Li, Xinhao Li, Xiangyu Zeng, Zilei Wang, Yali Wang, Yu Qiao, Limin Wang, Yi Wang
Title: Task Preference Optimization: Improving Multimodal Large Language Models with Vision Task Alignment
Arxiv: http://arxiv.org/abs/2412.19326v1
Abstract: Current multimodal large language models (MLLMs) struggle with fine-grained or precise understanding of visuals though they give comprehensive perception and reasoning in a spectrum of vision applications.
Recent studies either develop tool-using or unify specific visual tasks into the aut...
2024-12-31 | 25 min

Daily Paper Cast
DepthLab: From Partial to Complete
🤗 Upvotes: 21 | cs.CV
Authors: Zhiheng Liu, Ka Leong Cheng, Qiuyu Wang, Shuzhe Wang, Hao Ouyang, Bin Tan, Kai Zhu, Yujun Shen, Qifeng Chen, Ping Luo
Title: DepthLab: From Partial to Complete
Arxiv: http://arxiv.org/abs/2412.18153v1
Abstract: Missing values remain a common challenge for depth data across its wide range of applications, stemming from various causes like incomplete data acquisition and perspective alteration. This work bridges this gap with DepthLab, a foundation depth inpainting model powered by image diffusion priors. Our model features two notable strengths: (1) it...
2024-12-26 | 22 min

Daily Paper Cast
MotiF: Making Text Count in Image Animation with Motion Focal Loss
🤗 Upvotes: 3 | cs.CV, cs.AI
Authors: Shijie Wang, Samaneh Azadi, Rohit Girdhar, Saketh Rambhatla, Chen Sun, Xi Yin
Title: MotiF: Making Text Count in Image Animation with Motion Focal Loss
Arxiv: http://arxiv.org/abs/2412.16153v1
Abstract: Text-Image-to-Video (TI2V) generation aims to generate a video from an image following a text description, which is also referred to as text-guided image animation. Most existing methods struggle to generate videos that align well with the text prompts, particularly when motion is specified.
To overcome this limitation, we introduce Mot...
2024-12-26 | 22 min

Daily Paper Cast
OpenAI o1 System Card
🤗 Upvotes: 12 | cs.AI
Authors: OpenAI, :, Aaron Jaech, Adam Kalai, Adam Lerer, Adam Richardson, Ahmed El-Kishky, Aiden Low, Alec Helyar, Aleksander Madry, Alex Beutel, Alex Carney, Alex Iftimie, Alex Karpenko, Alex Tachard Passos, Alexander Neitz, Alexander Prokofiev, Alexander Wei, Allison Tam, Ally Bennett, Ananya Kumar, Andre Saraiva, Andrea Vallone, Andrew Duberstein, Andrew Kondrich, Andrey Mishchenko, Andy Applebaum, Angela Jiang, Ashvin Nair, Barret Zoph, Behrooz Ghorbani, Ben Rossen, Benjamin Sokolowsky, Boaz Barak, Bob McGrew, Borys Minaiev, Botao Hao, Bowen Baker, Brandon Houghton, Brandon McKinzie, Brydon Eastman, Camillo Lugaresi, Cary Bassin, Cary Hudson, Chak Ming Li, Charles de Bourcy, Che...
2024-12-25 | 25 min

Daily Paper Cast
Outcome-Refining Process Supervision for Code Generation
🤗 Upvotes: 11 | cs.CL, cs.AI, cs.LG, cs.SE
Authors: Zhuohao Yu, Weizheng Gu, Yidong Wang, Zhengran Zeng, Jindong Wang, Wei Ye, Shikun Zhang
Title: Outcome-Refining Process Supervision for Code Generation
Arxiv: http://arxiv.org/abs/2412.15118v1
Abstract: Large Language Models have demonstrated remarkable capabilities in code generation, yet they often struggle with complex programming tasks that require deep algorithmic reasoning. While process supervision through learned reward models shows promise in guiding reasoning steps, it requires expensive training data and suffers from unreliable evaluation.
We propose Outcome-Refining Pro...
2024-12-25 | 21 min

Daily Paper Cast
LearnLM: Improving Gemini for Learning
🤗 Upvotes: 9 | cs.CY, cs.AI, cs.LG
Authors: LearnLM Team, Abhinit Modi, Aditya Srikanth Veerubhotla, Aliya Rysbek, Andrea Huber, Brett Wiltshire, Brian Veprek, Daniel Gillick, Daniel Kasenberg, Derek Ahmed, Irina Jurenka, James Cohan, Jennifer She, Julia Wilkowski, Kaiz Alarakyia, Kevin McKee, Lisa Wang, Markus Kunesch, Mike Schaekermann, Miruna Pîslar, Nikhil Joshi, Parsa Mahmoudieh, Paul Jhun, Sara Wiltberger, Shakir Mohamed, Shashank Agarwal, Shubham Milind Phal, Sun Jae Lee, Theofilos Strinopoulos, Wei-Jen Ko, Amy Wang, Ankit Anand, Avishkar Bhoopchand, Dan Wild, Divya Pandya, Filip Bar, Garth Graham, Holger Winnemoeller, Mahvish Nagda, Prateek Kolhar, Renee Schneider, Shaojian Zhu, Step...
2024-12-25 | 27 min

Daily Paper Cast
LeviTor: 3D Trajectory Oriented Image-to-Video Synthesis
🤗 Upvotes: 12 | cs.CV
Authors: Hanlin Wang, Hao Ouyang, Qiuyu Wang, Wen Wang, Ka Leong Cheng, Qifeng Chen, Yujun Shen, Limin Wang
Title: LeviTor: 3D Trajectory Oriented Image-to-Video Synthesis
Arxiv: http://arxiv.org/abs/2412.15214v1
Abstract: The intuitive nature of drag-based interaction has led to its growing adoption for controlling object trajectories in image-to-video synthesis. Still, existing methods that perform dragging in the 2D space usually face ambiguity when handling out-of-plane movements.
In this work, we augment the interaction with a new dimension, i.e., the depth dimension, suc...
2024-12-21 | 21 min

Daily Paper Cast
AniDoc: Animation Creation Made Easier
🤗 Upvotes: 29 | cs.CV
Authors: Yihao Meng, Hao Ouyang, Hanlin Wang, Qiuyu Wang, Wen Wang, Ka Leong Cheng, Zhiheng Liu, Yujun Shen, Huamin Qu
Title: AniDoc: Animation Creation Made Easier
Arxiv: http://arxiv.org/abs/2412.14173v1
Abstract: The production of 2D animation follows an industry-standard workflow, encompassing four essential stages: character design, keyframe animation, in-betweening, and coloring. Our research focuses on reducing the labor costs in the above process by harnessing the potential of increasingly powerful generative AI. Using video diffusion models as the foundation, AniDoc emerges as a v...
2024-12-20 | 22 min

Daily Paper Cast
SPaR: Self-Play with Tree-Search Refinement to Improve Instruction-Following in Large Language Models
🤗 Upvotes: 11 | cs.CL, cs.AI, cs.LG
Authors: Jiale Cheng, Xiao Liu, Cunxiang Wang, Xiaotao Gu, Yida Lu, Dan Zhang, Yuxiao Dong, Jie Tang, Hongning Wang, Minlie Huang
Title: SPaR: Self-Play with Tree-Search Refinement to Improve Instruction-Following in Large Language Models
Arxiv: http://arxiv.org/abs/2412.11605v1
Abstract: Instruction-following is a fundamental capability of language models, requiring the model to recognize even the most subtle requirements in the instructions and accurately reflect them in its output.
Such an ability is well-suited for and often optimized by preference lea...
2024-12-18 | 23 min

Daily Paper Cast
Apollo: An Exploration of Video Understanding in Large Multimodal Models
🤗 Upvotes: 91 | cs.CV, cs.AI
Authors: Orr Zohar, Xiaohan Wang, Yann Dubois, Nikhil Mehta, Tong Xiao, Philippe Hansen-Estruch, Licheng Yu, Xiaofang Wang, Felix Juefei-Xu, Ning Zhang, Serena Yeung-Levy, Xide Xia
Title: Apollo: An Exploration of Video Understanding in Large Multimodal Models
Arxiv: http://arxiv.org/abs/2412.10360v1
Abstract: Despite the rapid integration of video perception capabilities into Large Multimodal Models (LMMs), the underlying mechanisms driving their video understanding remain poorly understood. Consequently, many design decisions in this domain are made without proper justification or analysis. The high com...
2024-12-17 | 24 min

Daily Paper Cast
SynerGen-VL: Towards Synergistic Image Understanding and Generation with Vision Experts and Token Folding
🤗 Upvotes: 29 | cs.CV
Authors: Hao Li, Changyao Tian, Jie Shao, Xizhou Zhu, Zhaokai Wang, Jinguo Zhu, Wenhan Dou, Xiaogang Wang, Hongsheng Li, Lewei Lu, Jifeng Dai
Title: SynerGen-VL: Towards Synergistic Image Understanding and Generation with Vision Experts and Token Folding
Arxiv: http://arxiv.org/abs/2412.09604v1
Abstract: The remarkable success of Large Language Models (LLMs) has extended to the multimodal domain, achieving outstanding performance in image understanding and generation. Recent efforts to develop unified Multimodal Large Language Models (MLLMs) that integrate these capabilities have shown promising results.
How...
2024-12-17 | 25 min

Daily Paper Cast
Multimodal Latent Language Modeling with Next-Token Diffusion
🤗 Upvotes: 21 | cs.CL, cs.CV, cs.LG
Authors: Yutao Sun, Hangbo Bao, Wenhui Wang, Zhiliang Peng, Li Dong, Shaohan Huang, Jianyong Wang, Furu Wei
Title: Multimodal Latent Language Modeling with Next-Token Diffusion
Arxiv: http://arxiv.org/abs/2412.08635v1
Abstract: Multimodal generative models require a unified approach to handle both discrete data (e.g., text and code) and continuous data (e.g., image, audio, video). In this work, we propose Latent Language Modeling (LatentLM), which seamlessly integrates continuous and discrete data using causal Transformers. Specifically, we employ a var...
2024-12-14 | 22 min

Daily Paper Cast
AgentTrek: Agent Trajectory Synthesis via Guiding Replay with Web Tutorials
🤗 Upvotes: 16 | cs.CL
Authors: Yiheng Xu, Dunjie Lu, Zhennan Shen, Junli Wang, Zekun Wang, Yuchen Mao, Caiming Xiong, Tao Yu
Title: AgentTrek: Agent Trajectory Synthesis via Guiding Replay with Web Tutorials
Arxiv: http://arxiv.org/abs/2412.09605v1
Abstract: Graphical User Interface (GUI) agents hold great potential for automating complex tasks across diverse digital environments, from web applications to desktop software. However, the development of such agents is hindered by the lack of high-quality, multi-step trajectory data required for effective training. Existing approaches rely on expensive and labor-intensive hum...
2024-12-14 | 18 min

Daily Paper Cast
LiFT: Leveraging Human Feedback for Text-to-Video Model Alignment
🤗 Upvotes: 33 | cs.CV
Authors: Yibin Wang, Zhiyu Tan, Junyan Wang, Xiaomeng Yang, Cheng Jin, Hao Li
Title: LiFT: Leveraging Human Feedback for Text-to-Video Model Alignment
Arxiv: http://arxiv.org/abs/2412.04814v1
Abstract: Recent advancements in text-to-video (T2V) generative models have shown impressive capabilities.
However, these models are still inadequate in aligning synthesized videos with human preferences (e.g., accurately reflecting text descriptions), which is particularly difficult to address, as human preferences are inherently subjective and challenging to formalize as objective functions. Therefore, this paper proposes LiFT, a n...
2024-12-10 | 20 min

Daily Paper Cast
Aguvis: Unified Pure Vision Agents for Autonomous GUI Interaction
🤗 Upvotes: 32 | cs.CL
Authors: Yiheng Xu, Zekun Wang, Junli Wang, Dunjie Lu, Tianbao Xie, Amrita Saha, Doyen Sahoo, Tao Yu, Caiming Xiong
Title: Aguvis: Unified Pure Vision Agents for Autonomous GUI Interaction
Arxiv: http://arxiv.org/abs/2412.04454v1
Abstract: Graphical User Interfaces (GUIs) are critical to human-computer interaction, yet automating GUI tasks remains challenging due to the complexity and variability of visual environments. Existing approaches often rely on textual representations of GUIs, which introduce limitations in generalization, efficiency, and scalability. In this paper, we introduce Aguvis, a uni...
2024-12-08 | 20 min

Daily Paper Cast
Code-as-Monitor: Constraint-aware Visual Programming for Reactive and Proactive Robotic Failure Detection
🤗 Upvotes: 32 | cs.RO, cs.AI, cs.CV, cs.LG
Authors: Enshen Zhou, Qi Su, Cheng Chi, Zhizheng Zhang, Zhongyuan Wang, Tiejun Huang, Lu Sheng, He Wang
Title: Code-as-Monitor: Constraint-aware Visual Programming for Reactive and Proactive Robotic Failure Detection
Arxiv: http://arxiv.org/abs/2412.04455v1
Abstract: Automatic detection and prevention of open-set failures are crucial in closed-loop robotic systems. Recent studies often struggle to simultaneously identify unexpected failures reactively after they occur and prevent foreseeable ones proactively.
To this end, we propose Code-as-Monitor (CaM), a novel paradigm leveraging the...
2024-12-08 | 22 min

Daily Paper Cast
Material Anything: Generating Materials for Any 3D Object via Diffusion
🤗 Paper Upvotes: 33 | cs.CV, cs.GR
Authors: Xin Huang, Tengfei Wang, Ziwei Liu, Qing Wang
Title: Material Anything: Generating Materials for Any 3D Object via Diffusion
Arxiv: http://arxiv.org/abs/2411.15138v1
Abstract: We present Material Anything, a fully-automated, unified diffusion framework designed to generate physically-based materials for 3D objects. Unlike existing methods that rely on complex pipelines or case-specific optimizations, Material Anything offers a robust, end-to-end solution adaptable to objects under diverse lighting conditions. Our approach leverages a pre-trained image diffusion model, enhanced with a triple-head arc...
2024-11-27 | 21 min

Daily Paper Cast
Enhancing the Reasoning Ability of Multimodal Large Language Models via Mixed Preference Optimization
🤗 Paper Upvotes: 42 | cs.CL, cs.CV
Authors: Weiyun Wang, Zhe Chen, Wenhai Wang, Yue Cao, Yangzhou Liu, Zhangwei Gao, Jinguo Zhu, Xizhou Zhu, Lewei Lu, Yu Qiao, Jifeng Dai
Title: Enhancing the Reasoning Ability of Multimodal Large Language Models via Mixed Preference Optimization
Arxiv: http://arxiv.org/abs/2411.10442v1
Abstract: Existing open-source multimodal large language models (MLLMs) generally follow a training process involving pre-training and supervised fine-tuning. However, these models suffer from distribution shifts, which limit their multimodal reasoning, particularly in the Chain-of-Thought (CoT) performance.
To address thi...
2024-11-23 | 19 min

Daily Paper Cast
Marco-o1: Towards Open Reasoning Models for Open-Ended Solutions
🤗 Paper Upvotes: 23 | cs.CL
Authors: Yu Zhao, Huifeng Yin, Bo Zeng, Hao Wang, Tianqi Shi, Chenyang Lyu, Longyue Wang, Weihua Luo, Kaifu Zhang
Title: Marco-o1: Towards Open Reasoning Models for Open-Ended Solutions
Arxiv: http://arxiv.org/abs/2411.14405v1
Abstract: Currently OpenAI o1 has sparked a surge of interest in the study of large reasoning models (LRM). Building on this momentum, Marco-o1 not only focuses on disciplines with standard answers, such as mathematics, physics, and coding -- which are well-suited for reinforcement learning (RL) -- but also places gre...
2024-11-23 | 19 min

Daily Paper Cast
VBench++: Comprehensive and Versatile Benchmark Suite for Video Generative Models
🤗 Paper Upvotes: 23 | cs.CV
Authors: Ziqi Huang, Fan Zhang, Xiaojie Xu, Yinan He, Jiashuo Yu, Ziyue Dong, Qianli Ma, Nattapol Chanpaisit, Chenyang Si, Yuming Jiang, Yaohui Wang, Xinyuan Chen, Ying-Cong Chen, Limin Wang, Dahua Lin, Yu Qiao, Ziwei Liu
Title: VBench++: Comprehensive and Versatile Benchmark Suite for Video Generative Models
Arxiv: http://arxiv.org/abs/2411.13503v1
Abstract: Video generation has witnessed significant advancements, yet evaluating these models remains a challenge.
A comprehensive evaluation benchmark for video generation is indispensable for two reasons: 1) Existing metrics do not fully ali...
2024-11-22 | 24 min

Daily Paper Cast
SEAGULL: No-reference Image Quality Assessment for Regions of Interest via Vision-Language Instruction Tuning
🤗 Paper Upvotes: 5 | cs.CV
Authors: Zewen Chen, Juan Wang, Wen Wang, Sunhan Xu, Hang Xiong, Yun Zeng, Jian Guo, Shuxun Wang, Chunfeng Yuan, Bing Li, Weiming Hu
Title: SEAGULL: No-reference Image Quality Assessment for Regions of Interest via Vision-Language Instruction Tuning
Arxiv: http://arxiv.org/abs/2411.10161v1
Abstract: Existing Image Quality Assessment (IQA) methods achieve remarkable success in analyzing quality for overall image, but few works explore quality analysis for Regions of Interest (ROIs). The quality analysis of ROIs can provide fine-grained guidance for image quality improvement and...
2024-11-21 | 20 min

Daily Paper Cast
AnimateAnything: Consistent and Controllable Animation for Video Generation
🤗 Paper Upvotes: 12 | cs.CV
Authors: Guojun Lei, Chi Wang, Hong Li, Rong Zhang, Yikai Wang, Weiwei Xu
Title: AnimateAnything: Consistent and Controllable Animation for Video Generation
Arxiv: http://arxiv.org/abs/2411.10836v1
Abstract: We present a unified controllable video generation approach AnimateAnything that facilitates precise and consistent video manipulation across various conditions, including camera trajectories, text prompts, and user motion annotations. Specifically, we carefully design a multi-scale control feature fusion network to construct a common motion representation for different conditions.
It explicitly converts all control information into fra...
2024-11-20 | 22 min

Daily Paper Cast
LLaMA-Mesh: Unifying 3D Mesh Generation with Language Models
🤗 Paper Upvotes: 32 | cs.LG, cs.AI, cs.CL, cs.CV, 68T05, I.3.5; I.2.10; I.2.6
Authors: Zhengyi Wang, Jonathan Lorraine, Yikai Wang, Hang Su, Jun Zhu, Sanja Fidler, Xiaohui Zeng
Title: LLaMA-Mesh: Unifying 3D Mesh Generation with Language Models
Arxiv: http://arxiv.org/abs/2411.09595v1
Abstract: This work explores expanding the capabilities of large language models (LLMs) pretrained on text to generate 3D meshes within a unified model. This offers key advantages of (1) leveraging spatial knowledge already embedded in LLMs, derived from textual sources like 3D tutorials, and (2) ena...
2024-11-16 | 23 min

Daily Paper Cast
MagicQuill: An Intelligent Interactive Image Editing System
🤗 Paper Upvotes: 31 | cs.CV
Authors: Zichen Liu, Yue Yu, Hao Ouyang, Qiuyu Wang, Ka Leong Cheng, Wen Wang, Zhiheng Liu, Qifeng Chen, Yujun Shen
Title: MagicQuill: An Intelligent Interactive Image Editing System
Arxiv: http://arxiv.org/abs/2411.09703v1
Abstract: Image editing involves a variety of complex tasks and requires efficient and precise manipulation techniques. In this paper, we present MagicQuill, an integrated image editing system that enables swift actualization of creative ideas.
Our system features a streamlined yet functionally robust interface, allowing for the articulation of editing ope...
2024-11-16 | 20 min

Daily Paper Cast
HtmlRAG: HTML is Better Than Plain Text for Modeling Retrieved Knowledge in RAG Systems
🤗 Paper Upvotes: 34 | cs.IR
Authors: Jiejun Tan, Zhicheng Dou, Wen Wang, Mang Wang, Weipeng Chen, Ji-Rong Wen
Title: HtmlRAG: HTML is Better Than Plain Text for Modeling Retrieved Knowledge in RAG Systems
Arxiv: http://arxiv.org/abs/2411.02959v1
Abstract: Retrieval-Augmented Generation (RAG) has been shown to improve knowledge capabilities and alleviate the hallucination problem of LLMs. The Web is a major source of external knowledge used in RAG systems, and many commercial systems such as ChatGPT and Perplexity have used Web search engines as their major retrieval sys...
2024-11-07 | 21 min

Daily Paper Cast
DeeR-VLA: Dynamic Inference of Multimodal Large Language Models for Efficient Robot Execution
🤗 Paper Upvotes: 10 | cs.RO, cs.AI, cs.LG
Authors: Yang Yue, Yulin Wang, Bingyi Kang, Yizeng Han, Shenzhi Wang, Shiji Song, Jiashi Feng, Gao Huang
Title: DeeR-VLA: Dynamic Inference of Multimodal Large Language Models for Efficient Robot Execution
Arxiv: http://arxiv.org/abs/2411.02359v1
Abstract: MLLMs have demonstrated remarkable comprehension and reasoning capabilities with complex language and visual data. These advances have spurred the vision of establishing a generalist robotic MLLM proficient in understanding complex human instructions and accomplishing various embodied tasks.
However, developing MLLMs for real-world rob...
2024-11-07 | 19 min

Daily Paper Cast
Training-free Regional Prompting for Diffusion Transformers
🤗 Paper Upvotes: 19 | cs.CV
Authors: Anthony Chen, Jianjin Xu, Wenzhao Zheng, Gaole Dai, Yida Wang, Renrui Zhang, Haofan Wang, Shanghang Zhang
Title: Training-free Regional Prompting for Diffusion Transformers
Arxiv: http://arxiv.org/abs/2411.02395v1
Abstract: Diffusion models have demonstrated excellent capabilities in text-to-image generation. Their semantic understanding (i.e., prompt following) ability has also been greatly improved with large language models (e.g., T5, Llama). However, existing models cannot perfectly handle long and complex text prompts, especially when the text prompts contain various objects with numerous attributes and...
2024-11-06 | 17 min

Daily Paper Cast
Hunyuan-Large: An Open-Source MoE Model with 52 Billion Activated Parameters by Tencent
🤗 Paper Upvotes: 16 | cs.CL, cs.AI
Authors: Xingwu Sun, Yanfeng Chen, Yiqing Huang, Ruobing Xie, Jiaqi Zhu, Kai Zhang, Shuaipeng Li, Zhen Yang, Jonny Han, Xiaobo Shu, Jiahao Bu, Zhongzhi Chen, Xuemeng Huang, Fengzong Lian, Saiyong Yang, Jianfeng Yan, Yuyuan Zeng, Xiaoqin Ren, Chao Yu, Lulu Wu, Yue Mao, Jun Xia, Tao Yang, Suncong Zheng, Kan Wu, Dian Jiao, Jinbao Xue, Xipeng Zhang, Decheng Wu, Kai Liu, Dengpeng Wu, Guanghui Xu, Shaohua Chen, Shuang Chen, Xiao Feng, Yigeng Hong, Junqiang Zheng, Chengcheng Xu, Zongwei Li, Xiong Kuang, Jianglu Hu, Yiqi Chen, Yuchi Deng, Guiyang Li, Ao Liu...
2024-11-06 | 18 min

Daily Paper Cast
GenXD: Generating Any 3D and 4D Scenes
🤗 Paper Upvotes: 13 | cs.CV, cs.AI
Authors: Yuyang Zhao, Chung-Ching Lin, Kevin Lin, Zhiwen Yan, Linjie Li, Zhengyuan Yang, Jianfeng Wang, Gim Hee Lee, Lijuan Wang
Title: GenXD: Generating Any 3D and 4D Scenes
Arxiv: http://arxiv.org/abs/2411.02319v2
Abstract: Recent developments in 2D visual generation have been remarkably successful.
However, 3D and 4D generation remain challenging in real-world applications due to the lack of large-scale 4D data and effective model design. In this paper, we propose to jointly investigate general 3D and 4D generation by lev...
2024-11-06 | 22 min

Daily Paper Cast
Personalization of Large Language Models: A Survey
🤗 Paper Upvotes: 14 | cs.CL
Authors: Zhehao Zhang, Ryan A. Rossi, Branislav Kveton, Yijia Shao, Diyi Yang, Hamed Zamani, Franck Dernoncourt, Joe Barrow, Tong Yu, Sungchul Kim, Ruiyi Zhang, Jiuxiang Gu, Tyler Derr, Hongjie Chen, Junda Wu, Xiang Chen, Zichao Wang, Subrata Mitra, Nedim Lipka, Nesreen Ahmed, Yu Wang
Title: Personalization of Large Language Models: A Survey
Arxiv: http://arxiv.org/abs/2411.00027v1
Abstract: Personalization of Large Language Models (LLMs) has recently become increasingly important with a wide range of applications. Despite the importance and recent progress, most exi...
2024-11-05 | 25 min

Daily Paper Cast (Test)
BitStack: Fine-Grained Size Control for Compressed Large Language Models in Variable Memory Environments
🤗 Daily Paper Upvotes: 11
Authors: Xinghao Wang, Pengyu Wang, Bo Wang, Dong Zhang, Yunhua Zhou, Xipeng Qiu
Categories: cs.CL, cs.AI, cs.CV, cs.LG
Arxiv: http://arxiv.org/abs/2410.23918v1
Title: BitStack: Fine-Grained Size Control for Compressed Large Language Models in Variable Memory Environments
Abstract: Large language models (LLMs) have revolutionized numerous applications, yet their deployment remains challenged by memory constraints on local devices. While scaling laws have enhanced LLM capabilities, the primary bottleneck has shifted from capability to availability, emphasizing the need for efficient memory management. Traditional compression methods, such as quantization, often require predefined compression rat...
2024-11-03 | 17 min