Daily Papers

by AK and the research community

Dec 3

Submitted by

Zery

X-Prompt: Towards Universal In-Context Image Generation in Auto-Regressive Vision Language Foundation Models

·
9 authors

Submitted by

akhaliq

o1-Coder: an o1 Replication for Coding

·
7 authors

Submitted by

dbaranchuk

Switti: Designing Scale-Wise Transformers for Text-to-Image Synthesis

·
5 authors

Submitted by

LanguageBind

Open-Sora Plan: Open-Source Large Video Generation Model

·
24 authors

Submitted by

akhaliq

FLOAT: Generative Motion Latent Flow Matching for Audio-driven Talking Portrait

·
3 authors

Submitted by

wren93

VISTA: Enhancing Long-Duration and High-Resolution Video Understanding by Video Spatiotemporal Augmentation

·
5 authors

Submitted by

caizhongang

SOLAMI: Social Vision-Language-Action Modeling for Immersive Interaction with 3D Autonomous Characters

·
10 authors

Submitted by

HYeungLee

TAPTRv3: Spatial and Temporal Context Foster Robust Tracking of Any Point in Long Video

·
6 authors

Submitted by

kpzhang996

GATE OpenING: A Comprehensive Benchmark for Judging Open-ended Interleaved Image-Text Generation

·
18 authors

Submitted by

akhaliq

Efficient Track Anything

·
13 authors

Submitted by

mpatel57

Steering Rectified Flow Models in the Vector Field for Controlled Image Generation

·
4 authors

Submitted by

BK-Lee

VLsI: Verbalized Layers-to-Interactions from Large to Small Vision Language Models

·
5 authors

Submitted by

rubenohana

The Well: a Large-Scale Collection of Diverse Physics Simulations for Machine Learning

·
26 authors

Submitted by

horseee

TinyFusion: Diffusion Transformers Learned Shallow

·
4 authors

Submitted by

atcbosselut

INCLUDE: Evaluating Multilingual Language Understanding with Regional Knowledge

·
59 authors

Submitted by

BestWishYsh

WF-VAE: Enhancing Video VAE by Wavelet-Driven Energy Flow for Latent Video Diffusion Model

·
7 authors

Submitted by

Foreshhh

VLSBench: Unveiling Visual Leakage in Multimodal Safety

·
5 authors

Submitted by

Cakeyan

Long Video Diffusion Generation with Segmented Cross-Attention and Content-Rich Video Data Curation

·
6 authors

Submitted by

jomat

Art-Free Generative Models: Art Creation Without Graphic Art Knowledge

·
5 authors

Submitted by

ryokamoi

VisOnlyQA: Large Vision Language Models Still Struggle with Visual Perception of Geometric Information

·
5 authors

Submitted by

zhangysk

PhysGame: Uncovering Physical Commonsense Violations in Gameplay Videos

·
10 authors

Submitted by

yanxi-chen

A Simple and Provable Scaling Law for the Test-Time Compute of Large Language Models

·
5 authors

Submitted by

amanchadha

Exploring the Abilities of Large Language Models to Solve Proportional Analogies via Knowledge-Enhanced Prompting

·
8 authors

Submitted by

qihang

World-consistent Video Diffusion with Explicit 3D Modeling

·
7 authors

Submitted by

ftaioli

Collaborative Instance Navigation: Leveraging Agent Self-Dialogue to Minimize User Input

·
7 authors

Submitted by

hyzhou404

HUGSIM: A Real-Time, Photo-Realistic and Closed-Loop Simulator for Autonomous Driving

·
9 authors

Submitted by

cranial-xix

AMO Sampler: Enhancing Text Rendering with Overshooting

·
5 authors

Submitted by

amanchadha

Improving speaker verification robustness with synthetic emotional utterances

·
6 authors

Submitted by

callmesan

Towards Cross-Lingual Audio Abuse Detection in Low-Resource Settings with Few-Shot Learning

·
3 authors