Collections

Collections including paper arxiv:2508.21058

EVA-CLIP-18B: Scaling CLIP to 18 Billion Parameters

Paper • 2402.04252 • Published Feb 6, 2024 • 29
Vision Superalignment: Weak-to-Strong Generalization for Vision Foundation Models

Paper • 2402.03749 • Published Feb 6, 2024 • 13
ScreenAI: A Vision-Language Model for UI and Infographics Understanding

Paper • 2402.04615 • Published Feb 7, 2024 • 45
EfficientViT-SAM: Accelerated Segment Anything Model Without Performance Loss

Paper • 2402.05008 • Published Feb 7, 2024 • 24

Describe What You See with Multimodal Large Language Models to Enhance Video Recommendations

Paper • 2508.09789 • Published 25 days ago • 5
MM-BrowseComp: A Comprehensive Benchmark for Multimodal Browsing Agents

Paper • 2508.13186 • Published 24 days ago • 17
ZARA: Zero-shot Motion Time-Series Analysis via Knowledge and Retrieval Driven LLM Agents

Paper • 2508.04038 • Published Aug 6 • 1
Prompt Orchestration Markup Language

Paper • 2508.13948 • Published 19 days ago • 48

Representation & Optimization

Understanding representation sheds light on optimization

Nuclear Norm Regularization for Deep Learning

Paper • 2405.14544 • Published May 23, 2024 • 1
Token embeddings violate the manifold hypothesis

Paper • 2504.01002 • Published Apr 1 • 1
Approximate Nullspace Augmented Finetuning for Robust Vision Transformers

Paper • 2403.10476 • Published Mar 15, 2024 • 1
ElaLoRA: Elastic & Learnable Low-Rank Adaptation for Efficient Model Fine-Tuning

Paper • 2504.00254 • Published Mar 31 • 1

GPT4Motion: Scripting Physical Motions in Text-to-Video Generation via Blender-Oriented GPT Planning

Paper • 2311.12631 • Published Nov 21, 2023 • 15
DeepSeekMoE: Towards Ultimate Expert Specialization in Mixture-of-Experts Language Models

Paper • 2401.06066 • Published Jan 11, 2024 • 56
VideoScene: Distilling Video Diffusion Model to Generate 3D Scenes in One Step

Paper • 2504.01956 • Published Apr 2 • 41
UrbanLLaVA: A Multi-modal Large Language Model for Urban Intelligence with Spatial Reasoning and Understanding

Paper • 2506.23219 • Published Jun 29 • 7

Mixture of Contexts for Long Video Generation

Paper • 2508.21058 • Published 9 days ago • 30

Omni-Effects: Unified and Spatially-Controllable Visual Effects Generation

Paper • 2508.07981 • Published 27 days ago • 58
CharacterShot: Controllable and Consistent 4D Character Animation

Paper • 2508.07409 • Published 28 days ago • 38
ToonComposer: Streamlining Cartoon Production with Generative Post-Keyframing

Paper • 2508.10881 • Published 23 days ago • 51
Puppeteer: Rig and Animate Your 3D Models

Paper • 2508.10898 • Published 23 days ago • 30

WorldDreamer: Towards General World Models for Video Generation via Predicting Masked Tokens

Paper • 2401.09985 • Published Jan 18, 2024 • 18
CustomVideo: Customizing Text-to-Video Generation with Multiple Subjects

Paper • 2401.09962 • Published Jan 18, 2024 • 9
Inflation with Diffusion: Efficient Temporal Adaptation for Text-to-Video Super-Resolution

Paper • 2401.10404 • Published Jan 18, 2024 • 11
ActAnywhere: Subject-Aware Video Background Generation

Paper • 2401.10822 • Published Jan 19, 2024 • 13
