shankars's Collections
AI-paper
Describe What You See with Multimodal Large Language Models to Enhance Video Recommendations
Paper
•
2508.09789
•
Published
•
5
MM-BrowseComp: A Comprehensive Benchmark for Multimodal Browsing Agents
Paper
•
2508.13186
•
Published
•
17
ZARA: Zero-shot Motion Time-Series Analysis via Knowledge and Retrieval Driven LLM Agents
Paper
•
2508.04038
•
Published
•
1
Prompt Orchestration Markup Language
Paper
•
2508.13948
•
Published
•
48
MultiRef: Controllable Image Generation with Multiple Visual References
Paper
•
2508.06905
•
Published
•
21
LongSplat: Robust Unposed 3D Gaussian Splatting for Casual Long Videos
Paper
•
2508.14041
•
Published
•
57
Chain-of-Agents: End-to-End Agent Foundation Models via Multi-Agent Distillation and Agentic RL
Paper
•
2508.13167
•
Published
•
123
Atom-Searcher: Enhancing Agentic Deep Research via Fine-Grained Atomic Thought Reward
Paper
•
2508.12800
•
Published
•
5
Copyright Protection for Large Language Models: A Survey of Methods, Challenges, and Trends
Paper
•
2508.11548
•
Published
•
5
Evaluating Podcast Recommendations with Profile-Aware LLM-as-a-Judge
Paper
•
2508.08777
•
Published
•
15
Training-Free Text-Guided Color Editing with Multi-Modal Diffusion Transformer
Paper
•
2508.09131
•
Published
•
16
MCP-Universe: Benchmarking Large Language Models with Real-World Model Context Protocol Servers
Paper
•
2508.14704
•
Published
•
42
From AI for Science to Agentic Science: A Survey on Autonomous Scientific Discovery
Paper
•
2508.14111
•
Published
•
32
RynnEC: Bringing MLLMs into Embodied World
Paper
•
2508.14160
•
Published
•
18
Perception, Reason, Think, and Plan: A Survey on Large Multimodal Reasoning Models
Paper
•
2505.04921
•
Published
•
186
Evolving Deeper LLM Thinking
Paper
•
2501.09891
•
Published
•
116
A Survey on Large Language Model Benchmarks
Paper
•
2508.15361
•
Published
•
18
Deep Think with Confidence
Paper
•
2508.15260
•
Published
•
81
ReFocus: Visual Editing as a Chain of Thought for Structured Image Understanding
Paper
•
2501.05452
•
Published
•
15
VisuLogic: A Benchmark for Evaluating Visual Reasoning in Multi-modal Large Language Models
Paper
•
2504.15279
•
Published
•
75
Whiteboard-of-Thought: Thinking Step-by-Step Across Modalities
Paper
•
2406.14562
•
Published
•
29
LlamaV-o1: Rethinking Step-by-step Visual Reasoning in LLMs
Paper
•
2501.06186
•
Published
•
66
Thinking with Generated Images
Paper
•
2505.22525
•
Published
•
15
ChartMuseum: Testing Visual Reasoning Capabilities of Large Vision-Language Models
Paper
•
2505.13444
•
Published
•
16
We-Math: Does Your Large Multimodal Model Achieve Human-like Mathematical Reasoning?
Paper
•
2407.01284
•
Published
•
82
ComposeAnything: Composite Object Priors for Text-to-Image Generation
Paper
•
2505.24086
•
Published
•
5
Thinking with Images for Multimodal Reasoning: Foundations, Methods, and Future Frontiers
Paper
•
2506.23918
•
Published
•
86
Visual Planning: Let's Think Only with Images
Paper
•
2505.11409
•
Published
•
57
Multimodal Self-Instruct: Synthetic Abstract Image and Visual Reasoning Instruction Using Language Model
Paper
•
2407.07053
•
Published
•
48
HYDRA: A Hyper Agent for Dynamic Compositional Visual Reasoning
Paper
•
2403.12884
•
Published
•
1
CameraBench: Benchmarking Visual Reasoning in MLLMs via Photography
Paper
•
2504.10090
•
Published
Visual Programming: Compositional visual reasoning without training
Paper
•
2211.11559
•
Published
•
1
ExoViP: Step-by-step Verification and Exploration with Exoskeleton Modules for Compositional Visual Reasoning
Paper
•
2408.02210
•
Published
•
9
MMFactory: A Universal Solution Search Engine for Vision-Language Tasks
Paper
•
2412.18072
•
Published
•
20
Intern-S1: A Scientific Multimodal Foundation Model
Paper
•
2508.15763
•
Published
•
243
Hogwild! Inference: Parallel LLM Generation via Concurrent Attention
Paper
•
2504.06261
•
Published
•
111
Star Attention: Efficient LLM Inference over Long Sequences
Paper
•
2411.17116
•
Published
•
56
PRIMA.CPP: Speeding Up 70B-Scale LLM Inference on Low-Resource Everyday Home Clusters
Paper
•
2504.08791
•
Published
•
134
LLM Inference Unveiled: Survey and Roofline Model Insights
Paper
•
2402.16363
•
Published
•
2
Characterizing and Optimizing LLM Inference Workloads on CPU-GPU Coupled Architectures
Paper
•
2504.11750
•
Published
Efficient Diffusion Models: A Comprehensive Survey from Principles to Practices
Paper
•
2410.11795
•
Published
•
18
Generative AI for Character Animation: A Comprehensive Survey of Techniques, Applications, and Future Directions
Paper
•
2504.19056
•
Published
•
18
Personalized Image Generation with Deep Generative Models: A Decade Survey
Paper
•
2502.13081
•
Published
Diffusion Models: A Comprehensive Survey of Methods and Applications
Paper
•
2209.00796
•
Published
An Empirical Study of GPT-4o Image Generation Capabilities
Paper
•
2504.05979
•
Published
•
64
ImageRAG: Dynamic Image Retrieval for Reference-Guided Image Generation
Paper
•
2502.09411
•
Published
•
21
A survey of Generative AI Applications
Paper
•
2306.02781
•
Published
Text-to-image Diffusion Models in Generative AI: A Survey
Paper
•
2303.07909
•
Published
Multi-Agent Collaboration Mechanisms: A Survey of LLMs
Paper
•
2501.06322
•
Published
•
1
Multi-Agent Collaboration via Evolving Orchestration
Paper
•
2505.19591
•
Published
•
1
GenMAC: Compositional Text-to-Video Generation with Multi-Agent Collaboration
Paper
•
2412.04440
•
Published
•
22
AgentOrchestra: A Hierarchical Multi-Agent Framework for General-Purpose Task Solving
Paper
•
2506.12508
•
Published
•
1
Internet of Agents: Weaving a Web of Heterogeneous Agents for Collaborative Intelligence
Paper
•
2407.07061
•
Published
•
28
VideoTetris: Towards Compositional Text-to-Video Generation
Paper
•
2406.04277
•
Published
•
26
T2V-CompBench: A Comprehensive Benchmark for Compositional Text-to-video Generation
Paper
•
2407.14505
•
Published
•
27
DreamRunner: Fine-Grained Storytelling Video Generation with Retrieval-Augmented Motion Adaptation
Paper
•
2411.16657
•
Published
•
20
FlipSketch: Flipping Static Drawings to Text-Guided Sketch Animations
Paper
•
2411.10818
•
Published
•
27
VideoPoet: A Large Language Model for Zero-Shot Video Generation
Paper
•
2312.14125
•
Published
•
47
PIPO: Pipelined Offloading for Efficient Inference on Consumer Devices
Paper
•
2504.03664
•
Published
FlexInfer: Breaking Memory Constraint via Flexible and Efficient Offloading for On-Device LLM Inference
Paper
•
2503.03777
•
Published
SpeCache: Speculative Key-Value Caching for Efficient Generation of LLMs
Paper
•
2503.16163
•
Published
HeadInfer: Memory-Efficient LLM Inference by Head-wise Offloading
Paper
•
2502.12574
•
Published
•
12
Seesaw: High-throughput LLM Inference via Model Re-sharding
Paper
•
2503.06433
•
Published
MoE-Lens: Towards the Hardware Limit of High-Throughput MoE LLM Serving Under Resource Constraints
Paper
•
2504.09345
•
Published
InternVL3: Exploring Advanced Training and Test-Time Recipes for Open-Source Multimodal Models
Paper
•
2504.10479
•
Published
•
286
MV-RAG: Retrieval Augmented Multiview Diffusion
Paper
•
2508.16577
•
Published
•
36
Visual-CoG: Stage-Aware Reinforcement Learning with Chain of Guidance for Text-to-Image Generation
Paper
•
2508.18032
•
Published
•
40
PosterGen: Aesthetic-Aware Paper-to-Poster Generation via Multi-Agent LLMs
Paper
•
2508.17188
•
Published
•
15
Explain Before You Answer: A Survey on Compositional Visual Reasoning
Paper
•
2508.17298
•
Published
•
4
AgentFly: Fine-tuning LLM Agents without Fine-tuning LLMs
Paper
•
2508.16153
•
Published
•
132
AgentScope 1.0: A Developer-Centric Framework for Building Agentic Applications
Paper
•
2508.16279
•
Published
•
30
CineScale: Free Lunch in High-Resolution Cinematic Visual Generation
Paper
•
2508.15774
•
Published
•
19
Self-Rewarding Vision-Language Model via Reasoning Decomposition
Paper
•
2508.19652
•
Published
•
79
Discrete Diffusion VLA: Bringing Discrete Diffusion to Action Decoding in Vision-Language-Action Policies
Paper
•
2508.20072
•
Published
•
28
AudioStory: Generating Long-Form Narrative Audio with Large Language Models
Paper
•
2508.20088
•
Published
•
20
MotionFlux: Efficient Text-Guided Motion Generation through Rectified Flow Matching and Preference Alignment
Paper
•
2508.19527
•
Published
•
9
Taming the Chaos: Coordinated Autoscaling for Heterogeneous and Disaggregated LLM Inference
Paper
•
2508.19559
•
Published
•
5
Mixture of Contexts for Long Video Generation
Paper
•
2508.21058
•
Published
•
30
rStar2-Agent: Agentic Reasoning Technical Report
Paper
•
2508.20722
•
Published
•
97
Pref-GRPO: Pairwise Preference Reward-based GRPO for Stable Text-to-Image Reinforcement Learning
Paper
•
2508.20751
•
Published
•
85
AWorld: Orchestrating the Training Recipe for Agentic AI
Paper
•
2508.20404
•
Published
•
37
Dress&Dance: Dress up and Dance as You Like It - Technical Preview
Paper
•
2508.21070
•
Published
•
5
ROSE: Remove Objects with Side Effects in Videos
Paper
•
2508.18633
•
Published
•
7
EmbodiedOneVision: Interleaved Vision-Text-Action Pretraining for General Robot Control
Paper
•
2508.21112
•
Published
•
72
A.S.E: A Repository-Level Benchmark for Evaluating Security in AI-Generated Code
Paper
•
2508.18106
•
Published
•
197
R-4B: Incentivizing General-Purpose Auto-Thinking Capability in MLLMs via Bi-Mode Annealing and Reinforce Learning
Paper
•
2508.21113
•
Published
•
104
AHELM: A Holistic Evaluation of Audio-Language Models
Paper
•
2508.21376
•
Published
•
9
Morae: Proactively Pausing UI Agents for User Choices
Paper
•
2508.21456
•
Published
•
5
UItron: Foundational GUI Agent with Advanced Perception and Planning
Paper
•
2508.21767
•
Published
•
12
Efficient Code Embeddings from Code Generation Models
Paper
•
2508.21290
•
Published
•
18
TiKMiX: Take Data Influence into Dynamic Mixture for Language Model Pre-training
Paper
•
2508.17677
•
Published
•
14
CLIPSym: Delving into Symmetry Detection with CLIP
Paper
•
2508.14197
•
Published
•
7
A Survey of Scientific Large Language Models: From Data Foundations to Agent Frontiers
Paper
•
2508.21148
•
Published
•
132
Continual Learning for Large Language Models: A Survey
Paper
•
2402.01364
•
Published
•
1
Continual Learning with Pre-Trained Models: A Survey
Paper
•
2401.16386
•
Published
•
1
Continual Learning: Applications and the Road Forward
Paper
•
2311.11908
•
Published
•
1
The Landscape of Agentic Reinforcement Learning for LLMs: A Survey
Paper
•
2509.02547
•
Published
•
156
SimpleTIR: End-to-End Reinforcement Learning for Multi-Turn Tool-Integrated Reasoning
Paper
•
2509.02479
•
Published
•
76
ELV-Halluc: Benchmarking Semantic Aggregation Hallucinations in Long Video Understanding
Paper
•
2508.21496
•
Published
•
53
VerlTool: Towards Holistic Agentic Reinforcement Learning with Tool Use
Paper
•
2509.01055
•
Published
•
60
POINTS-Reader: Distillation-Free Adaptation of Vision-Language Models for Document Conversion
Paper
•
2509.01215
•
Published
•
42
GenCompositor: Generative Video Compositing with Diffusion Transformer
Paper
•
2509.02460
•
Published
•
22
OpenVision 2: A Family of Generative Pretrained Visual Encoders for Multimodal Learning
Paper
•
2509.01644
•
Published
•
26
Mixture of Global and Local Experts with Diffusion Transformer for Controllable Face Generation
Paper
•
2509.00428
•
Published
•
12
From Editor to Dense Geometry Estimator
Paper
•
2509.04338
•
Published
•
74
Drawing2CAD: Sequence-to-Sequence Learning for CAD Generation from Vector Drawings
Paper
•
2508.18733
•
Published
•
4
Towards a Unified View of Large Language Model Post-Training
Paper
•
2509.04419
•
Published
•
54