Feel the fear and do it anyway. 知惧而行
Take the risk or lose the chance. 不负此时

Yunong Liu

Research Engineer at Luma AI

Recently, my work has focused on multimodal generation and structured visual generation that connects pixels to code, agents, and people. At Luma AI, I lead Layering, which turns flat generated or uploaded media into editable components across vector, text, and pixel representations, so outputs can be revised, verified, and reused across multiple turns. Alongside this, I work on post-training, reward modeling, and evaluation: how to define, measure, and improve what "good" output means for real user workflows. Before Layering, I built the RL workflow for Ray3, and contributed RL experiments and data work to Uni-1.

Previously, I completed my M.S. in Computer Science at Stanford, advised by Jiajun Wu, where I led work on 4D grounding of assembly instructions in internet videos. I received my BEng in Electronics and Computer Science from the University of Edinburgh, ranking 2nd in my cohort.

Structured visual generation · Reward modeling · Multi-turn multimodal agents · 4D grounding

Current Work

Multi-Turn Agent Editing

I lead Layering, which turns flat visual outputs into editable components across vector (SVG), text, and alpha-aware pixel layers. The goal is to make generated media structured enough for people and agents to inspect, revise, verify, and build on across multiple turns.
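As a rough sketch of what a layered artifact can look like (the schema and every name below are hypothetical illustrations, not Luma's implementation):

```python
# Minimal sketch of a layered artifact, under an assumed schema. None of
# these names come from Luma's codebase; they only illustrate one artifact
# holding vector, text, and pixel layers side by side.
from dataclasses import dataclass, field


@dataclass
class VectorLayer:
    svg: str                      # editable vector markup, e.g. "<svg>...</svg>"


@dataclass
class TextLayer:
    content: str                  # raw text, re-renderable in any style
    font: str = "sans-serif"


@dataclass
class PixelLayer:
    rgba_path: str                # alpha-aware raster, so layers composite cleanly


@dataclass
class LayeredArtifact:
    """A generated image decomposed into independently editable layers."""
    layers: list = field(default_factory=list)   # bottom-to-top z-order

    def replace_text(self, old: str, new: str) -> None:
        # A multi-turn edit touches one layer and leaves the rest untouched,
        # which is what makes the edit inspectable and verifiable.
        for layer in self.layers:
            if isinstance(layer, TextLayer) and layer.content == old:
                layer.content = new


artifact = LayeredArtifact(layers=[
    PixelLayer(rgba_path="background.png"),
    VectorLayer(svg="<svg><circle r='40'/></svg>"),
    TextLayer(content="SALE"),
])
artifact.replace_text("SALE", "50% OFF")
```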

RL & Reward Modeling

I built the RL workflow for Ray3 and ran reward-modeling and calibration experiments for visual generators, including VLM-as-judge graders that provide reward signals and held-out evaluation, calibrated against human preference and designed to resist reward over-optimization.
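A minimal sketch of the judge-to-reward shaping idea, with an illustrative centering-and-clipping scheme (the names and constants are hypothetical, not Ray3's actual recipe):

```python
# Minimal sketch of a VLM-as-judge reward, under assumptions: `judge` is any
# callable returning a raw 0-10 quality score for (prompt, output) pairs; the
# baseline and clipping choices are illustrative, not Luma's actual recipe.
from typing import Callable


def judged_reward(
    prompt: str,
    output: str,
    judge: Callable[[str, str], float],
    baseline: float = 5.0,        # score of a reference/held-out output
    clip: float = 2.0,            # cap the advantage to resist over-optimization
) -> float:
    """Turn a raw judge score into a bounded reward signal.

    Centering on a baseline keeps the policy from chasing absolute judge
    scores; clipping limits how much any single sample can be exploited
    when the judge itself has blind spots (reward hacking).
    """
    raw = judge(prompt, output)
    advantage = raw - baseline
    return max(-clip, min(clip, advantage))


# Stand-in judge for the example; a real one would call a VLM grader.
def toy_judge(prompt: str, output: str) -> float:
    return 9.0 if prompt.lower() in output.lower() else 3.0


print(judged_reward("a red fox", "A red fox in snow", toy_judge))   # 2.0 (clipped)
print(judged_reward("a red fox", "A blue bird", toy_judge))         # -2.0
```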

Model Behavior Evals

I build evals that measure visual model behavior: instruction following, language grounding, information preservation, edit success over repeated interventions, and usefulness in real workflows. These evals guide data, reward, conditioning, and model changes.
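As an illustration, an edit-success eval over repeated interventions can be reduced to per-turn verifiers over artifact states; the toy metric below is a hypothetical sketch, not one of our production evals:

```python
# Minimal sketch of an edit-success metric over repeated interventions.
# `checks` is a hypothetical list of per-turn verifiers (e.g. "did the target
# change?", "is earlier content preserved?"); real evals would be richer.
from typing import Callable, Sequence


def edit_survival(
    states: Sequence[object],
    checks: Sequence[Callable[[object, object], bool]],
) -> list[bool]:
    """For each edit turn, record whether the requested change landed AND
    earlier content survived (information preservation across turns)."""
    results = []
    for before, after, check in zip(states, states[1:], checks):
        results.append(check(before, after))
    return results


# Toy example: states are sets of visible elements; each check asserts the
# edit's target changed while everything else was preserved.
states = [{"logo", "title"}, {"logo", "title", "price"}, {"logo", "price"}]
checks = [
    lambda b, a: "price" in a and b <= a,           # turn 1: add price, keep rest
    lambda b, a: "title" not in a and "logo" in a,  # turn 2: drop title, keep logo
]
print(edit_survival(states, checks))  # [True, True]
```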

Model Releases

Uni-1

Uni-1 is Luma's unified multimodal model for interleaved language and image understanding, reasoning, and generation.

  • Ran RL and other post-training experiments and contributed data work for unified multimodal understanding and generation.

Ray3

Ray3 is Luma's next-generation video model, focused on reasoning-driven generation, controllable video-to-video workflows, character reference, keyframes, HDR, and production-grade fidelity.

  • Built the Ray3 RL workflow and ran reward, data, and evaluation experiments for video generation quality, prompt following, product-relevant behavior, and VLM-as-judge reward signals.

This Year ✨

This year, I am interested in visual generation as a bridge between pixels, structure, code, feedback, and action. I want generated artifacts to be more than flat outputs: structured states that people and agents can inspect, revise, verify, and reuse in interactive multi-turn workflows. In design, this means layouts, diagrams, interfaces, product visuals, SVG, HTML, UI components, and other code-backed representations. In coding, it means artifacts that agents can read, modify, test, and verify instead of treating visual outputs as opaque images. In video and embodied settings, it points toward visual models that preserve spatial, temporal, and object-level structure well enough to support downstream reasoning and action. Layering is one attempt at this direction, with reward design and repeated-edit evals as the feedback loop around it.

Selected Publications


CaptionQA: Is Your Caption as Useful as the Image Itself?

Shijia Yang*, Yunong Liu*, Bohan Zhai*, Ximeng Sun, Zicheng Liu, Emad Barsoum, Manling Li, Chenfeng Xu.

(*equal contribution)
CVPR 2026 / Paper / GitHub

Equal-contribution work introducing CaptionQA, a utility-based benchmark with 33,027 multiple-choice questions across Natural, Document, E-commerce, and Embodied AI domains. We show measurable gaps between image and caption utility even for state-of-the-art multimodal models, reflecting my interest in evals that measure whether visual representations preserve the information people actually need.

Taming generative video models for zero-shot optical flow extraction

Seungwoo Kim*, Khai Loong Aw*, Klemen Kotar*, Cristobal Eyzaguirre, Wanhee Lee, Yunong Liu, Jared Watrous, Stefan Stojanov, Juan Carlos Niebles, Jiajun Wu, Daniel L. K. Yamins.

NeurIPS 2025 / Paper / Code / Project Page

Contributor to a counterfactual probe over video-diffusion logits that extracts optical flow with no labels and no fine-tuning. The method achieves state-of-the-art TAP-Vid results, generalizes to in-the-wild videos, and outperforms specialized optical-flow baselines.


IKEA Manuals at Work: 4D Grounding of Assembly Instructions on Internet Videos

Yunong Liu, Cristobal Eyzaguirre, Manling Li, Shubh Khanna, Juan Carlos Niebles, Vineeth Ravi, Saumitra Mishra, Weiyu Liu*, Jiajun Wu*.

NeurIPS Datasets and Benchmarks Track 2024 / Paper / Project Page / X (Twitter) Thread

First-author work and sole student lead on the first dataset aligning real-world assembly videos with 3D models and instruction manuals. Built cross-frame optimization with temporal-consistency constraints across 34k+ frames over 98 videos, and coordinated annotation and validation for evaluating whether models can ground human instructions in real visual workflows.


COVID-19 Misinformation Detection: Machine-Learned Solutions to the Infodemic

Yunong Liu*, Nikhil Kolluri*, Dhiraj Murthy.

(*equal contribution)
JMIR Infodemiology Vol 2, No 2 (2022) / Paper

Developed a hybrid framework combining machine learning with crowdsourced annotations to combat COVID-19 misinformation, with a systematic comparison of classical models (SVM, LR, BNB) and pre-trained transformers (BERT, RoBERTa, XLNet) across 7 dataset combinations.
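For flavor, a minimal sketch of such a comparison loop over the classical models, assuming TF-IDF features and toy data rather than the paper's actual datasets or hyperparameters:

```python
# Minimal sketch of a classical-model comparison, under assumptions: TF-IDF
# features, toy labeled texts, and default hyperparameters. The paper's real
# setup (7 dataset combinations, tuned models) is not reproduced here.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import BernoulliNB
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

texts = [
    "vaccine contains microchips", "drink bleach to cure covid",
    "5g towers spread the virus", "masks reduce transmission",
    "vaccines are tested in trials", "wash hands to stay safe",
]
labels = [1, 1, 1, 0, 0, 0]  # 1 = misinformation, 0 = reliable

models = {
    "SVM": LinearSVC(),
    "LR": LogisticRegression(max_iter=1000),
    "BNB": BernoulliNB(),
}
for name, clf in models.items():
    # Each model gets the same TF-IDF front end for a like-for-like comparison.
    pipe = make_pipeline(TfidfVectorizer(), clf)
    scores = cross_val_score(pipe, texts, labels, cv=3, scoring="f1")
    print(f"{name}: mean F1 = {scores.mean():.3f}")
```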

Selected Toy/Course Projects


EMo-Mask

Emotion-controllable motion generation · Stanford CS348I · 2024 Project


Just Dance Everywhere

Best Course Project Award · UT Austin EE379K · 2022 GitHub


Gates 3rd Floor Render

A+ grade and course showcase · Stanford CS148 · 2024 Render Showcase

Awards

  • UT Austin Cockrell School of Engineering Fellowship (declined), Apr 2023
  • Turing Scheme Funding, Jan 2022
  • Leadership in Student Opportunities Edinburgh Award, Jul 2021
  • 1st Year Class Medal, awarded to the top overall student in the first-year Electronics and Electrical Engineering discipline at the University of Edinburgh, Jul 2020

Other Experience

  • Student Tech Intern, Semiconductor Manufacturing Anomaly Detection, NXP Semiconductors, Tianjin, China, May 2021 - Aug 2021

  • Deep Learning Intern, Wind Farm Performance AI Optimization, Zealen AI (Startup), Beijing, China, Aug 2021 - Sep 2021

  • Research Assistant, NLP Project, University of Edinburgh, working with Prof. Bonnie Webber on discourse relation analysis, Jul 2022 - Jan 2023

  • Teaching Demonstrator, University of Edinburgh, InfBase, Sep 2022 - Jan 2023

  • Programme Representative, University of Edinburgh, Sep 2022 - Jun 2023

  • Global Buddies Program Mentor, University of Edinburgh, Sep 2021 - Jun 2022

  • Volunteer, Sri Lanka Wildlife Conservation Society (SLWCS), Jul 2019 - Aug 2019

My Purr-fect Companions


Mewomewo

DoB: October 1, 2012

Role: The Elegant Lady

Superpowers: Tsundere Queen · Cleanliness Lover · Fearless Explorer

Mewomewo has been the queen of the house since 2012. She acts cool but secretly loves attention. She's super brave, except when it comes to mess. You'll often see her checking whether the house is clean and tidy.


Xiaopang (Little Fat)

DoB: July 17, 2014

Role: The Cuddly Foodie

Superpowers: Cuddle Expert · Food Lover · Master of Stealth (when scared)

Xiaopang, our big bundle of love, joined us in 2014. His name means "little fat", but really, he's just extra cuddly! He loves snuggles and treats equally. He might get scared easily, but his heart is as big as his appetite.