Feel the fear and do it anyway. 知惧而行
Take the risk or lose the chance. 不负此时

Yunong Liu

Research Engineer at Luma AI

Recently, my work has focused on multimodal generation and structured visual generation that connects pixels to code, agents, and people. At Luma AI, I lead Layering, which turns flat generated or uploaded media into editable components across vector, text, and pixel representations, so outputs can be revised, verified, and reused across multiple turns. Alongside this, I work on post-training, reward modeling, and evaluation: how to define, measure, and improve what "good" output means for real user workflows. Before Layering, I built the RL workflow for Ray3, and contributed RL experiments and data work to Uni-1.

Previously, I completed my M.S. in Computer Science at Stanford, advised by Jiajun Wu, where I led work on 4D grounding of assembly instructions in internet videos. I received my BEng in Electronics and Computer Science from the University of Edinburgh, ranking 2nd in my cohort.

Structured visual generation · Reward modeling · Multi-turn multimodal agents · 4D grounding

Current Work

Multi-Turn Agent Editing

I lead Layering, which turns flat visual outputs into editable components across vector (SVG), text, and alpha-aware pixel layers. The goal is to make generated media structured enough for people and agents to inspect, revise, verify, and build on across multiple turns.
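As a rough sketch of what a layered artifact can look like (the schema and every name below are hypothetical illustrations, not Luma's implementation):

```python
# Minimal sketch of a layered artifact, under an assumed schema. None of
# these names come from Luma's codebase; they only illustrate one artifact
# holding vector, text, and pixel layers side by side.
from dataclasses import dataclass, field


@dataclass
class VectorLayer:
    svg: str                      # editable vector markup, e.g. "<svg>...</svg>"


@dataclass
class TextLayer:
    content: str                  # raw text, re-renderable in any style
    font: str = "sans-serif"


@dataclass
class PixelLayer:
    rgba_path: str                # alpha-aware raster, so layers composite cleanly


@dataclass
class LayeredArtifact:
    """A generated image decomposed into independently editable layers."""
    layers: list = field(default_factory=list)   # bottom-to-top z-order

    def replace_text(self, old: str, new: str) -> None:
        # A multi-turn edit touches one layer and leaves the rest untouched,
        # which is what makes the edit inspectable and verifiable.
        for layer in self.layers:
            if isinstance(layer, TextLayer) and layer.content == old:
                layer.content = new


artifact = LayeredArtifact(layers=[
    PixelLayer(rgba_path="background.png"),
    VectorLayer(svg="<svg><circle r='40'/></svg>"),
    TextLayer(content="SALE"),
])
artifact.replace_text("SALE", "50% OFF")
```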

RL & Reward Modeling

I built the RL workflow for Ray3 and ran reward-modeling and calibration experiments for visual generators, including VLM-as-judge graders that provide reward signals and held-out evaluation, calibrated against human preference and designed to resist reward over-optimization.
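A minimal sketch of the judge-to-reward shaping idea, with an illustrative centering-and-clipping scheme (the names and constants are hypothetical, not Ray3's actual recipe):

```python
# Minimal sketch of a VLM-as-judge reward, under assumptions: `judge` is any
# callable returning a raw 0-10 quality score for (prompt, output) pairs; the
# baseline and clipping choices are illustrative, not Luma's actual recipe.
from typing import Callable


def judged_reward(
    prompt: str,
    output: str,
    judge: Callable[[str, str], float],
    baseline: float = 5.0,        # score of a reference/held-out output
    clip: float = 2.0,            # cap the advantage to resist over-optimization
) -> float:
    """Turn a raw judge score into a bounded reward signal.

    Centering on a baseline keeps the policy from chasing absolute judge
    scores; clipping limits how much any single sample can be exploited
    when the judge itself has blind spots (reward hacking).
    """
    raw = judge(prompt, output)
    advantage = raw - baseline
    return max(-clip, min(clip, advantage))


# Stand-in judge for the example; a real one would call a VLM grader.
def toy_judge(prompt: str, output: str) -> float:
    return 9.0 if prompt.lower() in output.lower() else 3.0


print(judged_reward("a red fox", "A red fox in snow", toy_judge))   # 2.0 (clipped)
print(judged_reward("a red fox", "A blue bird", toy_judge))         # -2.0
```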

Model Behavior Evals

I build evals that measure visual model behavior: instruction following, language grounding, information preservation, edit success over repeated interventions, and usefulness in real workflows. These evals guide data, reward, conditioning, and model changes.
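As an illustration, an edit-success eval over repeated interventions can be reduced to per-turn verifiers over artifact states; the toy metric below is a hypothetical sketch, not one of our production evals:

```python
# Minimal sketch of an edit-success metric over repeated interventions.
# `checks` is a hypothetical list of per-turn verifiers (e.g. "did the target
# change?", "is earlier content preserved?"); real evals would be richer.
from typing import Callable, Sequence


def edit_survival(
    states: Sequence[object],
    checks: Sequence[Callable[[object, object], bool]],
) -> list[bool]:
    """For each edit turn, record whether the requested change landed AND
    earlier content survived (information preservation across turns)."""
    results = []
    for before, after, check in zip(states, states[1:], checks):
        results.append(check(before, after))
    return results


# Toy example: states are sets of visible elements; each check asserts the
# edit's target changed while everything else was preserved.
states = [{"logo", "title"}, {"logo", "title", "price"}, {"logo", "price"}]
checks = [
    lambda b, a: "price" in a and b <= a,           # turn 1: add price, keep rest
    lambda b, a: "title" not in a and "logo" in a,  # turn 2: drop title, keep logo
]
print(edit_survival(states, checks))  # [True, True]
```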

Model Releases

Uni-1

Uni-1 is Luma's unified multimodal model for interleaved language and image understanding, reasoning, and generation.

  • Ran RL and other post-training experiments and contributed data work for unified multimodal understanding and generation.

Ray3

Ray3 is Luma's next-generation video model, focused on reasoning-driven generation, controllable video-to-video workflows, character reference, keyframes, HDR, and production-grade fidelity.

  • Built the Ray3 RL workflow and ran reward, data, and evaluation experiments for video generation quality, prompt following, product-relevant behavior, and VLM-as-judge reward signals.

This Year ✨

This year, I am interested in visual generation as a bridge between pixels, structure, code, feedback, and action. I want generated artifacts to be more than flat outputs: structured states that people and agents can inspect, revise, verify, and reuse in interactive multi-turn workflows. In design, this means layouts, diagrams, interfaces, product visuals, SVG, HTML, UI components, and other code-backed representations. In coding, it means artifacts that agents can read, modify, test, and verify instead of treating visual outputs as opaque images. In video and embodied settings, it points toward visual models that preserve spatial, temporal, and object-level structure well enough to support downstream reasoning and action. Layering is one attempt at this direction, with reward design and repeated-edit evals as the feedback loop around it.

Selected Publications


CaptionQA: Is Your Caption as Useful as the Image Itself?

Shijia Yang*, Yunong Liu*, Bohan Zhai*, Ximeng Sun, Zicheng Liu, Emad Barsoum, Manling Li, Chenfeng Xu.

(*equal contribution)
CVPR 2026 / Paper / GitHub

Equal-contribution work introducing CaptionQA, a utility-based benchmark with 33,027 multiple-choice questions across Natural, Document, E-commerce, and Embodied AI domains. We show measurable gaps between image and caption utility even for state-of-the-art multimodal models, reflecting my interest in evals that measure whether visual representations preserve the information people actually need.

Taming generative video models for zero-shot optical flow extraction

Seungwoo Kim*, Khai Loong Aw*, Klemen Kotar*, Cristobal Eyzaguirre, Wanhee Lee, Yunong Liu, Jared Watrous, Stefan Stojanov, Juan Carlos Niebles, Jiajun Wu, Daniel L. K. Yamins.

NeurIPS 2025 / Paper / Code / Project Page

Contributor to a counterfactual probe over video-diffusion logits that extracts optical flow with no labels and no fine-tuning. The method achieves state-of-the-art TAP-Vid results, generalizes to in-the-wild videos, and outperforms specialized optical-flow baselines.


IKEA Manuals at Work: 4D Grounding of Assembly Instructions on Internet Videos

Yunong Liu, Cristobal Eyzaguirre, Manling Li, Shubh Khanna, Juan Carlos Niebles, Vineeth Ravi, Saumitra Mishra, Weiyu Liu*, Jiajun Wu*.

NeurIPS Datasets and Benchmarks Track 2024 / Paper / Project Page / X (Twitter) Thread

First-author work and sole student lead on the first dataset aligning real-world assembly videos with 3D models and instruction manuals. Built cross-frame optimization with temporal-consistency constraints across 34k+ frames over 98 videos, and coordinated annotation and validation for evaluating whether models can ground human instructions in real visual workflows.


COVID-19 Misinformation Detection: Machine-Learned Solutions to the Infodemic

Yunong Liu*, Nikhil Kolluri*, Dhiraj Murthy.

(*equal contribution)
JMIR Infodemiology Vol 2, No 2 (2022) / Paper

Developed a hybrid framework combining machine learning with crowdsourced annotations to combat COVID-19 misinformation, with a systematic comparison of classical models (SVM, LR, BNB) and pre-trained transformers (BERT, RoBERTa, XLNet) across 7 dataset combinations.
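For flavor, a minimal sketch of such a comparison loop over the classical models, assuming TF-IDF features and toy data rather than the paper's actual datasets or hyperparameters:

```python
# Minimal sketch of a classical-model comparison, under assumptions: TF-IDF
# features, toy labeled texts, and default hyperparameters. The paper's real
# setup (7 dataset combinations, tuned models) is not reproduced here.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import BernoulliNB
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

texts = [
    "vaccine contains microchips", "drink bleach to cure covid",
    "5g towers spread the virus", "masks reduce transmission",
    "vaccines are tested in trials", "wash hands to stay safe",
]
labels = [1, 1, 1, 0, 0, 0]  # 1 = misinformation, 0 = reliable

models = {
    "SVM": LinearSVC(),
    "LR": LogisticRegression(max_iter=1000),
    "BNB": BernoulliNB(),
}
for name, clf in models.items():
    # Each model gets the same TF-IDF front end for a like-for-like comparison.
    pipe = make_pipeline(TfidfVectorizer(), clf)
    scores = cross_val_score(pipe, texts, labels, cv=3, scoring="f1")
    print(f"{name}: mean F1 = {scores.mean():.3f}")
```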

Selected Toy/Course Projects


EMo-Mask

Emotion-controllable motion generation · Stanford CS348I · 2024 Project


Just Dance Everywhere

Best Course Project Award · UT Austin EE379K · 2022 GitHub


Gates 3rd Floor Render

A+ grade and course showcase · Stanford CS148 · 2024 Render Showcase

Awards

  • UT Austin Cockrell School of Engineering Fellowship (declined), Apr 2023
  • Turing Scheme Funding, Jan 2022
  • Leadership in Student Opportunities Edinburgh Award, Jul 2021
  • 1st Year Class Medal, awarded to the top overall student in the first-year Electronics and Electrical Engineering discipline at the University of Edinburgh, Jul 2020

Other Experience

  • Student Tech Intern, Semiconductor Manufacturing Anomaly Detection, NXP Semiconductors, Tianjin, China, May 2021 - Aug 2021

  • Deep Learning Intern, Wind Farm Performance AI Optimization, Zealen AI (Startup), Beijing, China, Aug 2021 - Sep 2021

  • Research Assistant, NLP Project, University of Edinburgh, working with Prof. Bonnie Webber on discourse relation analysis, Jul 2022 - Jan 2023

  • Teaching Demonstrator, University of Edinburgh, InfBase, Sep 2022 - Jan 2023

  • Programme Representative, University of Edinburgh, Sep 2022 - Jun 2023

  • Global Buddies Program Mentor, University of Edinburgh, Sep 2021 - Jun 2022

  • Volunteer, Sri Lanka Wildlife Conservation Society (SLWCS), Jul 2019 - Aug 2019

My Purr-fect Companions


Mewomewo

DoB: October 1, 2012

Role: The Elegant Lady

Superpowers: Tsundere Queen · Cleanliness Lover · Fearless Explorer

Mewomewo has been the queen of the house since 2012. She acts cool but secretly loves attention. She's super brave, except when it comes to mess. You'll often see her checking whether the house is clean and tidy.


Xiaopang (Little Fat)

DoB: July 17, 2014

Role: The Cuddly Foodie

Superpowers: Cuddle Expert · Food Lover · Master of Stealth (when scared)

Xiaopang, our big bundle of love, joined us in 2014. His name means "little fat", but really, he's just extra cuddly! He loves snuggles and treats equally. He might get scared easily, but his heart is as big as his appetite.