Minghua Liu

I am a Ph.D. student in computer science at the University of California, San Diego, where I am fortunate to be advised by Prof. Hao Su. My research focuses on 3D vision and embodied AI. For a succinct overview of selected projects I have led, please refer to this figure.

I have had great experiences at Qualcomm, Waymo, Adobe, SenseTime, and USC. Before starting my Ph.D., I completed my undergraduate studies in computer science at Tsinghua University under the mentorship of Prof. Shi-Min Hu.

[CV] [Google Scholar] [Twitter]


One-2-3-45++: Fast Single Image to 3D Objects with Consistent Multi-View Generation and 3D Diffusion

Minghua Liu*, Ruoxi Shi*, Linghao Chen*, Zhuoyang Zhang*, Chao Xu*, Xinyue Wei, Hansheng Chen, Chong Zeng, Jiayuan Gu, Hao Su.

[project] [demo] [PDF]

Recent advancements in open-world 3D object generation have been remarkable, with image-to-3D methods offering finer-grained control than their text-to-3D counterparts. However, most existing models fall short of simultaneously providing rapid generation speed and high fidelity to the input image, two features essential for practical applications. In this paper, we present One-2-3-45++, an innovative method that transforms a single image into a detailed 3D textured mesh in approximately one minute. The generated meshes closely mirror the original input image.

Zero123++: A Single Image to Consistent Multi-view Diffusion Base Model

Ruoxi Shi, Hansheng Chen, Zhuoyang Zhang, Minghua Liu, Chao Xu, Xinyue Wei, Linghao Chen, Chong Zeng, Hao Su.

Technical Report [PDF] [code] [demo]

We report Zero123++, an image-conditioned diffusion model for generating 3D-consistent multi-view images from a single input view.

One-2-3-45: Any Single Image to 3D Mesh in 45 Seconds without Per-Shape Optimization

Minghua Liu*, Chao Xu*, Haian Jin*, Linghao Chen*, Mukund Varma T, Zexiang Xu, Hao Su.

NeurIPS 2023 [project] [code] [demo] [PDF]

Many existing image-to-3D methods optimize a neural radiance field under the guidance of 2D diffusion models but suffer from lengthy optimization times, 3D-inconsistent results, and poor geometry. In this work, we propose a novel method that takes a single image of any object as input and generates a full 360-degree 3D textured mesh in a single feed-forward pass. Without costly optimization, our method reconstructs 3D shapes in significantly less time than existing methods. Moreover, it produces better geometry, generates more 3D-consistent results, and adheres more closely to the input image. In addition, our approach seamlessly supports the text-to-3D task by integrating with off-the-shelf text-to-image diffusion models.

OpenShape: Scaling Up 3D Shape Representation Towards Open-World Understanding

Minghua Liu*, Ruoxi Shi*, Kaiming Kuang*, Yinhao Zhu, Xuanlin Li, Shizhong Han, Hong Cai, Fatih Porikli, Hao Su.

NeurIPS 2023 [project] [PDF] [code] [demo]

We introduce OpenShape, a method for learning multi-modal joint representations of text, images, and point clouds. OpenShape demonstrates superior capabilities for open-world recognition, achieving a zero-shot accuracy of 46.8% on the 1,156-category Objaverse-LVIS benchmark and 85.3% on ModelNet40. Moreover, we show that our learned embeddings encode a wide range of visual and semantic concepts and facilitate fine-grained text-3D and image-3D interactions. Due to their alignment with CLIP embeddings, our learned shape representations can also be integrated with off-the-shelf CLIP-based models for various applications.

Learning Reusable Dense Rewards for Multi-Stage Tasks

Tongzhou Mu, Minghua Liu, Hao Su.

Distilling Large Vision-Language Model with Out-of-Distribution Generalizability

Xuanlin Li*, Yunhao Fang*, Minghua Liu, Zhan Ling, Zhuowen Tu, Hao Su.

ICCV 2023 [PDF] [code]

We investigate the distillation of visual representations in large teacher vision-language models into lightweight student models using a small- or mid-scale dataset, aiming to maintain the performance of teacher models.

PartSLIP: Low-Shot Part Segmentation for 3D Point Clouds via Pretrained Image-Language Models

Minghua Liu, Yinhao Zhu, Hong Cai, Shizhong Han, Zhan Ling, Fatih Porikli, Hao Su.

CVPR 2023 [project] [PDF] [code (please email me)] [slides] [video]

This paper explores a novel way for low-shot part segmentation of 3D point clouds by leveraging a pretrained image-language model. We show that our method enables excellent zero-shot 3D part segmentation. Our few-shot version not only outperforms existing few-shot approaches by a large margin but also achieves highly competitive results compared to the fully supervised counterpart. Furthermore, we demonstrate that our method can be directly applied to iPhone-scanned point clouds without significant domain gaps.

Frame Mining: a Free Lunch for Learning Robotic Manipulation from 3D Point Clouds

Minghua Liu*, Xuanlin Li*, Zhan Ling*, Yangyan Li, Hao Su.

CoRL 2022 [project] [PDF] [slides] [code] [video]

We study how the choice of input point cloud coordinate frame affects the learning of manipulation skills from 3D point clouds. We find that different frames lead to distinct agent learning performance. Since the well-performing frames vary across tasks, and some tasks may benefit from multiple frame candidates, we propose FrameMiners to adaptively select candidate frames and fuse their merits in a task-agnostic manner. Without changing existing camera placements or adding extra cameras, point cloud frame mining can serve as a free lunch to improve 3D manipulation learning.

LESS: Label-Efficient Semantic Segmentation for LiDAR Point Clouds

Minghua Liu, Yin Zhou, Charles R. Qi, Boqing Gong, Hao Su, Dragomir Anguelov.

ECCV 2022 (Oral) [PDF]

We propose a label-efficient semantic segmentation method for outdoor LiDAR point clouds. It co-designs an efficient labeling process with semi/weakly supervised learning and is applicable to nearly any 3D semantic segmentation backbone. With extremely limited human annotations (e.g., 0.1% point labels), our method is highly competitive with its fully supervised counterpart trained on 100% of the labels.

Approximate Convex Decomposition for 3D Meshes with Collision-Aware Concavity and Tree Search

Xinyue Wei*, Minghua Liu*, Zhan Ling, Hao Su

SIGGRAPH 2022 (Journal track) [project] [PDF] [code] [video]

Approximate convex decomposition enables efficient geometry processing algorithms specifically designed for convex shapes (e.g., collision detection). We propose a method that better preserves the collision conditions of the input shape while using fewer components, thus supporting delicate and efficient object interaction in downstream applications.

Close the Visual Domain Gap by Physics-Grounded Active Stereovision Depth Sensor Simulation

Xiaoshuai Zhang*, Rui Chen*, Fanbo Xiang**, Yuzhe Qin**, Jiayuan Gu**, Zhan Ling**, Minghua Liu**, Peiyu Zeng**, Songfang Han***, Zhiao Huang***, Tongzhou Mu***, Jing Xu, Hao Su

T-RO 2023 [PDF]

We focus on the simulation of active stereovision depth sensors and design a fully physics-grounded simulation pipeline, which includes material acquisition, ray-tracing-based infrared (IR) image rendering, IR noise simulation, and depth estimation. The pipeline generates depth maps with material-dependent error patterns similar to those of a real depth sensor.

DeepMetaHandles: Learning Deformation Meta-Handles of 3D Meshes with Biharmonic Coordinates

Minghua Liu, Minhyuk Sung, Radomir Mech, Hao Su

CVPR 2021 (Oral) [project] [PDF] [code] [animations]

We present DeepMetaHandles, a 3D conditional generative model based on mesh deformation. Our method takes automatically generated control points with biharmonic coordinates as deformation handles and learns a latent space of deformations for each input mesh. Each axis of the space is explicitly associated with multiple deformation handles and is thus called a meta-handle. The disentangled meta-handles factorize all plausible deformations of the shape, while each of them conforms to an intuitive deformation. We learn the meta-handles in an unsupervised manner by incorporating a target-driven deformation module. We also employ a differentiable renderer and a 2D discriminator to enhance the plausibility of the deformations.

Meshing Point Clouds with Predicted Intrinsic-Extrinsic Ratio Guidance

Minghua Liu, Xiaoshuai Zhang, Hao Su

ECCV 2020 [PDF] [code]

We propose a mesh reconstruction method that leverages the input point cloud as much as possible by only adding connectivity information to existing points. In particular, we predict which triplets of points should form faces. Our key innovation is a surrogate of local connectivity, calculated by comparing intrinsic and extrinsic metrics. We learn to predict this surrogate using a deep point cloud network and then feed it to an efficient post-processing module for high-quality mesh generation. Experiments on synthetic and real data demonstrate that our method not only preserves details and handles ambiguous structures but also generalizes well to unseen categories.

SAPIEN: A SimulAted Part-based Interactive ENvironment

Fanbo Xiang, Yuzhe Qin, Kaichun Mo, Yikuan Xia, Hao Zhu, Fangchen Liu, Minghua Liu, Hanxiao Jiang, Yifu Yuan, He Wang, Li Yi, Angel Chang, Leonidas Guibas, Hao Su

CVPR 2020 (Oral) [PDF] [project]

SAPIEN is a realistic and physics-rich simulated environment that hosts a large-scale set of articulated objects. It enables various robotic vision and interaction tasks that require detailed part-level understanding.

Morphing and Sampling Network for Dense Point Cloud Completion

Minghua Liu, Lu Sheng, Sheng Yang, Jing Shao, Shi-Min Hu

AAAI 2020 [PDF] [code] [data]

To acquire high-fidelity dense point clouds while avoiding the uneven distribution, blurred details, and structural loss of existing methods' results, we propose a novel approach that completes the partial point cloud in two stages. In the first stage, the approach predicts a complete but coarse-grained point cloud with a collection of parametric surface elements. In the second stage, it merges the coarse-grained prediction with the input point cloud via a novel sampling algorithm and then learns a point-wise residual for the combination. Our method uses a joint loss function to guide the distribution of the points.

Multi-task Batch Reinforcement Learning with Metric Learning

Jiachen Li*, Quan Vuong*, Shuang Liu, Minghua Liu, Kamil Ciosek, Henrik Iskov Christensen, Hao Su

NeurIPS 2020 [PDF] [code]

We tackle the Multi-task Batch Reinforcement Learning problem.

Task and Path Planning for Multi-Agent Pickup and Delivery

Minghua Liu, Hang Ma, Jiaoyang Li, Sven Koenig

AAMAS 2019 [PDF] [slides]

We study the Multi-Agent Pickup-and-Delivery (MAPD) problem, where a team of agents has to execute a batch of tasks in a known environment. To execute a task, an agent has to move first from its current location to the pickup location of the task and then to the delivery location of the task. The MAPD problem is to assign tasks to agents and plan collision-free paths for them to execute their tasks. Online MAPD algorithms can be applied to the offline MAPD problem, but do not utilize all of the available information and may thus not be effective. Therefore, we present two novel offline MAPD algorithms.

HeteroFusion: Dense Scene Reconstruction Integrating Multi-sensors

Sheng Yang, Beichen Li, Minghua Liu, Yu-Kun Lai, Leif Kobbelt, Shi-Min Hu


We present a real-time approach that integrates multiple sensors for dense reconstruction of 3D indoor scenes. Existing algorithms are mainly based on a single RGB-D camera and require continuous scanning of areas with sufficient geometric detail; failing to do so can lead to tracking loss. We incorporate multiple types of sensors commonly equipped on modern robots, including a 2D range sensor, an IMU, and wheel encoders, to reinforce the tracking process and obtain better mesh reconstruction.

Saliency-Aware Real-Time Volumetric Fusion for Object Reconstruction

Sheng Yang, Kang Chen, Minghua Liu, Hongbo Fu and Shi-Min Hu

Pacific Graphics 2017 [PDF]

We present a real-time approach for acquiring 3D objects with high fidelity using hand-held consumer-level RGB-D scanning devices. Existing real-time reconstruction methods may fail to produce clean reconstructions of desired objects due to distracting objects or backgrounds. To address these issues, we incorporate visual saliency into a traditional real-time volumetric fusion pipeline. Salient regions detected from RGB-D frames suggest user-intended objects, and by understanding user intentions, our approach can place more emphasis on important targets while suppressing the disturbance of unimportant objects.


Conference Reviewer: CVPR'21'22'23'24, ICCV'21'23, ECCV'22, SIGGRAPH'23, SIGGRAPH Asia'23, NeurIPS'21'22'23, ICLR'22'23'24, ICML'22'23, AAAI'23'24

Journal Reviewer: T-RO, RA-L, TPAMI, TVCG, CVMJ


In my spare time, I enjoy hiking, camping, fishing, skiing, and other outdoor activities.

I really love spicy food.

I have an adorable Labrador named Jojo.