João Loula

I am a PhD student at MIT advised by Vikash Mansinghka, Josh Tenenbaum, and Tim O'Donnell, working on scaling data science through probabilistic programming.

Previously, I was a research intern at Meta AI Research working with Brenden Lake and Marco Baroni, at Harvard with Sam Gershman, and on the Inria Parietal team with Bertrand Thirion and Gaël Varoquaux. I've studied at École Normale Supérieure Paris-Saclay, École Polytechnique, and Universidade de São Paulo.

Email  /  CV  /  Google Scholar  /  Github

Research

Data science typically involves analyzing structured tables and unstructured text to make predictions, impute missing data, discover relationships between variables, infer causal effects, or detect anomalies. My work uses probabilistic programming to learn and query generative models for data science: for example, guiding transformers to convert unstructured text into structured data, and learning GPU-efficient generative models of tables that can solve a wide range of data science tasks.

Syntactic and Semantic Control of Large Language Models via Sequential Monte Carlo
João Loula, Benjamin LeBrun, Li Du, Ben Lipkin, Clemente Pasti, Gabriel Grand, Tianyu Liu, Yahya Emara, Marjorie Freedman, Jason Eisner, Ryan Cotterell, Vikash Mansinghka, Alexander K. Lew, Tim Vieira, Timothy J. O'Donnell
ICLR, 2025 (Oral, <1.8% of papers)

A wide range of LLM applications require generating text that conforms to syntactic or semantic constraints. Imposing such constraints nontrivially alters the distribution over sequences, usually making exact sampling intractable. In this work, building on the Language Model Probabilistic Programming framework of Lew et al. (2023), we develop an approach to approximate inference for controlled LLM generation based on sequential Monte Carlo (SMC). Our SMC framework allows us to flexibly incorporate domain- and problem-specific constraints at inference time, and efficiently reallocate computation in light of new information during the course of generation. We demonstrate that our approach improves downstream performance on four challenging domains---Python code generation for data science, text-to-SQL, goal inference, and molecule synthesis. We compare to a number of alternative and ablated approaches, showing that our accuracy improvements are driven by better approximation to the full Bayesian posterior.
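As a rough illustration of the general recipe (not the paper's system), the sketch below runs SMC on a toy constrained-generation problem: a uniform stand-in for the LLM proposes tokens, a hard balanced-parentheses constraint plays the role of the potential, and particles are reweighted and resampled as generation proceeds. The vocabulary, constraint, and particle count are all illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
VOCAB = ["(", ")"]  # toy vocabulary (an LLM tokenizer in the real system)
LENGTH = 6          # generate strings of this length
N = 20              # number of particles

def feasible(prefix):
    # Hard syntactic potential: can this prefix still be completed to a
    # balanced parenthesis string of total length LENGTH?
    depth = 0
    for c in prefix:
        depth += 1 if c == "(" else -1
        if depth < 0:
            return False
    remaining = LENGTH - len(prefix)
    return depth <= remaining and (depth + remaining) % 2 == 0

particles, logw = [""] * N, np.zeros(N)
for _ in range(LENGTH):
    for i in range(N):
        particles[i] += VOCAB[rng.integers(len(VOCAB))]  # propose from the "LM"
        # Incremental weight update: 0/1 here, soft potentials in general.
        logw[i] += 0.0 if feasible(particles[i]) else -np.inf
    # Resample when the effective sample size collapses, reallocating
    # computation toward prefixes that can still satisfy the constraint.
    w = np.exp(logw - logw.max()); w /= w.sum()
    if 1.0 / np.sum(w ** 2) < N / 2:
        idx = rng.choice(N, size=N, p=w)
        particles, logw = [particles[i] for i in idx], np.zeros(N)

print(sorted({p for p, lw in zip(particles, logw) if lw > -np.inf}))
```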

Scalable tabular data modeling via GPU-accelerated probabilistic programming
João Loula, Ulrich Schaechtle, Josh Tenenbaum, Tim O'Donnell, Vikash Mansinghka
Working Paper, 2025

We present GenJaxMix, a mixture-of-experts generative model for tabular data that can be trained efficiently via a novel GPU-based sequential Monte Carlo algorithm, and queried efficiently via dedicated GPU implementations of sampling, conditioning, likelihood, and marginalization, whose combination can express a wide range of downstream tasks. Through empirical evaluations, we show that GenJaxMix generates more accurate synthetic data than generative models based on diffusion models, transformers, and random forests, while being faster to train. We likewise show that it can perform downstream tasks, such as conditional synthetic data generation, anomaly detection, multiple imputation, and relationship discovery, better than commonly used methods, while being much more query-time efficient.
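The queries named above can be pictured on a much smaller model. The sketch below hand-specifies a two-component diagonal-Gaussian mixture in NumPy (an illustrative stand-in, not GenJaxMix) and combines conditioning, likelihood, and marginalization to impute a missing column from an observed one.

```python
import numpy as np
from scipy.stats import norm

weights = np.array([0.6, 0.4])           # mixture weights
mu = np.array([[0.0, 5.0], [4.0, 1.0]])  # per-component means of (x1, x2)
sd = np.ones((2, 2))                     # per-component std devs (diagonal)

def impute_x2_given_x1(x1):
    # Condition: posterior over components from the marginal likelihood
    # of the observed column x1 (Bayes' rule).
    lik = weights * norm.pdf(x1, mu[:, 0], sd[:, 0])
    post = lik / lik.sum()
    # Marginalize: with diagonal covariance, x2 is independent of x1
    # within a component, so the imputation is the posterior-weighted mean.
    return post @ mu[:, 1]

print(impute_x2_given_x1(0.0))  # ~5.0: component 0 dominates
print(impute_x2_given_x1(4.0))  # ~1.0: component 1 dominates
```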

Learning Generative Population Models From Multiple Clinical Datasets Via Probabilistic Programming
João Loula, Katherine M. Collins, Ulrich Schaechtle, Joshua B. Tenenbaum, Adrian Weller, Feras Saad, Timothy J. O'Donnell, Vikash Mansinghka
ICML AccMLBio, 2024

Accurate, efficient generative models of clinical populations could accelerate clinical research and improve patient outcomes. For example, such models could infer probable treatment outcomes for different subpopulations, generate high-fidelity synthetic data that can be shared across organizational boundaries, and discover new relationships among clinical variables. Using Bayesian structure learning, we show that it is possible to learn probabilistic program models of clinical populations by combining data from multiple, sparsely overlapping clinical datasets. Through experiments with multiple clinical trials and real-world evidence from census health surveys, we show that our model generates higher quality synthetic data than neural network baselines, supports more accurate inferences across datasets than traditional statistical methods, and can be queried more efficiently than both, opening up new avenues for accessible and efficient AI assistance in clinical research.
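To make the data setup concrete: the sketch below stacks two toy "datasets" that observe sparsely overlapping variables into one table with structured missingness, then fits a single joint model and queries it across datasets. scikit-learn's IterativeImputer stands in for the paper's probabilistic-program model; the variables and values are invented for illustration.

```python
import pandas as pd
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer

# Dataset A measures (age, bmi); dataset B measures (age, glucose).
a = pd.DataFrame({"age": [40, 55, 63], "bmi": [24.0, 31.0, 28.0]})
b = pd.DataFrame({"age": [45, 60], "glucose": [99.0, 126.0]})
pooled = pd.concat([a, b], ignore_index=True)  # bmi/glucose blocks are NaN

# Fit one joint model on the pooled table, then query across datasets,
# e.g. fill in glucose for dataset-A rows via the shared variable "age".
model = IterativeImputer(random_state=0).fit(pooled)
print(pd.DataFrame(model.transform(pooled), columns=pooled.columns))
```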

A Task and Motion Approach to the Development of Planning
João Loula, Kelsey Allen, Josh Tenenbaum
CogSci, 2020

Developmental psychology presents us with a puzzle: though children are remarkably apt at planning their actions, they suffer from surprising yet consistent shortcomings. We argue that these patterns of triumph and failure can be broadly captured by the framework of task and motion planning, where plans are hybrid entities consisting of both a structured, symbolic skeleton and a continuous, low-level trajectory.
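The hybrid-plan idea can be made concrete in a few lines: a planner searches over symbolic skeletons and, for each, attempts a continuous refinement under that skeleton's constraints. The toy reach-with-a-tool domain below is an illustrative assumption, not the paper's tasks.

```python
import numpy as np

REACH = 1.0        # how far the agent can reach directly
TOOL_REACH = 0.8   # extra reach gained by grabbing a tool

def refine(skeleton, target):
    # Continuous refinement: find a trajectory satisfying the skeleton's
    # constraints, or report that this skeleton is infeasible.
    reach = REACH + (TOOL_REACH if "grab_tool" in skeleton else 0.0)
    if target <= reach:
        return np.linspace(0.0, target, 5)  # a feasible reach trajectory
    return None

def plan(target):
    # Search skeletons cheapest-first; the first one whose continuous
    # refinement succeeds yields the hybrid plan (skeleton, trajectory).
    for skeleton in [("reach",), ("grab_tool", "reach")]:
        traj = refine(skeleton, target)
        if traj is not None:
            return skeleton, traj
    return None

print(plan(0.7))  # direct reach suffices
print(plan(1.5))  # only the tool-use skeleton is feasible
```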

Learning constraint-based planning models from demonstrations
João Loula, Kelsey Allen, Tom Silver, Josh Tenenbaum
IROS, 2020

We present a framework for learning constraint-based task and motion planning models using gradient descent. Our model observes expert demonstrations of a task and decomposes them into modes—segments which specify a set of constraints on a trajectory optimization problem.
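A minimal version of the idea: given demonstration points from a single mode, recover that mode's constraint parameters by gradient descent on its violation. The circular "stay at distance r from pivot p" constraint family and the synthetic data below are illustrative assumptions, not the paper's model.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic demonstration segment: points on a circle of radius 0.5
# around pivot (1, 2), plus observation noise.
true_p, true_r = np.array([1.0, 2.0]), 0.5
theta = rng.uniform(0, 2 * np.pi, size=200)
X = true_p + true_r * np.stack([np.cos(theta), np.sin(theta)], axis=1)
X += 0.01 * rng.normal(size=X.shape)

# Fit the mode's constraint parameters (pivot p, radius r) by gradient
# descent on the mean squared constraint violation (||x - p|| - r)^2.
p, r, lr = np.zeros(2), 1.0, 0.1
for _ in range(500):
    d = np.linalg.norm(X - p, axis=1)
    resid = d - r
    r -= lr * (-2 * resid.mean())
    p -= lr * (-2 * resid[:, None] * (X - p) / d[:, None]).mean(axis=0)

print(p, r)  # approaches the true pivot (1, 2) and radius 0.5
```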

Discovering a symbolic planning language from continuous experience
João Loula, Tom Silver, Kelsey Allen, Josh Tenenbaum
CogSci, 2019

We present a model that starts out with a language of low-level physical constraints and, by observing expert demonstrations, builds up a library of high-level concepts that afford planning and action understanding.

Rearranging the Familiar: Testing Compositional Generalization in Recurrent Networks
João Loula, Marco Baroni, Brenden Lake
EMNLP BlackboxNLP Workshop, 2018

We extend the study of systematic compositionality in seq2seq models to settings where the model needs only to recombine well-trained functional words. Our findings confirm and strengthen the earlier ones: seq2seq models can be impressively good at generalizing to novel combinations of previously seen input, but only when they receive extensive training on the specific pattern to be generalized.

Human Learning of Video Games
Pedro Tsividis, João Loula, Jake Burga, Thomas Pouncy, Sam Gershman, Josh Tenenbaum
NIPS Workshop on Cognitively Informed Artificial Intelligence (Spotlight Talk), 2017

We study human-level learning in Atari-like games, modeling how people learn theories from gameplay and use them to plan in a model-based manner.

Decoding fMRI activity in the time domain improves classification performance
João Loula, Gaël Varoquaux, Bertrand Thirion
NeuroImage, 2017

We show that fMRI decoding can be cast as a regression problem: fitting a design matrix to BOLD activation. Event classification is then easily obtained from the predicted design matrices. Our experiments show this approach outperforms state-of-the-art solutions, especially for designs with low inter-stimulus intervals, and the two-step nature of the model brings time-domain interpretability.
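Schematically, the two-step decoder looks like this on synthetic data: a ridge regression maps BOLD time series to the design matrix, and event classes are read off the predicted regressors. The shapes, noise level, and omission of HRF convolution are simplifying assumptions for illustration.

```python
import numpy as np
from sklearn.linear_model import Ridge

rng = np.random.default_rng(0)
T, V, C = 200, 50, 3        # time points, voxels, event conditions
design = np.zeros((T, C))   # one-hot event regressors over time
design[np.arange(T), rng.integers(C, size=T)] = 1.0
W = rng.normal(size=(C, V)) # per-condition voxel response pattern
bold = design @ W + 0.5 * rng.normal(size=(T, V))  # noisy BOLD signal

# Step 1: regression in the time domain, from BOLD to the design matrix.
train, test = slice(0, 150), slice(150, T)
model = Ridge(alpha=1.0).fit(bold[train], design[train])
pred_design = model.predict(bold[test])

# Step 2: event classification, read off the predicted design matrix.
accuracy = (pred_design.argmax(axis=1) == design[test].argmax(axis=1)).mean()
print("decoding accuracy:", accuracy)
```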

Loading and plotting of cortical surface representations in Nilearn
Julia Huntenburg, Alexandre Abraham, João Loula, Franziskus Liem, Kamalaker Dadi, Gaël Varoquaux
Research Ideas and Outcomes, 2017

We present an initial support of cortical surfaces in Python within the neuroimaging data processing toolbox Nilearn. We provide loading and plotting functions for different surface data formats with minimal dependencies, along with examples of their application. Limitations of the current implementation and potential next steps are discussed.
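A short usage sketch of this surface support, written against the current Nilearn API (function and key names may differ from the 2017 release): fetch a standard mesh, load per-vertex data, and plot it on the inflated hemisphere.

```python
from nilearn import datasets, plotting, surface

fsaverage = datasets.fetch_surf_fsaverage()  # standard fsaverage meshes
# Load per-vertex data with minimal dependencies (here, sulcal depth).
sulc = surface.load_surf_data(fsaverage["sulc_left"])
# Plot the map on the inflated left hemisphere.
plotting.plot_surf_stat_map(
    fsaverage["infl_left"], sulc, hemi="left",
    bg_map=fsaverage["sulc_left"], colorbar=True,
)
plotting.show()
```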


website template credit