Research
Data science typically involves analyzing structured tables and unstructured text to make predictions, impute missing data, discover relationships between variables, infer causal effects, or detect anomalies.
My work uses probabilistic programming to learn and query generative models for data science: for example, guiding transformers to convert unstructured text into structured data, and learning GPU-efficient generative models of tables that can solve a wide range of data science tasks.
|
|
Syntactic and Semantic Control of Large Language Models via Sequential Monte Carlo
João Loula, Benjamin LeBrun, Li Du, Ben Lipkin, Clemente Pasti, Gabriel Grand, Tianyu Liu, Yahya Emara, Marjorie Freedman, Jason Eisner, Ryan Cotterell, Vikash Mansinghka, Alexander K. Lew, Tim Vieira, Timothy J. O'Donnell
ICLR, 2025 (Oral, <1.8% of papers)
A wide range of LLM applications require generating text that conforms to syntactic or semantic constraints. Imposing such constraints nontrivially alters the distribution over sequences, usually making exact sampling intractable. In this work, building on the Language Model Probabilistic Programming framework of Lew et al. (2023), we develop an approach to approximate inference for controlled LLM generation based on sequential Monte Carlo (SMC). Our SMC framework allows us to flexibly incorporate domain- and problem-specific constraints at inference time, and efficiently reallocate computation in light of new information during the course of generation. We demonstrate that our approach improves downstream performance on four challenging domains---Python code generation for data science, text-to-SQL, goal inference, and molecule synthesis. We compare to a number of alternative and ablated approaches, showing that our accuracy improvements are driven by better approximation to the full Bayesian posterior.
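As a rough illustration of the idea (not the paper's implementation, which builds on the Language Model Probabilistic Programming framework), here is a minimal Python sketch of SMC steering, with a hypothetical next_token_logprobs stand-in for an LLM and a 0/1 constraint potential:

    import math, random

    VOCAB = ["SELECT", "FROM", "users", ";", "DROP"]

    def next_token_logprobs(prefix):
        # Hypothetical stand-in for an LLM's next-token distribution (uniform here).
        return {tok: math.log(1.0 / len(VOCAB)) for tok in VOCAB}

    def constraint_ok(prefix):
        # Problem-specific potential, e.g. "the program never emits DROP".
        return "DROP" not in prefix

    def smc_generate(n_particles=8, max_len=4):
        particles = [([], 0.0) for _ in range(n_particles)]   # (prefix, log-weight)
        for _ in range(max_len):
            proposals = []
            for prefix, logw in particles:
                logps = next_token_logprobs(prefix)
                toks = list(logps)
                tok = random.choices(toks, weights=[math.exp(logps[t]) for t in toks])[0]
                new_prefix = prefix + [tok]
                # Weight update: the constraint acts as a 0/1 potential on prefixes.
                new_logw = logw if constraint_ok(new_prefix) else float("-inf")
                proposals.append((new_prefix, new_logw))
            # Resampling reallocates computation toward promising partial sequences.
            ws = [math.exp(lw) for _, lw in proposals]
            if sum(ws) > 0:
                survivors = random.choices(proposals, weights=ws, k=n_particles)
                particles = [(p, 0.0) for p, _ in survivors]
            else:
                particles = proposals
        return [p for p, _ in particles]

    print(smc_generate())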
|
|
Fast Controlled Generation from Language Models with Adaptive Weighted Rejection Sampling
Benjamin Lipkin*, Benjamin LeBrun*, Jacob Hoover Vigly, João Loula, David R. MacIver, Li Du, Jason Eisner, Ryan Cotterell, Vikash Mansinghka, Tim Vieira
CoLM, 2024 (Outstanding Paper Award, <1% of papers)
The dominant approach to generating from language models subject to some constraint is locally constrained decoding (LCD), incrementally sampling tokens at each time step such that the constraint is never violated. Typically, this is achieved through token masking: looping over the vocabulary and excluding non-conforming tokens. There are two important problems with this approach. (i) Evaluating the constraint on every token can be prohibitively expensive, since language model vocabularies often exceed 100,000 tokens. (ii) LCD can distort the global distribution over strings, sampling tokens based only on local information, even if they lead down dead-end paths. This work introduces a new algorithm that addresses both these problems. First, to avoid evaluating a constraint on the full vocabulary at each step of generation, we propose an adaptive rejection sampling algorithm that typically requires orders of magnitude fewer constraint evaluations. Second, we show how this algorithm can be extended to produce low-variance, unbiased estimates of importance weights at a very small additional cost - estimates that can be soundly used within previously proposed sequential Monte Carlo algorithms to correct for the myopic behavior of local constraint enforcement. Through extensive empirical evaluation in text-to-SQL, molecular synthesis, goal inference, pattern matching, and JSON domains, we show that our approach is superior to state-of-the-art baselines, supporting a broader class of constraints and improving both runtime and performance. Additional theoretical and empirical analyses show that our method's runtime efficiency is driven by its dynamic use of computation, scaling with the divergence between the unconstrained and constrained language model, and as a consequence, runtime improvements are greater for better models.
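A minimal sketch of the core sampling step, assuming a generic logprobs dictionary and a boolean check constraint (both hypothetical; the paper's estimator and implementation differ):

    import math, random

    def sample_step(logprobs, check):
        # Sample one token satisfying `check`, evaluating `check` only on tokens
        # that are actually drawn, and adapting by removing rejected tokens.
        probs = {t: math.exp(lp) for t, lp in logprobs.items()}
        total = sum(probs.values())
        rejected_mass = 0.0
        while probs:
            toks, ws = zip(*probs.items())
            tok = random.choices(toks, weights=ws)[0]
            if check(tok):
                # Crude accepted-mass proxy; the paper derives a low-variance,
                # *unbiased* importance-weight estimator instead.
                return tok, 1.0 - rejected_mass / total
            rejected_mass += probs.pop(tok)   # never propose this token again
        return None, 0.0                      # every token violates the constraint

    tok, weight = sample_step({"a": math.log(0.5), "b": math.log(0.3), "c": math.log(0.2)},
                              lambda t: t != "b")
    print(tok, weight)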
|
|
Learning Bayesian Generative Models that Multitask Tabular Data via Sequential Monte Carlo
João Loula, Ulrich Schaechtle, Josh Tenenbaum, Tim O'Donnell, Vikash Mansinghka
[under review], 2026
Tabular data analysis tasks such as prediction, imputation, anomaly detection, and relationship discovery are commonplace in computational statistics, data science, and applied fields. Many popular approaches learn distinct models for every task, rather than build multivariate generative models that are guaranteed to produce coherent uncertainty measures across all queries. This paper introduces a scalable Bayesian generative modeling method for multitasking tabular data analyses, and shows that it can outperform mature, established baselines from computational statistics and deep learning. The approach learns multivariate generative models of the joint distribution over all columns in a table that provide massively parallel implementations of sampling, probability density calculation, conditioning, and marginalization. It builds on a novel GPU-accelerated, minibatch mixture Sequential Monte Carlo algorithm that exploits model structure to deliver the same asymptotic scaling as Stochastic Gradient Descent, yet produces properly weighted samples targeting the Bayesian posterior. Experiments show the approach delivers calibrated uncertainty across a broad range of dataset sizes. Experiments also show that this method can be orders of magnitude faster to train and to query than diffusion models and variational autoencoders, and outperforms task-specific methods for synthetic data generation, anomaly detection, and multiple imputation.
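To illustrate what "one joint model, many queries" means, here is a toy mixture-of-products model over two categorical columns with made-up parameters; the paper's models are learned with GPU-accelerated minibatch SMC and are far richer:

    import numpy as np

    weights = np.array([0.6, 0.4])                        # mixture weights
    # per-component categorical parameters for two columns, each with 3 values
    col_a = np.array([[0.7, 0.2, 0.1], [0.1, 0.3, 0.6]])
    col_b = np.array([[0.5, 0.4, 0.1], [0.2, 0.2, 0.6]])

    def joint_logpdf(a, b):
        comp = weights * col_a[:, a] * col_b[:, b]
        return np.log(comp.sum())

    def impute_b_given_a(a):
        # Posterior over components given column A, then marginalize column B.
        post = weights * col_a[:, a]
        post /= post.sum()
        return post @ col_b                               # P(B | A = a)

    def anomaly_score(a, b):
        return -joint_logpdf(a, b)                        # low density = anomalous

    print(joint_logpdf(0, 0))
    print(impute_b_given_a(2))
    print(anomaly_score(2, 0))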
|
|
Learning Generative Population Models From Multiple Clinical Datasets Via Probabilistic Programming
João Loula,
Katherine M. Collins, Ulrich Schaechtle, Joshua B. Tenenbaum, Adrian Weller, Feras Saad, Timothy J. O'Donnell, Vikash Mansinghka
ICML AccMLBio, 2024
Accurate, efficient generative models of clinical populations could accelerate clinical research and improve patient outcomes. For example, such models could infer probable treatment outcomes for different subpopulations, generate high-fidelity synthetic data that can be shared across organizational boundaries, and discover new relationships among clinical variables. Using Bayesian structure learning, we show that it is possible to learn probabilistic program models of clinical populations by combining data from multiple, sparsely overlapping clinical datasets. Through experiments with multiple clinical trials and real-world evidence from census health surveys, we show that our model generates higher quality synthetic data than neural network baselines, supports more accurate inferences across datasets than traditional statistical methods, and can be queried more efficiently than both, opening up new avenues for accessible and efficient AI assistance in clinical research.
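As a toy illustration of what "sparsely overlapping" means (column names made up), two studies measuring partially shared variables pool into one table with structured missingness, which a joint generative model handles as ordinary conditioning and marginalization:

    import pandas as pd

    trial = pd.DataFrame({"age": [54, 61], "drug": ["A", "B"], "ldl": [130, 110]})
    survey = pd.DataFrame({"age": [47, 72], "bmi": [31.0, 24.5], "ldl": [145, 98]})

    pooled = pd.concat([trial, survey], ignore_index=True)
    print(pooled)   # 'drug' is missing for survey rows, 'bmi' for trial rows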
|
|
A Task and Motion Approach to the Development of Planning
João Loula,
Kelsey Allen,
Josh Tenenbaum
CogSci, 2020
Developmental psychology presents us with a puzzle: though children are remarkably apt at planning their actions, they suffer from surprising yet consistent shortcomings. We argue that these patterns of triumph and failure can be broadly captured by the framework of task and motion planning, where plans are hybrid entities consisting of both a structured, symbolic skeleton and a continuous, low-level trajectory.
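A minimal sketch (illustrative only, not the paper's model) of the hybrid plan representation, with a symbolic action skeleton and per-action continuous trajectory segments:

    from dataclasses import dataclass
    from typing import Callable, List
    import numpy as np

    @dataclass
    class Action:
        name: str                                   # e.g. "grasp(block)"
        constraint: Callable[[np.ndarray], bool]    # must hold on the segment

    @dataclass
    class HybridPlan:
        skeleton: List[Action]           # symbolic, ordered action sequence
        trajectory: List[np.ndarray]     # one continuous segment per action

        def feasible(self) -> bool:
            return all(a.constraint(seg)
                       for a, seg in zip(self.skeleton, self.trajectory))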
|
|
Learning constraint-based planning models from demonstrations
João Loula,
Kelsey Allen,
Tom Silver,
Josh Tenenbaum
IROS, 2020
We present a framework for learning constraint-based task and motion planning models using gradient descent. Our model observes expert demonstrations of a task and decomposes them into modes: segments that specify a set of constraints on a trajectory optimization problem.
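A toy example of the idea, with a made-up constraint ("the end effector stays at height h") whose parameter is fit to a demonstrated segment by gradient descent:

    import numpy as np

    segment_z = np.array([0.31, 0.29, 0.30, 0.32, 0.30])   # demonstrated heights

    h, lr = 0.0, 0.1
    for _ in range(200):
        grad = 2 * (h - segment_z).mean()    # d/dh of the mean squared violation
        h -= lr * grad
    print(h)   # ~0.304: the constraint parameter explaining the segment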
|
|
Discovering a symbolic planning language from continuous experience
João Loula,
Tom Silver,
Kelsey Allen,
Josh Tenenbaum
CogSci, 2019
We present a model that starts out with a language of low-level physical constraints and, by observing expert demonstrations, builds up a library of high-level concepts that afford planning and action understanding.
|
|
Rearranging the Familiar: Testing Compositional Generalization in Recurrent Networks
João Loula,
Marco Baroni,
Brenden Lake
EMNLP BlackboxNLP Workshop, 2018
We extend the study of systematic compositionality in seq2seq models to settings where the model needs only to recombine well-trained functional words. Our findings confirm and strengthen the earlier ones: seq2seq models can be impressively good at generalizing to novel combinations of previously seen input, but only when they receive extensive training on the specific pattern to be generalized.
|
|
Human Learning of Video Games
Pedro Tsividis,
João Loula,
Jake Burga,
Thomas Pouncy,
Sam Gershman,
Josh Tenenbaum
NIPS Workshop on Cognitively Informed Artificial Intelligence (Spotlight Talk), 2017
We study human-level learning in Atari-like games, where agents learn theories from gameplay and use them to plan in a model-based manner.
|
|
Decoding fMRI activity in the time domain improves classification performance
João Loula,
Gaël Varoquaux,
Bertrand Thirion
NeuroImage, 2017
We show that fMRI decoding can be cast as a regression problem: predicting the design matrix from BOLD activation, after which event classification is easily obtained from the predicted design matrices. Our experiments show this approach outperforms state-of-the-art solutions, especially for designs with low inter-stimulus intervals, and the two-step nature of the model brings time-domain interpretability.
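A schematic version of the two-step decoder on synthetic data (a stand-in for real BOLD signals), using ridge regression for step one:

    import numpy as np
    from sklearn.linear_model import Ridge

    rng = np.random.default_rng(0)
    n_scans, n_voxels, n_conditions = 200, 50, 3
    design = rng.binomial(1, 0.1, size=(n_scans, n_conditions)).astype(float)
    weights = rng.normal(size=(n_conditions, n_voxels))
    bold = design @ weights + 0.5 * rng.normal(size=(n_scans, n_voxels))

    # Step 1: predict the design matrix from BOLD activity (time-domain decoding).
    reg = Ridge(alpha=1.0).fit(bold[:150], design[:150])
    design_hat = reg.predict(bold[150:])

    # Step 2: classify each test event as the condition with the largest
    # predicted design value.
    mask = design[150:].sum(axis=1) > 0        # time points with an event
    events = design[150:][mask].argmax(axis=1)
    predicted = design_hat[mask].argmax(axis=1)
    print((events == predicted).mean())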
|
|
Loading and plotting of cortical surface representations in Nilearn
Julia Huntenburg,
Alexandre Abraham,
João Loula,
Franziskus Liem,
Kamalaker Dadi,
Gaël Varoquaux
Research Ideas and Outcomes, 2017
We present initial support for cortical surfaces in Python within the neuroimaging data processing toolbox Nilearn.
We provide loading and plotting functions for different surface data formats with minimal dependencies, along with examples of their application.
Limitations of the current implementation and potential next steps are discussed.
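A short usage example, assuming a recent Nilearn installation (the fetch call downloads the fsaverage template on first use):

    from nilearn import datasets, plotting, surface

    fsaverage = datasets.fetch_surf_fsaverage()        # standard template meshes

    # Load per-vertex data (here, the sulcal depth map shipped with the template)
    # and plot it on the inflated left hemisphere.
    curv = surface.load_surf_data(fsaverage.sulc_left)
    plotting.plot_surf_stat_map(fsaverage.infl_left, curv,
                                hemi='left', colorbar=True)
    plotting.show()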
|
|