*The problem I address in this post is generating samples from multivariate distributions, without having any data.*

# Motivation

Generative models are capable of generating new data. Unlike discriminative models, which determine the likelihood of an outcome given a set of input features $P(Y|X)$, a generative model learns the joint distribution between variables $P(X,Y)$. In product development, they can be used for various use cases, including imputing missing data (e.g. with conditional models), determining the likelihood of an observed sample, or creating random samples of data. The last use case is the focus of this post.

Say we’d like to simulate requests to a prediction service, either to benchmark it or stress test it. For starters, let’s assume the model only takes one input - the number of product views by a customer in the last 30 days. To generate random samples, we can specify a normal distribution, with mean $\mu$ and variance $\sigma^2$, and then draw samples from it:

$$x \sim \mathcal{N}(\mu, \sigma^2)$$

Now, what happens if we have a second variable - say, customers’ number of visits to the website in the same time period? We can do the same thing we did with product views, which is to specify a normal distribution for the variable, and draw samples from it. But there is an issue with that - it misses out on the relationship between the two variables, since we’d be sampling each variable independently.
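To make the issue concrete, here is a minimal sketch of independent per-variable sampling (the means and standard deviations below are illustrative assumptions, not values from real data):

```python
import numpy as np

rng = np.random.default_rng(seed=42)

# Illustrative parameters: product views ~ N(25, 5^2), visits ~ N(10, 2^2)
views = rng.normal(loc=25, scale=5, size=1_000)
visits = rng.normal(loc=10, scale=2, size=1_000)

# Because each variable is drawn independently, the sample correlation
# is near zero, whatever the real relationship between views and visits
corr = np.corrcoef(views, visits)[0, 1]
```

No matter how strongly views and visits co-vary in reality, this procedure produces uncorrelated samples.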

A better solution is to use a multivariate normal distribution as a generative model of our data.
Multivariate normal distributions capture the relationships between variables.
Given a *d*-dimensional vector, $x = (x_{1}, x_{2}, …, x_{d})$, where each dimension represents a variable (e.g. product views, visits, ratings), we can have a multivariate distribution

$$ x \sim \mathcal{N}_{d}(\mu, \Sigma) $$

where $\mu$ is now a vector of the means for each variable, and $\Sigma$ is the variance-covariance matrix, capturing variation between every pair of variables. If we have a dataset with records of customers, we can estimate the multivariate distribution from it. But what can we do if we wish to generate random samples yet lack the data, or lack access to it for our use case, e.g. due to user privacy concerns? We need a way to create a multivariate distribution with certain properties, from which we can randomly sample synthetic users.
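For contrast, if we did have data, estimating $\mu$ and $\Sigma$ is straightforward with NumPy. A minimal sketch, using a synthetic stand-in dataset since no real one is assumed to be available:

```python
import numpy as np

rng = np.random.default_rng(seed=0)

# Stand-in for a real dataset: 500 customers x 2 features (views, visits)
data = rng.multivariate_normal(mean=[25, 10],
                               cov=[[10, 3], [3, 2]],
                               size=500)

mu_hat = data.mean(axis=0)              # per-variable means
sigma_hat = np.cov(data, rowvar=False)  # 2x2 variance-covariance matrix
```

The rest of the post is about what to do when no such `data` array exists.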

## Synthetic Multivariate Distributions

To create a multivariate distribution, it is insufficient to define the mean and variance for each dimension. Since the variables have relationships - e.g. some are positively and others negatively correlated - we also need to specify the covariance between every pair of variables. In a multivariate distribution, a covariance matrix $\Sigma$ is what captures those relationships.

A covariance or variance-covariance matrix is a symmetric square matrix. On its diagonal, it has the variance of each dimension $d$, and every entry outside the diagonal has the covariance between two dimensions of the vector $x$, $\mathcal{COV}(X_{i}, X_{j})$. Note that the covariance between variables $(X_{i}, X_{j})$ is the same as the covariance between $(X_{j}, X_{i})$, which is why the matrix is symmetric.
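A small example with made-up numbers shows these properties directly:

```python
import numpy as np

# Hypothetical covariance matrix for (views, visits):
# variances 10 and 2 on the diagonal, covariance 3 off the diagonal
sigma = np.array([[10.0, 3.0],
                  [3.0, 2.0]])

is_symmetric = np.allclose(sigma, sigma.T)  # COV(X_i, X_j) == COV(X_j, X_i)
variances = np.diag(sigma)                  # per-variable variances
```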

Now, back to our problem of generating samples, we have established that we need a valid covariance matrix.
One that is symmetric and positive semidefinite, with the variance of each variable along its diagonal.
Thus, we ask, “*how might we define such a matrix?*”
To answer that question, we turn to one of the most important matrix decompositions.

## Decomposing Covariance

From linear algebra, we have the following factorization: $$S = Q \Lambda Q^{T}$$

where $S$ is a symmetric matrix, $Q$ is an orthogonal matrix whose columns are orthonormal eigenvectors of $S$, and $\Lambda$ is a diagonal matrix holding the eigenvalues of $S$.

Since the covariance matrix is a symmetric matrix, the decomposition should hold for it as well:

$$\Sigma = Q \Lambda Q^{T}$$
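We can check this numerically with `np.linalg.eigh`, which is designed for symmetric matrices (the matrix below is an illustrative assumption):

```python
import numpy as np

# Example symmetric covariance matrix
sigma = np.array([[10.0, 3.0],
                  [3.0, 2.0]])

# eigh returns the eigenvalues (the diagonal of Lambda) and the matrix Q,
# whose columns are orthonormal eigenvectors of sigma
eigenvalues, Q = np.linalg.eigh(sigma)

# Reconstruct sigma as Q Lambda Q^T
reconstructed = Q @ np.diag(eigenvalues) @ Q.T
```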

Thus, to create a valid synthetic covariance matrix, one needs to specify the eigenvalues $\Lambda$, which define the magnitudes of the variance along each principal direction (they must be non-negative for the result to be a valid covariance matrix), and an orthogonal matrix $Q$. While we can easily choose $\Lambda$ ourselves, generating $Q$ is more involved.

Fortunately, Mezzadri (2007) proposed a method for generating random orthogonal matrices in the paper ‘How to Generate Random Matrices from the Classical Compact Groups’. The method is available in SciPy as `scipy.stats.ortho_group`.

Given a randomly generated orthogonal matrix $Q$ and a choice of eigenvalues $\Lambda$, we can create a covariance matrix. The code below illustrates how to do that.

```python
import numpy as np
from scipy import stats

# chosen mean values: product views, visits - last 30D
mu = np.array([25, 10])
dim = len(mu)

# random orthogonal matrix Q
ortho_matrix = stats.ortho_group.rvs(dim)

# chosen eigenvalues - the magnitudes of the variance
eigenvalues = np.array([10, 2])

# covariance matrix Q Lambda Q^T; dim x dim
cov = ortho_matrix @ np.diag(eigenvalues) @ ortho_matrix.T
```

Given the mean and covariance matrix, sampling random data is straightforward:

```python
mv_normal = stats.multivariate_normal(mu, cov)
num_samples = 100
samples = mv_normal.rvs(size=num_samples)
```
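As a sanity check, the empirical covariance of a large sample should be close to the matrix we constructed. A self-contained sketch repeating the construction above, with seeds added for reproducibility:

```python
import numpy as np
from scipy import stats

mu = np.array([25.0, 10.0])

# Same construction as above: random Q, chosen eigenvalues
Q = stats.ortho_group.rvs(dim=2, random_state=7)
cov = Q @ np.diag([10.0, 2.0]) @ Q.T

samples = stats.multivariate_normal(mu, cov).rvs(size=50_000, random_state=7)

empirical_mu = samples.mean(axis=0)
empirical_cov = np.cov(samples, rowvar=False)
```

With 50,000 samples, both the empirical mean and the empirical covariance land close to the parameters we specified.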

## Closing

In this post, we walked through the steps to create a multivariate normal distribution without data. Doing this is useful when we need to generate random samples of data, but lack access to real data to estimate the parameters of the distribution.

There are a few things we should consider with the approach I’ve described here. The first is that we are not in control of the variance in each dimension. If we wish to have that control, we need to alter the values along the diagonal of the covariance matrix, or keep generating random covariance matrices until we find one that meets our needs. Second, this method assumes each variable is normally distributed; binary variables, for instance, aren’t supported as is. Despite that, multivariate normal data is fairly common, and the method described here ought to work on a wide range of use cases.
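If controlling the per-variable variances matters, one option mentioned above is to overwrite the diagonal directly. A sketch (the target variances are assumptions, and the edited matrix should be re-checked for positive semidefiniteness, since changing the diagonal can break it):

```python
import numpy as np
from scipy import stats

# Construct a random covariance matrix as before
Q = stats.ortho_group.rvs(dim=2, random_state=3)
cov = Q @ np.diag([10.0, 2.0]) @ Q.T

# Pin the diagonal to the variances we actually want
target_variances = np.array([9.0, 4.0])
np.fill_diagonal(cov, target_variances)

# Verify all eigenvalues are still non-negative,
# i.e. the edited matrix is still a valid covariance matrix
eigs = np.linalg.eigvalsh(cov)
```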