Histogram sampling

Given an unknown probability distribution and sufficient data sampled from it, it is possible to sample additional data approximating the unknown distribution with a histogram. Let’s pretend we have some data, which are sampled from an unknown distribution:

import numpy as np
import matplotlib.pyplot as plt

np.random.seed(42)
data = np.random.rayleigh(size=100) ## unknown distribution

We can generate a histogram with an arbitrary number of bins. The more are the bins, the most precise will be the sampling, but most of the bins should contain a sufficient number of counts to be statistically significant. As a rule of thumb, I reduce the number of bins until the averaged populated bins have at least 5 elements.

bins = np.linspace(0, 5, 20)
hist, _ = np.histogram(data, bins=bins)

To sample from this histogram it is sufficient to calculate its cumulative distribution, extract a random number from a uniform distribution between 0 and 1 and finally find the index of the cumulative distribution where the random number should be inserted to keep the cumulative distribution sorted. The sampled number from the histogram will be the histogram bin corresponding to this histogram. To achieve this we can use the np.cumsum to calculate the cumulative distribution, the np.random.rand function to extract 1000 random number sampled from a uniform distribution, and the np.searchsorted function to find the index position of where these 1000 number should be inserted to preserve the ordering:

bin_midpoints = (bins[:-1] + bins[1:])/2
cdf = np.cumsum(hist)
cdf = cdf / cdf[-1]

values = np.random.rand(1000)
value_bins = np.searchsorted(cdf, values)
random_from_cdf = bin_midpoints[value_bins]

We can now plot the original (on the left) together with the histogram of the new sampled data random_from_cdf:

fig, (ax0, ax1) = plt.subplots(1, 2, figsize=(9, 3))
ax0.hist(data, bins=bins)
ax1.hist(random_from_cdf, bins=bins)

for ax in (ax0, ax1):
    ax.set(xlim=(0,5), ylabel='Counts')
    ax.tick_params(axis='x', color='none')
    ax.tick_params(axis='y', color='none')
    [ax.spines[d].set_visible(False) for d in ('top', 'left', 'right')]

histogram comparison

Histogram sampling in 2D

In two dimension the concept is the same, the only trick is to flatten the indexes when calculating the cumulative distribution:

np.random.seed(42)
BIN_COUNT = 25

data = np.column_stack((np.random.rayleigh(scale=30, size=1000),
                        np.random.normal(scale=15, size=1000)))
x, y = data.T

hist, x_bins, y_bins = np.histogram2d(x, y, bins=(BIN_COUNT, BIN_COUNT))
x_bin_midpoints = (x_bins[:-1] + x_bins[1:])/2
y_bin_midpoints = (y_bins[:-1] + y_bins[1:])/2

cdf = np.cumsum(hist.flatten())
cdf = cdf / cdf[-1]

values = np.random.rand(10000)
value_bins = np.searchsorted(cdf, values)
x_idx, y_idx = np.unravel_index(value_bins,
                                (len(x_bin_midpoints),
                                 len(y_bin_midpoints)))
random_from_cdf = np.column_stack((x_bin_midpoints[x_idx],
                                   y_bin_midpoints[y_idx]))
new_x, new_y = random_from_cdf.T


fig, axs = plt.subplots(1, 2, figsize=(9, 3))

axs[0].hist2d(x, y, bins=(BIN_COUNT, BIN_COUNT))
axs[0].set_title('Original data histogram')
axs[1].hist2d(new_x, new_y, bins=(BIN_COUNT, BIN_COUNT))
axs[1].set_title('New sampled data histogram')
for ax in axs:    
    ax.set(aspect='equal')
    ax.set_axis_off()

histogram 2d comparison