# Histogram sampling

Given an unknown probability distribution and sufficient data sampled from it, it is possible to sample additional data approximating the unknown distribution with a histogram. Letâ€™s pretend we have some `data`

, which are sampled from an unknown distribution:

```
import numpy as np
import matplotlib.pyplot as plt
np.random.seed(42)
data = np.random.rayleigh(size=100) ## unknown distribution
```

We can generate a histogram with an arbitrary number of bins. The more are the bins, the most precise will be the sampling, but most of the bins should contain a sufficient number of counts to be statistically significant. As a rule of thumb, I reduce the number of bins until the averaged populated bins have at least 5 elements.

```
bins = np.linspace(0, 5, 20)
hist, _ = np.histogram(data, bins=bins)
```

To sample from this histogram it is sufficient to calculate its cumulative distribution, extract a random number from a uniform distribution between `0`

and `1`

and finally find the index of the cumulative distribution where the random number should be inserted to keep the cumulative distribution sorted. The sampled number from the histogram will be the histogram bin corresponding to this histogram. To achieve this we can use the `np.cumsum`

to calculate the cumulative distribution, the `np.random.rand`

function to extract `1000`

random number sampled from a uniform distribution, and the `np.searchsorted`

function to find the index position of where these `1000`

number should be inserted to preserve the ordering:

```
bin_midpoints = (bins[:-1] + bins[1:])/2
cdf = np.cumsum(hist)
cdf = cdf / cdf[-1]
values = np.random.rand(1000)
value_bins = np.searchsorted(cdf, values)
random_from_cdf = bin_midpoints[value_bins]
```

We can now plot the original (on the left) together with the histogram of the new sampled data `random_from_cdf`

:

```
fig, (ax0, ax1) = plt.subplots(1, 2, figsize=(9, 3))
ax0.hist(data, bins=bins)
ax1.hist(random_from_cdf, bins=bins)
for ax in (ax0, ax1):
ax.set(xlim=(0,5), ylabel='Counts')
ax.tick_params(axis='x', color='none')
ax.tick_params(axis='y', color='none')
[ax.spines[d].set_visible(False) for d in ('top', 'left', 'right')]
```

## Histogram sampling in 2D

In two dimension the concept is the same, the only trick is to flatten the indexes when calculating the cumulative distribution:

```
np.random.seed(42)
BIN_COUNT = 25
data = np.column_stack((np.random.rayleigh(scale=30, size=1000),
np.random.normal(scale=15, size=1000)))
x, y = data.T
hist, x_bins, y_bins = np.histogram2d(x, y, bins=(BIN_COUNT, BIN_COUNT))
x_bin_midpoints = (x_bins[:-1] + x_bins[1:])/2
y_bin_midpoints = (y_bins[:-1] + y_bins[1:])/2
cdf = np.cumsum(hist.flatten())
cdf = cdf / cdf[-1]
values = np.random.rand(10000)
value_bins = np.searchsorted(cdf, values)
x_idx, y_idx = np.unravel_index(value_bins,
(len(x_bin_midpoints),
len(y_bin_midpoints)))
random_from_cdf = np.column_stack((x_bin_midpoints[x_idx],
y_bin_midpoints[y_idx]))
new_x, new_y = random_from_cdf.T
fig, axs = plt.subplots(1, 2, figsize=(9, 3))
axs[0].hist2d(x, y, bins=(BIN_COUNT, BIN_COUNT))
axs[0].set_title('Original data histogram')
axs[1].hist2d(new_x, new_y, bins=(BIN_COUNT, BIN_COUNT))
axs[1].set_title('New sampled data histogram')
for ax in axs:
ax.set(aspect='equal')
ax.set_axis_off()
```