Error in 5.12 Sample offline data when k > n / 2?

#1

Hi!

I am wondering if the following claim from the book is correct:

When k is bigger than n/2, we can optimize by computing a subset of n - k elements to remove from the set.

For example, let’s consider the list [a, b, c] from which we want a random subset of size k = 2.

The possible subsets are ab, ba, ac, ca, bc, cb (I’m ignoring formatting).

But by picking n - k = 1 element to remove, we can only get the following final arrays:

  • [a, b, c] (we picked a and swapped it with itself). This give bc
  • [b, a, c] (we picked b and swapped it with a). This gives ac
  • [c, b, a] (we picked c and swapped it with a). This gives ba

Which has 2 drawbacks:

  • Now the “random subset” is at the end of the array instead of at the beginning. This can be fixed by swapping with the last element instead of the first one, but that’s extra code.
  • More importantly, some other subsets are impossible to generate.

Maybe I am missing something, or not doing it correctly?

Thanks for any feedback!

0 Likes

#2
  1. Instead of putting elements at the start, you can put them at the back while computing.
  2. Although, it is missing subsets, the probability is still the same. This is exactly where it is saving time. For example, consider subset [b,c]. If we generate all six subsets, the probability of getting this subset is 1/3 because [b,c] and [c,b] are same. And when you generate three subsets, the probability is still 1/3.
0 Likes