Time complexity for Sample Online Data 5.13

#1

The book says time complexity for the brute-force solution is o(nk). Isn’t this wrong? We are first reading all elements which will take O(n) time and O(n) memory. Then we are essentially using the same solution for 5.12 (Sample Offline Data) to get the random subset. This will take O(k) additional time. So the total time complexity should be O(n+k). Now since k<n and k=n in worst case, time complexity drops to O(2n) or O(n). Whereas according to the book, the worst case complexity seems to be O(n*n) or O(n^2). And this is not true. How can I get this rectified for future versions?

0 Likes

#2

I think your misunderstanding is that you’re assuming you’ll only need to store the n packets, and perform the random sampling one time. This would be fine if you only wanted the random sampling at the end of the stream. But imagine for example if you wanted to see representative data at a few different times while the stream is still going. This is the problem the book is presenting, and it’s a bit different.

The book states you must “maintain a uniform random subset of size k…” meaning that at any point in the stream, the subset should already exist. In order to make this work, using the 5.12 solution, you’d need to recompute the k-length subset for EACH packet in order to guarantee randomness. Your solution is O(nk) because you’re assuming you’d only compute a subset one time, at the end of the stream. But in reality, you need to maintain a changing subset which is why that approach has a higher time complexity as noted in the book.

0 Likes