Question on Heap chapter's boot camp code


#1

Hi:

I have a question regarding the code in the book camp of heaps chapter, Python version.

In the top_k(k, stream) implementation, the last line uses heapq.nsmallest function to sort the k elements in the current heap. I am just confused why not using a sort() function directly to the current heap array (if result needs to be sorted), or just output the elements in the heap array sequentially (if the result doesn’t need to be sorted)?

The code in the book is actually very similar to an implementation of nlargest(k, L) of a heap (except that the input is a stream). I searched its actual implementation: the last operation is actually sort(reverse = True). However, if you use nsmallest function in the last line, its real execution will be to construct an identical list of the current heap, heapify that list again (because the function doesn’t know it is already a heap. Though no operation is needed, checks will be done), and then pop out the elements one by one, which is wasteful in both time and space.

Am I correct?

Thanks.


#2

Sorting of an array of length n is different from getting the k smallest elements of such array (assuming k << n). That is the hint of why we use nsmallest.


#3

Hi @tsunghsienlee:

Thanks for your reply, but I am not quite sure if I made myself clear.

I do understand that “sorting of an array of length n is different from getting the k smallest (or largest) elements of such array (assuming k << n).” My question is not challenging this fact.

Let’s go back to the original problem in the boot camp. The problem is to find the k largest elements from a stream of n elements (where k might be much smaller than n). The solution is to construct and keep a min-heap of size k. By popping the smallest element in the min-heap each time when reading a new element from the stream until consuming all the n elements in the stream, one can guarantee that the remaining elements in the heap are k largest elements of the original stream.

My question is NOT “why not sorting the n elements from the beginning”. My question is: inside the code, the last line outputs the k elements remaining in the heap (which is of size k) by function heapq.nsmallest. I don’t understand why you will need to use this function for the output, given that you have only k elements (not n) in the heap now.

My understanding is: the k elements in the heap are guaranteed to be the largest k elements of the stream, so why not outputting them directly, if the output doesn’t need to be sorted? Or if it needs to be sorted, you can just sort them, there are only k elements now (not n).

Instead, if you use heapq.nsmallest function in the last step, we can see what will be happening from the inside implementation of this function in Python. The fact is: the heap of k elements will be passed to construct a new list, and that the new list will be heapified, and then the k elements will be sorted. In my view, it is unnecessary for all steps but the last one, because in fact, you know the input to the nsmallest function is already a heap, so you can just sort it (which is the last step of nsmallest function itself) and skip the other steps of a standard nsmallest function.

Not sure if I made myself clear and if I am correct?

Thanks.


#4

Basically the complexity is different, and one is O(nlogn) and the other one is O(nlogk). In this problem, of course the number of elements in min_heap is k but in this problem this two definitely has same effect but it is always good practice to use the right API here.

I think the output will be the same as sorted, but here we just use the right API even though the actually runtime complexity is the same.


#5

Thank you a lot @tsunghsienlee for the explanation!