1.57 Storing indices as values to save space.

classic Classic list List threaded Threaded
6 messages Options
Reply | Threaded
Open this post in threaded view
|  
Report Content as Inappropriate

1.57 Storing indices as values to save space.

Tim Finer
Hi Adam,

I'm trying yet another approach, heh.  This time with 1.57.

I'm going to see if I can't just store the spatial data in a memory
mapped vector and indices into it in an in-memory rtree, as long as I
relax my requirement for the size of the data a bit.

What I thought I'd try is storing the smallest sized unsigned integer
index possible (given the data) and reference a shared memory that I
control the layout of.  As a test, I set up an rtree with the value of
std::uint32_t and passed in an IndexableGetter functor that returns an
indexable double triplet when it receives the std::uint32_t value.  I
used the packing constructor to minimize the memory footprint.  This all
works, except, it looks like the rtree really stores a pair of an
indexable point and a value and not just the values.

I did some before and after memory snapshots and determined the size of
the memory used is the same regardless if I set the value to an index,
or if I fill the rtree with the double triplets directly.  My
expectation was that the functor would be used each time that the
indexable is needed, because the documentation said "Important: The
translation is done quite frequently inside the container - each time
the rtree needs it".

Except, that isn't really the case.  What really happens is that the
rtree caches a temporary vector of the entire size incoming data.

pack_create.hpp, around line 155:
std::vector<entry_type> entries;
entries.reserve(values_count);

Where entry_type is a std::pair<point_type, InIt>, where
sizeof(point_type) is 24 and InIt is the incoming value iterator.

I also tried dynamically inserting nodes too, but that used even more
memory, although I didn't spend too much time looking at why.

So my question is, what is the purpose of the IndexableGetter?  Is there
a way to trade off speed for space such that the rtree uses an
IndexableGetter all the way through?  Was this by design?

Best,
Tim
_______________________________________________
Geometry mailing list
[hidden email]
http://lists.boost.org/mailman/listinfo.cgi/geometry
Reply | Threaded
Open this post in threaded view
|  
Report Content as Inappropriate

Re: 1.57 Storing indices as values to save space.

Tim Finer
Hi Adam,

Actually, I think what rtree is doing with the temporary vector in the
packing constructor should be considered a bug.

That problem is that the vector is created without regard to the rtree's
allocator - so even if a user gives the rtree a memory mapped file
allocator, the rtree creates a copy of the incoming data in memory,
ignoring the allocator.

Best,
Tim

On 11/26/14, 10:37 AM, Tim Finer wrote:

> Hi Adam,
>
> I'm trying yet another approach, heh.  This time with 1.57.
>
> I'm going to see if I can't just store the spatial data in a memory
> mapped vector and indices into it in an in-memory rtree, as long as I
> relax my requirement for the size of the data a bit.
>
> What I thought I'd try is storing the smallest sized unsigned integer
> index possible (given the data) and reference a shared memory that I
> control the layout of.  As a test, I set up an rtree with the value of
> std::uint32_t and passed in an IndexableGetter functor that returns an
> indexable double triplet when it receives the std::uint32_t value.  I
> used the packing constructor to minimize the memory footprint.  This
> all works, except, it looks like the rtree really stores a pair of an
> indexable point and a value and not just the values.
>
> I did some before and after memory snapshots and determined the size
> of the memory used is the same regardless if I set the value to an
> index, or if I fill the rtree with the double triplets directly.  My
> expectation was that the functor would be used each time that the
> indexable is needed, because the documentation said "Important: The
> translation is done quite frequently inside the container - each time
> the rtree needs it".
>
> Except, that isn't really the case.  What really happens is that the
> rtree caches a temporary vector of the entire size incoming data.
>
> pack_create.hpp, around line 155:
> std::vector<entry_type> entries;
> entries.reserve(values_count);
>
> Where entry_type is a std::pair<point_type, InIt>, where
> sizeof(point_type) is 24 and InIt is the incoming value iterator.
>
> I also tried dynamically inserting nodes too, but that used even more
> memory, although I didn't spend too much time looking at why.
>
> So my question is, what is the purpose of the IndexableGetter?  Is
> there a way to trade off speed for space such that the rtree uses an
> IndexableGetter all the way through?  Was this by design?
>
> Best,
> Tim

_______________________________________________
Geometry mailing list
[hidden email]
http://lists.boost.org/mailman/listinfo.cgi/geometry
Reply | Threaded
Open this post in threaded view
|  
Report Content as Inappropriate

Re: 1.57 Storing indices as values to save space.

Adam Wulkiewicz
In reply to this post by Tim Finer
Hi Tim,

Tim Finer wrote:
> Hi Adam,
>
> I'm trying yet another approach, heh.  This time with 1.57.
>
> I'm going to see if I can't just store the spatial data in a memory
> mapped vector and indices into it in an in-memory rtree, as long as I
> relax my requirement for the size of the data a bit.
>

How would this solve your problem?

> What I thought I'd try is storing the smallest sized unsigned integer
> index possible (given the data) and reference a shared memory that I
> control the layout of.  As a test, I set up an rtree with the value of
> std::uint32_t and passed in an IndexableGetter functor that returns an
> indexable double triplet when it receives the std::uint32_t value.  I
> used the packing constructor to minimize the memory footprint.  This
> all works, except, it looks like the rtree really stores a pair of an
> indexable point and a value and not just the values.
>
> I did some before and after memory snapshots and determined the size
> of the memory used is the same regardless if I set the value to an
> index, or if I fill the rtree with the double triplets directly.  My
> expectation was that the functor would be used each time that the
> indexable is needed, because the documentation said "Important: The
> translation is done quite frequently inside the container - each time
> the rtree needs it".
>
> Except, that isn't really the case.  What really happens is that the
> rtree caches a temporary vector of the entire size incoming data.
>
> pack_create.hpp, around line 155:
> std::vector<entry_type> entries;
> entries.reserve(values_count);
>
> Where entry_type is a std::pair<point_type, InIt>, where
> sizeof(point_type) is 24 and InIt is the incoming value iterator.
>

The IndexableGetter allows to retrieve an Indexable (Point, Box or
Segment) bound with the Value stored in an rtree. So in your case to
translate std::uint32_t to some Point or Box. It's used e.g. in line 211
indirectly by the translator(). A Value stored in the rtree is of a type
passed as the first template parameter so uint32_t.

Packing algorithm works in a top-down manner. It must have access to all
values and in the process it is sorting the entire set on each level.
This is why it must use some intermediate container to store data since
it can't sort an input range. The std::vector<entry_type> is this
temporary container storing pairs of a point (which is a centroid of
Indexable) and an original iterator to access the value later (line
210). The Point is stored to avoid retrieving an Indexable and
calculating its centroid each time an entry is compared during sorting.
Does it take too much memory in your case?

Unfortunately to fully support pack-creation of a persistent rtree we
should use a temporary persistent storage/file for all of the temporary
elements and use this storage/file for sorting.

I'm guessing that using the same allocator as in the case of an rtree
(and boost::container::vector<>) could somehow solve the problem as long
as not the whole mapped file must be loaded in memory, I'm not sure if
this is the case. Still, a lot more free space would be needed in a
mapped file.

> I also tried dynamically inserting nodes too, but that used even more
> memory, although I didn't spend too much time looking at why.

Inserting elements using balancing algorithm will result in creation of
more nodes than in the case of packing algorithm which creates a minimum
number of nodes for given Max and Min. This is why the tree is bigger.

> So my question is, what is the purpose of the IndexableGetter?  Is
> there a way to trade off speed for space such that the rtree uses an
> IndexableGetter all the way through?  Was this by design?

As explained above IndexableGetter is used as intended.

Regards,
Adam
_______________________________________________
Geometry mailing list
[hidden email]
http://lists.boost.org/mailman/listinfo.cgi/geometry
Reply | Threaded
Open this post in threaded view
|  
Report Content as Inappropriate

Re: 1.57 Storing indices as values to save space.

Adam Wulkiewicz
In reply to this post by Tim Finer
Hi Tim,

Tim Finer wrote:

> Hi Adam,
>
> Actually, I think what rtree is doing with the temporary vector in the
> packing constructor should be considered a bug.
>
> That problem is that the vector is created without regard to the
> rtree's allocator - so even if a user gives the rtree a memory mapped
> file allocator, the rtree creates a copy of the incoming data in
> memory, ignoring the allocator.
>

Not really a copy of incoming data. A calculated centroids and iterators
to the original data.

Yes, rtree's allocator could be passed to the vector (as mentioned in
the other email). Then iterators shouldn't be stored since they might be
not suitable for storing in shared memory. So probably a copy of actual
objects should be made here. And this could be done only if an allocator
was different than std::allocator<> so at least in this case it wouldn't
harm the performance. Also in the case of passing RandomAccessIterators
elements indexes could be stored instead of iterators.

But would it solve your problem?

Alterntively a second allocator could be passed for this temporary
container, so a second, temporary file could be used. This would
probably be the most convenient for you, am I right? However this would
require to complicate the interface (constructors) so I don't like it.

We should definietly think about it when implementing a "proper"
persistance support.

Regards,
Adam
_______________________________________________
Geometry mailing list
[hidden email]
http://lists.boost.org/mailman/listinfo.cgi/geometry
Reply | Threaded
Open this post in threaded view
|  
Report Content as Inappropriate

Re: 1.57 Storing indices as values to save space.

Tim Finer
Hi Adam,

On 11/26/14, 11:49 AM, Adam Wulkiewicz wrote:

> Hi Tim,
>
> Tim Finer wrote:
>> Hi Adam,
>>
>> Actually, I think what rtree is doing with the temporary vector in
>> the packing constructor should be considered a bug.
>>
>> That problem is that the vector is created without regard to the
>> rtree's allocator - so even if a user gives the rtree a memory mapped
>> file allocator, the rtree creates a copy of the incoming data in
>> memory, ignoring the allocator.
>>
>
> Not really a copy of incoming data. A calculated centroids and
> iterators to the original data.
Ah I see.  IndexableGetter must be the source for the centroid though,
right?  Can you use the IndexableGetter to calculate and sort on the
centroid without allocating a temporary vector of them?

>
> Yes, rtree's allocator could be passed to the vector (as mentioned in
> the other email). Then iterators shouldn't be stored since they might
> be not suitable for storing in shared memory. So probably a copy of
> actual objects should be made here. And this could be done only if an
> allocator was different than std::allocator<> so at least in this case
> it wouldn't harm the performance. Also in the case of passing
> RandomAccessIterators elements indexes could be stored instead of
> iterators.
Except I think the word "should" belongs there instead of "could". :)  
The whole point of passing in an allocator is for naught if the class
selectively uses it - it doesn't matter about shared memory allocators
or not.  All of the classes that I'm familiar with that use an allocator
use it completely (like the stdlib containers).

Step back from your knowledge of the implementation and think about this
as a user of the rtree.  Would you really expect the rtree to allocate
space from the heap, after giving it an allocator?

The rtree shouldn't second guess the intentions of the users here and
allocate memory from the heap for a temporary structure.  There are
plenty of important uses cases to use other kinds of allocators that
don't have anything to do with shared or mapped memory: debugging
allocators, small object allocators, memory pools, etc. What makes this
a bug is that it is a surprise.  It probably would have gone unnoticed
by me but for the operating system killing my testbed.  :)

The need for this heap allocated vector appears to be packing related,
would it make sense to expose the packing algorithm as a free function
like the std::transform in <algorithm> instead? It could even take a
callable object, just like transform does (it'd be nice to reuse the
IndexableGetter there, if that's what it needs).

rtree::pack( srcBegin, srcEnd, packedBegin, packedEnd );

where srcBegin and srcEnd are random access, const_iterators and
std::distance(srcBegin,srcEnd) == std::distance(packedBegin, packedEnd).

Or an output iterator:
rtree::pack<Type, Type, Type>( srcBegin, srcEnd, packedIter );

Then the user would be responsible for allocating the packed container
and the resultant packing what would be passed in as a ctor (or a
factory method that has some sort of name that says "packed"):

auto rtree =
bgi::rtree<ValueType,Partitioner,Indexer>make_packed_rtree(packedBegin,
packedEnd);


// Or, the constructor
auto rtree = bgi::rtree<ValueType,Partitioner,Indexer>(packedBegin,
packedEnd);

That gives the flexibility of allocation without compromising the API.  
I think it is cleaner to make a bulk loaded rtree construction more
explicit (vs. insert) and it removes the "surprise" of allocation that
packing requires.

>
> But would it solve your problem?

Yes.  My test of 50,000,000 points died because it ran out of memory
(this is a "small" test).  It was a big surprise to see that the rtree
needed to allocate outside of the allocator and not use the functor to
derive what it needed.  I'm not focused on persistence, that was a means
to an end.  My overall goal is to give my users the query ability of
data too large to fit in memory.  If the rtree only stored the ValueType
as implied in the documentation and only used the allocater, as implied
by the API...
>
> Alterntively a second allocator could be passed for this temporary
> container, so a second, temporary file could be used. This would
> probably be the most convenient for you, am I right? However this
> would require to complicate the interface (constructors) so I don't
> like it.
I agree that putting yet another allocator in the constructor makes it
even more complex than it should be.  Does the rtree really need to keep
the centroids of all of the data while it is packing? Couldn't those be
derived from calling the passed in IndexableGetter?  If not, then maybe
getting the centroid is fundamental enough to deserve its own functor,
or a some other way to model that supports indexing, finding the
centroid and equality?

If the packing algorithm really needs the temporary storage, then I
think that the constructor is doing too much and the functionality that
is packing should be made a free function and the rtree's constructor
made simpler.

>
> We should definietly think about it when implementing a "proper"
> persistance support.
Yeah, persistence support is another thing entirely.  I'm talking about
the happy case of memory based allocation right now - rtree is not
honoring its API contract with regard to allocators.  The harder problem
of persistence is another story, but yes, that would be grand to have
all that too!

I'll take a look to see if I can't figure out if any of my ideas make
any sense.  I don't know the domain well, but I do have a lot of C++
experience.  I'll let you know either way.

Best,
Tim
_______________________________________________
Geometry mailing list
[hidden email]
http://lists.boost.org/mailman/listinfo.cgi/geometry
Reply | Threaded
Open this post in threaded view
|  
Report Content as Inappropriate

Re: 1.57 Storing indices as values to save space.

Adam Wulkiewicz
Hi Tim,

Tim Finer wrote:

> On 11/26/14, 11:49 AM, Adam Wulkiewicz wrote:
>> Tim Finer wrote:
>>> Actually, I think what rtree is doing with the temporary vector in
>>> the packing constructor should be considered a bug.
>>>
>>> That problem is that the vector is created without regard to the
>>> rtree's allocator - so even if a user gives the rtree a memory
>>> mapped file allocator, the rtree creates a copy of the incoming data
>>> in memory, ignoring the allocator.
>>>
>>
>> Not really a copy of incoming data. A calculated centroids and
>> iterators to the original data.
> Ah I see.  IndexableGetter must be the source for the centroid though,
> right?  Can you use the IndexableGetter to calculate and sort on the
> centroid without allocating a temporary vector of them?
>

I see no other way of effectively sorting a range of elements without a
temporary container, in general. Do you?

Instead of creating a container for all elements there could be M heaps
for elements.size() / M but this wouldn't solve anything.

FYI, actually elements aren't sorted. std::nth_element() is called M
times for each tree level.
M = MAX-1.

>>
>> Yes, rtree's allocator could be passed to the vector (as mentioned in
>> the other email). Then iterators shouldn't be stored since they might
>> be not suitable for storing in shared memory. So probably a copy of
>> actual objects should be made here. And this could be done only if an
>> allocator was different than std::allocator<> so at least in this
>> case it wouldn't harm the performance. Also in the case of passing
>> RandomAccessIterators elements indexes could be stored instead of
>> iterators.
> Except I think the word "should" belongs there instead of "could". :)
> The whole point of passing in an allocator is for naught if the class
> selectively uses it - it doesn't matter about shared memory allocators
> or not.  All of the classes that I'm familiar with that use an
> allocator use it completely (like the stdlib containers).
>

For (all) temporary containers? In STL? Could you give some example?

> Step back from your knowledge of the implementation and think about
> this as a user of the rtree.  Would you really expect the rtree to
> allocate space from the heap, after giving it an allocator?
>
> The rtree shouldn't second guess the intentions of the users here and
> allocate memory from the heap for a temporary structure. There are
> plenty of important uses cases to use other kinds of allocators that
> don't have anything to do with shared or mapped memory: debugging
> allocators, small object allocators, memory pools, etc. What makes
> this a bug is that it is a surprise.  It probably would have gone
> unnoticed by me but for the operating system killing my testbed.  :)

It definietly requires more thinking. In general, should all temporary
containers be using passed allocator? Also some temporary containers
used during nodes splitting (they're smaller). I think not but I may be
wrong.
The problem with this one is that it can be very big so probably should
be handled differently.

>
> The need for this heap allocated vector appears to be packing related,
> would it make sense to expose the packing algorithm as a free function
> like the std::transform in <algorithm> instead? It could even take a
> callable object, just like transform does (it'd be nice to reuse the
> IndexableGetter there, if that's what it needs).
>
> rtree::pack( srcBegin, srcEnd, packedBegin, packedEnd );
>
> where srcBegin and srcEnd are random access, const_iterators and
> std::distance(srcBegin,srcEnd) == std::distance(packedBegin, packedEnd).
>
> Or an output iterator:
> rtree::pack<Type, Type, Type>( srcBegin, srcEnd, packedIter );
>
> Then the user would be responsible for allocating the packed container
> and the resultant packing what would be passed in as a ctor (or a
> factory method that has some sort of name that says "packed"):
>
> auto rtree =
> bgi::rtree<ValueType,Partitioner,Indexer>make_packed_rtree(packedBegin, packedEnd);
>
>
>
> // Or, the constructor
> auto rtree = bgi::rtree<ValueType,Partitioner,Indexer>(packedBegin,
> packedEnd);
>

Isn't it the same how it's implemented?

But sure, additional method responsible for packing, allowing to
explicitly express that packing is desired could also be added. But I'm
thinking about non-static member function, similar to assign().

> That gives the flexibility of allocation without compromising the API.
> I think it is cleaner to make a bulk loaded rtree construction more
> explicit (vs. insert) and it removes the "surprise" of allocation that
> packing requires.
>

Then do you also propose to change the constructor?

>>
>> But would it solve your problem?
>
> Yes.  My test of 50,000,000 points died because it ran out of memory
> (this is a "small" test).  It was a big surprise to see that the rtree
> needed to allocate outside of the allocator and not use the functor to
> derive what it needed.  I'm not focused on persistence, that was a
> means to an end.  My overall goal is to give my users the query
> ability of data too large to fit in memory.  If the rtree only stored
> the ValueType as implied in the documentation and only used the
> allocater, as implied by the API...
>>
>> Alterntively a second allocator could be passed for this temporary
>> container, so a second, temporary file could be used. This would
>> probably be the most convenient for you, am I right? However this
>> would require to complicate the interface (constructors) so I don't
>> like it.
> I agree that putting yet another allocator in the constructor makes it
> even more complex than it should be.  Does the rtree really need to
> keep the centroids of all of the data while it is packing? Couldn't
> those be derived from calling the passed in IndexableGetter?  If not,
> then maybe getting the centroid is fundamental enough to deserve its
> own functor, or a some other way to model that supports indexing,
> finding the centroid and equality?
>

Of course it is "derived", first IndexableGetter is used, then a
centroid of an indexable is calculated. Do you expect to do it every
time two values need to be compared during sorting/nth_element?
Furthermore it doesn't solve the problem because you still can have so
many values that a container of indexes will be too big.

> If the packing algorithm really needs the temporary storage, then I
> think that the constructor is doing too much and the functionality
> that is packing should be made a free function and the rtree's
> constructor made simpler.
>

So I guess this is an answer for the above question, how would you
propose to change the ctor. I fear that users already using it and
expecting packing could dissagree.

>>
>> We should definietly think about it when implementing a "proper"
>> persistance support.
> Yeah, persistence support is another thing entirely.  I'm talking
> about the happy case of memory based allocation right now - rtree is
> not honoring its API contract with regard to allocators.  The harder
> problem of persistence is another story, but yes, that would be grand
> to have all that too!
>
> I'll take a look to see if I can't figure out if any of my ideas make
> any sense.  I don't know the domain well, but I do have a lot of C++
> experience.  I'll let you know either way.

Ok sure, all suggestions for making the library better are appreciated.

Currently I'm concerned the most about the allocator for this big
temporary container. I'm wondering if there is some kind of good practice.

Regards,
Adam
_______________________________________________
Geometry mailing list
[hidden email]
http://lists.boost.org/mailman/listinfo.cgi/geometry
Loading...