Spatial Indexes: build and knn query performance for bulk loading/packing vs. linear balancing

classic Classic list List threaded Threaded
8 messages Options
Reply | Threaded
Open this post in threaded view
|

Spatial Indexes: build and knn query performance for bulk loading/packing vs. linear balancing

Mikkel B. Stegmann
Hi All,

I'm having a hard time finding all of the expected benefits of the, otherwise very promising, bulk loading/packing feature added to the Boost v1.55beta.

For a synthetic setup of 1 million points placed in a square uniform grid I see the following characteristics for bulk loading vs. linear balancing (the query was knn; k==1):

Build time: 1.1 / 0.4 s
Query time: 4 / 5 us
Tree size: 12 / 30 MB

And for 100 million points:

Build time: 397 / 52 s
Query time: 9 / 7 us
Tree size: 1233 / 3046 MB

The environment was a 64-bit compile using gcc 4.4.7 running Ubuntu 12.04.

Although I would appreciate benefits w.r.t. query time similar to those illustrated for boxes in [1], this remains a secondary concern. (However, as the tree-size is markedly reduced I would assume that the improved tree structure would translate to improved measurements. But that might be related to the very simple spatial structure used for this synthetic benchmark, i.e. points in a uniform grid.)

My main concern is the marked increase in build time, which again is conflicting with the boxes case in [1]. As my usage really isn't dynamic, the bulk loading feature should be very attractive. But at this point, I'm not sure that I want to leverage the improvements in storage, at the expense of a dramatic increase in build time.

Do any of you have an idea of what is in play here?

(Graphs and other grid sizes are detailed in the attached PDFs for both cases.)

Best regards,
Mikkel B. Stegmann

[1] http://www.boost.org/doc/libs/1_54_0/libs/geometry/doc/html/geometry/spatial_indexes/introduction.html

Results_-_linear_16,_bulk.pdf
Results_-_linear_16.pdf
Reply | Threaded
Open this post in threaded view
|

Re: Spatial Indexes: build and knn query performance for bulk loading/packing vs. linear balancing

Mateusz Loskot
Administrator
On 8 November 2013 08:55, Mikkel B. Stegmann
<[hidden email]> wrote:
> [...]

Mikkel,

I'm not answering the core issue, just one note to your PDFs.

The bulk loading is titled |linear 16,bulk".
AFAIK, the bulk loading algorithm in Boost.Geometry has nothing to do
with the linear,
they are two different algorithms. You either use linear (with related
parameters)
or you use bulk loading which is a custom algorithm based on, AFAIR,
Split-Tile-Recurse  (STR) and Top-down-Greedy-Split (TGS) and Adam's genius :-)

> [1]
> http://www.boost.org/doc/libs/1_54_0/libs/geometry/doc/html/geometry/spatial_indexes/introduction.html
> <http://www.boost.org/doc/libs/1_54_0/libs/geometry/doc/html/geometry/spatial_indexes/introduction.html>
>
> Results_-_linear_16,_bulk.pdf
> <http://boost-geometry.203548.n3.nabble.com/file/n4025741/Results_-_linear_16%2C_bulk.pdf>
> Results_-_linear_16.pdf
> <http://boost-geometry.203548.n3.nabble.com/file/n4025741/Results_-_linear_16.pdf>

I wonder, can you share the code of your benchmark?
I've been developing one too (with Adam's help)
https://github.com/mloskot/spatial_index_benchmark
I'd be curious to compare the results, improve my implementation, etc.

Best regards,
--
Mateusz  Loskot, http://mateusz.loskot.net
_______________________________________________
Geometry mailing list
[hidden email]
http://lists.boost.org/mailman/listinfo.cgi/geometry
--
Mateusz Loskot
http://mateusz.loskot.net
Reply | Threaded
Open this post in threaded view
|

Re: Spatial Indexes: build and knn query performance for bulk loading/packing vs. linear balancing

Mikkel B. Stegmann
Hi Mateusz,

ah, I see. And agree. My labeling might be confusing. It solely relates to the template parameters given to the tree type that I have used for both experiments in my mock-up test code:

Type definition:

   typedef boost::geometry::index::rtree< const T*, boost::geometry::index::linear<16> > TreeType;
   boost::shared_ptr<TreeType> rTree_;

Bulk loading usage:

   rTree_ = boost::shared_ptr<TreeType>(new TreeType(elements.begin(), elements.end()));

Linear balancing usage:

   rTree_ = boost::shared_ptr<TreeType>(new TreeType());
   rTree_->insert(elements.begin(), elements.end());

Best regards,
Mikkel
Reply | Threaded
Open this post in threaded view
|

Re: Spatial Indexes: build and knn query performance for bulk loading/packing vs. linear balancing

Mikkel B. Stegmann
Oh, and thanks for the link to your benchmarks. They seem like a valuable resource that I will consult right away.

Best regards,
Mikkel
Reply | Threaded
Open this post in threaded view
|

Re: Spatial Indexes: build and knn query performance for bulk loading/packing vs. linear balancing

Adam Wulkiewicz
In reply to this post by Mikkel B. Stegmann
Hi Mikkel,

Mikkel B. Stegmann wrote:

> Hi Mateusz,
>
> ah, I see. And agree. My labeling might be confusing. It solely relates to
> the template parameters given to the tree type that I have used for both
> experiments in my mock-up test code:
>
> Type definition:
>
>     typedef boost::geometry::index::rtree< const T*,
> boost::geometry::index::linear<16> > TreeType;
>     boost::shared_ptr<TreeType> rTree_;
>
> Bulk loading usage:
>
>     rTree_ = boost::shared_ptr<TreeType>(new TreeType(elements.begin(),
> elements.end()));
>
> Linear balancing usage:
>
>     rTree_ = boost::shared_ptr<TreeType>(new TreeType());
>     rTree_->insert(elements.begin(), elements.end());
>

Thanks for the results of your benchmarks.

I've just tested 2d points using our benchmark and got different
results. The building is indeed slightly slower but not as much as in
your case (>2x). Furthermore, in my case knn query is 4x faster and box
query is 8x faster.

In case you'd like to run my benchmark on your machine. I'm using
libs/geometry/index/example/benchmark_experimental.cpp. Unfortunately I
was forced to modify it:
1. For points comment #define BOOST_GEOMETRY_INDEX_DETAIL_EXPERIMENTAL
2. Copy knn test to perform it also for bulk loading

If you're unable to send the code of your benchmark maybe you could
answer the following questions:

Probably the most important ones are:
What your data looks like exactly?
E.g. do you generate points row by row in a uniform grid with a constant
step?
Could you try to randomize points in the grid's area, instead of
generating a uniform grid?

But I'm also curious about the following if the above doesn't change
anything:
What is your T?
How big it is?
How do you access coordinates?
What do your indexable_getter looks like?
Could you store your points directly, not by use of pointer?

Regards,
Adam


_______________________________________________
Geometry mailing list
[hidden email]
http://lists.boost.org/mailman/listinfo.cgi/geometry
Reply | Threaded
Open this post in threaded view
|

Re: Spatial Indexes: build and knn query performance for bulk loading/packing vs. linear balancing

Mikkel B. Stegmann
Hi Adam,

thanks for your input. I've now trimmed the test and converted it back to using actual boost 2d point instances. The build time differences remain the same. See pdf's and code below for details (the "SpatialIndexTrimmedUtilities.h" include is solely for the time and memory measurements done in the brief test code).

One thing worth noticing -- and this is entirely my bad -- the search times are improved compared to earlier. I (wrongfully) assumed that a run time of 5-10 us was fine for a single search test, as my timings were highly reproducible. However, there seems to be some initial cost of doing the first search, as the search time is much lower when looking at an average across 100 searches (which I do now). The difference between bulk/non-bulk however, remains subtle.

As the building times are my main concern, I will try to add some jitter to my uniform points and redo the test. (But, the problem is that my actual application is not that far from a fairly regular grid.)

Best regards,
Mikkel

01_131111_Results_-_linear_16.pdf
02_131111_Results_-_linear_16,_bulk.pdf
SpatialIndexTrimmedTest.cpp
Reply | Threaded
Open this post in threaded view
|

Re: Spatial Indexes: build and knn query performance for bulk loading/packing vs. linear balancing

Adam Wulkiewicz
Hi Mikkel,

Mikkel B. Stegmann wrote:

> Hi Adam,
>
> thanks for your input. I've now trimmed the test and converted it back to
> using actual boost 2d point instances. The build time differences remain the
> same. See pdf's and code below for details (the
> "SpatialIndexTrimmedUtilities.h" include is solely for the time and memory
> measurements done in the brief test code).
>
> One thing worth noticing -- and this is entirely my bad -- the search times
> are improved compared to earlier. I (wrongfully) assumed that a run time of
> 5-10 us was fine for a single search test, as my timings were highly
> reproducible. However, there seems to be some initial cost of doing the
> first search, as the search time is much lower when looking at an average
> across 100 searches (which I do now). The difference between bulk/non-bulk
> however, remains subtle.
>
> As the building times are my main concern, I will try to add some jitter to
> my uniform points and redo the test. (But, the problem is that my actual
> application is not that far from a fairly regular grid.)
>

Yes, in your specific case the linear algorithm may be the best.

But to be honest the R-tree probably isn't the best spatial index type
in your case. If your points are generated uniformly (more or less) e.g.
one point per cell of some grid then you may predict exactly which
points must be checked. Maybe you don't need a spatial index at all?
Also you may create the spatial index using the info about all points
(you musn't insert or remove them later). I'd choose a different
datastructure in this case. However I understand that you don't want to
reinvent the wheel and just use some existing library.

You could consider implementing some simple spatial index. Which one?
This strongly depends on your needs.

If points are distributed across some grid and fill all cells, you might
store them in a Regular Grid, e.g. using vector as a storage (or vector
of vectors if you'd like to store more than one point in each cell) and
directly extract points from the area of interest.

If there are some empty areas where there are no points (and you'd like
to save some space) you might store them in a vector and sort them
according to the index of a cell on a Space Filling Curve (e.g.
http://en.wikipedia.org/wiki/Z-order_curve) which maps the n-d
coordinates to 1-d, then just find your points in this sorted order.

You could also go for KD-Tree classic variant in which points are stored
as nodes partitioning the space. This binary tree may be stored in e.g.
a vector where the node's children are stored at indexes 2*n and 2*n+1.
Since your points are uniformly distributed I'd go for a simplest
splitting algorithm using object median (std::nth_element) splitting
objects on a half (for next dimension) for each new level of nodes. The
idea is very simple (divide and conquer) and it would be easy to implement.

The choose also depends on what you need to get as an output. E.g. knn
queries are very simple to implement using KD-tree.

Regards,
Adam
_______________________________________________
Geometry mailing list
[hidden email]
http://lists.boost.org/mailman/listinfo.cgi/geometry
Reply | Threaded
Open this post in threaded view
|

Re: Spatial Indexes: build and knn query performance for bulk loading/packing vs. linear balancing

Mikkel B. Stegmann
Hi Adam,

thanks a lot for your valuable and detailed input (in particular the Morton code). Actually, R-trees weren't my first choice. But when I dived into the Boost documentation I quickly became convinced that this is way to go for now. My first first choice was in fact to build a simple spatial partitioning myself, and it surely would have been fun :)

But the spatial indexes offers readily available support for rectangles, in addition to various query features and future-proofing (for example w.r.t. less homogenous point distributions) way beyond what I personally could warrant to put into the project.

So in summary, my future needs extends well beyond knn-query of fairly regular points. I'll stick to linear balancing for now, and re-evaluate the facilities that spatial indexes provide, when I progress to more complex problems. Thanks for educating me on this matter.

Best regards,
Mikkel