-
-
Notifications
You must be signed in to change notification settings - Fork 26.1k
Description
Description
Building a kd-Tree can be done in O(n(k+log(n)) time and should (to my knowledge) not depent on the details of the data. However, the KDTree implementation in scikit-learn shows a really poor scaling behavior for my data. I cannot produce this behavior with data generated by sklearn.datasets.samples_generator.make_blobs
Steps/Code to Reproduce
download numpy data (search.npy) from https://webshare.mpie.de/index.php?6b4495f7e7 and run the following code on python 3
import numpy as np
from sklearn.neighbors import NearestNeighbors
from sklearn.datasets.samples_generator import make_blobs
from scipy.spatial import KDTree as KDTree_Scipy
from sklearn.neighbors import KDTree as KDTree_Sklearn
from time import sleep, perf_counter as pc
search_raw_artificial, dev_null = make_blobs(n_samples=6830160,n_features=5,centers=1000,cluster_std=0.4,random_state=0)
search_raw_real = np.load('search.npy')
N = [240000,2400000,4800000,6000000]
for search_raw in [search_raw_artificial,search_raw_real]:
for i,n in enumerate(N):
search = search_raw[0:n,:]
print('data shape',search.shape)
print('delta',np.amax(search,0)-np.amin(search,0))
t0 = pc()
neigh = NearestNeighbors(algorithm='ball_tree').fit(search)
print('sklearn.neighbors (ball_tree) build finished in {}s'.format(pc()-t0))
t0 = pc()
neigh = NearestNeighbors(algorithm='kd_tree').fit(search)
print(' sklearn.neighbors (kd_tree) build finished in {}s'.format(pc()-t0))
t0 = pc()
neigh = KDTree_Sklearn(search)
print(' sklearn.neighbors KD tree build finished in {}s'.format(pc()-t0))
t0 = pc()
neigh = KDTree_Scipy(search)
print(' scipy.spatial KD tree build finished in {}s'.format(pc()-t0))
print('')
Expected Results
Time complexity scaling of scikit-learn KDTree should be similar to scaling of scipy.spatial KDTree
Actual Results
Artificial Data, scikit-learn outperforms scipy.spatial
data shape (240000, 5)
delta [ 22.7311549 22.61482157 22.57353059 22.65385101 22.77163478]
sklearn.neighbors (ball_tree) build finished in 0.1524970519822091s
sklearn.neighbors (kd_tree) build finished in 0.17296032601734623s
sklearn.neighbors KD tree build finished in 0.172917598974891s
scipy.spatial KD tree build finished in 2.265735782973934s
data shape (2400000, 5)
delta [ 23.38025743 23.22174801 22.88042798 22.8831237 23.31696732]
sklearn.neighbors (ball_tree) build finished in 3.462802237016149s
sklearn.neighbors (kd_tree) build finished in 3.7110973289818503s
sklearn.neighbors KD tree build finished in 3.5682168990024365s
scipy.spatial KD tree build finished in 19.92274082399672s
data shape (4800000, 5)
delta [ 23.38025743 23.26302877 23.22210673 22.97866792 23.31696732]
sklearn.neighbors (ball_tree) build finished in 8.922708058031276s
sklearn.neighbors (kd_tree) build finished in 9.238389031030238s
sklearn.neighbors KD tree build finished in 8.879073369025718s
scipy.spatial KD tree build finished in 38.43681587401079s
data shape (6000000, 5)
delta [ 23.42236957 23.26302877 23.22210673 23.20207953 23.31696732]
sklearn.neighbors (ball_tree) build finished in 12.75000820402056s
sklearn.neighbors (kd_tree) build finished in 13.30022174998885s
sklearn.neighbors KD tree build finished in 12.794657755992375s
scipy.spatial KD tree build finished in 48.33784791099606s
My Data, scipy.spatial outperforms scikit-learn for increasing data size
data shape (240000, 5)
delta [ 2.14487407 2.14472508 2.14499087 8.86612151 0.15491879]
sklearn.neighbors (ball_tree) build finished in 0.39374090504134074s
sklearn.neighbors (kd_tree) build finished in 0.21525143302278593s
sklearn.neighbors KD tree build finished in 0.21449304796988145s
scipy.spatial KD tree build finished in 2.244567967019975s
data shape (2400000, 5)
delta [ 2.14502773 2.14502543 2.14502904 8.86612151 1.59685522]
sklearn.neighbors (ball_tree) build finished in 4.199425678991247s
sklearn.neighbors (kd_tree) build finished in 4.40237572795013s
sklearn.neighbors KD tree build finished in 4.295626600971445s
scipy.spatial KD tree build finished in 26.322200270951726s
data shape (4800000, 5)
delta [ 2.14502773 2.14502864 2.14502904 8.86612151 3.19371044]
sklearn.neighbors (ball_tree) build finished in 110.31694995303405s
sklearn.neighbors (kd_tree) build finished in 112.8703724470106s
sklearn.neighbors KD tree build finished in 114.07325625402154s
scipy.spatial KD tree build finished in 51.79352715797722s
data shape (6000000, 5)
delta [ 2.14502838 2.14502902 2.14502914 8.86612151 3.99213804]
sklearn.neighbors (ball_tree) build finished in 2458.668528069975s
sklearn.neighbors (kd_tree) build finished in 2451.2438263060176s
sklearn.neighbors KD tree build finished in 2801.8054143560003s
scipy.spatial KD tree build finished in 62.066240190993994s
Comments
cKDTree from scipy.spatial behaves even better
I cannot use cKDTree/KDTree from scipy.spatial because calculating a sparse distance matrix (sparse_distance_matrix function) is extremely slow compared to neighbors.radius_neighbors_graph/neighbors.kneighbors_graph and I need a sparse distance matrix for DBSCAN on large datasets (n_samples >10 mio) with low dimensionality (n_features = 5 or 6)
Versions
Linux-4.7.6-1-ARCH-x86_64-with-arch
Python 3.5.2 (default, Jun 28 2016, 08:46:01) [GCC 6.1.1 20160602]
NumPy 1.11.2
SciPy 0.18.1
Scikit-Learn 0.18