Skip to content

sklearn.neighbors.KDTree complexity for building is not O(n(k+log(n)) #7687

@MarDiehl

Description

@MarDiehl

Description

Building a kd-Tree can be done in O(n(k+log(n)) time and should (to my knowledge) not depent on the details of the data. However, the KDTree implementation in scikit-learn shows a really poor scaling behavior for my data. I cannot produce this behavior with data generated by sklearn.datasets.samples_generator.make_blobs

Steps/Code to Reproduce

download numpy data (search.npy) from https://webshare.mpie.de/index.php?6b4495f7e7 and run the following code on python 3

import numpy as np
from sklearn.neighbors import NearestNeighbors
from sklearn.datasets.samples_generator import make_blobs
from scipy.spatial     import KDTree as KDTree_Scipy
from sklearn.neighbors import KDTree as KDTree_Sklearn
from time import sleep, perf_counter as pc

search_raw_artificial, dev_null = make_blobs(n_samples=6830160,n_features=5,centers=1000,cluster_std=0.4,random_state=0)
search_raw_real                 = np.load('search.npy')

N = [240000,2400000,4800000,6000000]

for search_raw in [search_raw_artificial,search_raw_real]:
  for i,n in enumerate(N):

    search = search_raw[0:n,:] 
    print('data shape',search.shape)
    print('delta',np.amax(search,0)-np.amin(search,0))
    t0 = pc()
    neigh = NearestNeighbors(algorithm='ball_tree').fit(search)
    print('sklearn.neighbors (ball_tree) build finished in {}s'.format(pc()-t0))
    t0 = pc()
    neigh = NearestNeighbors(algorithm='kd_tree').fit(search)
    print('  sklearn.neighbors (kd_tree) build finished in {}s'.format(pc()-t0))
    t0 = pc()
    neigh = KDTree_Sklearn(search)
    print('    sklearn.neighbors KD tree build finished in {}s'.format(pc()-t0))
    t0 = pc()
    neigh = KDTree_Scipy(search)
    print('    scipy.spatial     KD tree build finished in {}s'.format(pc()-t0))
    print('')

Expected Results

Time complexity scaling of scikit-learn KDTree should be similar to scaling of scipy.spatial KDTree

Actual Results

Artificial Data, scikit-learn outperforms scipy.spatial

data shape (240000, 5)
delta [ 22.7311549 22.61482157 22.57353059 22.65385101 22.77163478]
sklearn.neighbors (ball_tree) build finished in 0.1524970519822091s
sklearn.neighbors (kd_tree) build finished in 0.17296032601734623s
sklearn.neighbors KD tree build finished in 0.172917598974891s
scipy.spatial KD tree build finished in 2.265735782973934s

data shape (2400000, 5)
delta [ 23.38025743 23.22174801 22.88042798 22.8831237 23.31696732]
sklearn.neighbors (ball_tree) build finished in 3.462802237016149s
sklearn.neighbors (kd_tree) build finished in 3.7110973289818503s
sklearn.neighbors KD tree build finished in 3.5682168990024365s
scipy.spatial KD tree build finished in 19.92274082399672s

data shape (4800000, 5)
delta [ 23.38025743 23.26302877 23.22210673 22.97866792 23.31696732]
sklearn.neighbors (ball_tree) build finished in 8.922708058031276s
sklearn.neighbors (kd_tree) build finished in 9.238389031030238s
sklearn.neighbors KD tree build finished in 8.879073369025718s
scipy.spatial KD tree build finished in 38.43681587401079s

data shape (6000000, 5)
delta [ 23.42236957 23.26302877 23.22210673 23.20207953 23.31696732]
sklearn.neighbors (ball_tree) build finished in 12.75000820402056s
sklearn.neighbors (kd_tree) build finished in 13.30022174998885s
sklearn.neighbors KD tree build finished in 12.794657755992375s
scipy.spatial KD tree build finished in 48.33784791099606s

My Data, scipy.spatial outperforms scikit-learn for increasing data size

data shape (240000, 5)
delta [ 2.14487407 2.14472508 2.14499087 8.86612151 0.15491879]
sklearn.neighbors (ball_tree) build finished in 0.39374090504134074s
sklearn.neighbors (kd_tree) build finished in 0.21525143302278593s
sklearn.neighbors KD tree build finished in 0.21449304796988145s
scipy.spatial KD tree build finished in 2.244567967019975s

data shape (2400000, 5)
delta [ 2.14502773 2.14502543 2.14502904 8.86612151 1.59685522]
sklearn.neighbors (ball_tree) build finished in 4.199425678991247s
sklearn.neighbors (kd_tree) build finished in 4.40237572795013s
sklearn.neighbors KD tree build finished in 4.295626600971445s
scipy.spatial KD tree build finished in 26.322200270951726s

data shape (4800000, 5)
delta [ 2.14502773 2.14502864 2.14502904 8.86612151 3.19371044]
sklearn.neighbors (ball_tree) build finished in 110.31694995303405s
sklearn.neighbors (kd_tree) build finished in 112.8703724470106s
sklearn.neighbors KD tree build finished in 114.07325625402154s
scipy.spatial KD tree build finished in 51.79352715797722s

data shape (6000000, 5)
delta [ 2.14502838 2.14502902 2.14502914 8.86612151 3.99213804]
sklearn.neighbors (ball_tree) build finished in 2458.668528069975s
sklearn.neighbors (kd_tree) build finished in 2451.2438263060176s
sklearn.neighbors KD tree build finished in 2801.8054143560003s
scipy.spatial KD tree build finished in 62.066240190993994s

Comments

cKDTree from scipy.spatial behaves even better
I cannot use cKDTree/KDTree from scipy.spatial because calculating a sparse distance matrix (sparse_distance_matrix function) is extremely slow compared to neighbors.radius_neighbors_graph/neighbors.kneighbors_graph and I need a sparse distance matrix for DBSCAN on large datasets (n_samples >10 mio) with low dimensionality (n_features = 5 or 6)

Versions

Linux-4.7.6-1-ARCH-x86_64-with-arch
Python 3.5.2 (default, Jun 28 2016, 08:46:01) [GCC 6.1.1 20160602]
NumPy 1.11.2
SciPy 0.18.1
Scikit-Learn 0.18

Metadata

Metadata

Assignees

No one assigned

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions