
Persistent Homology on high-dimensional datasets

Tommaso Salvatori
 

Hi!
I'm new to Dionysus (and to coding in general), so this question may be trivial. I need to compute persistent homology for a high-dimensional dataset (d ~ 1000) embedded in a vector space. I only need the barcode diagrams for H0 and H1, but I don't know whether that is feasible with such high-dimensional data.
So the question is: can I do it?
If yes, how should I do it, and which function should I use?

Having something I can just copy and paste would be amazing.

Thank you very much!

Dmitriy Morozov
 

Vietoris-Rips complexes would work in high ambient dimension. Your bigger constraint would be the number of points. Here they are described in the documentation:



Tommaso Salvatori
 

Do you mean that, in order to get a nice result, the number of points has to be much greater than d? I have 10,000 to 20,000 points in my dataset (two classes of the MNIST dataset). Would that work?
Or would you suggest that I reduce the dimension in some way (PCA, Mapper)?

Thank you very very much for the answer!

Kowshik Thopalli
 

I think what he meant is that it is generally not the dimension that is the problem but the number of samples. Even if your dimension is high, you can precompute the pairwise distance matrix using a faster approach, such as the one in scipy, and pass that.
In your case it is the 20,000 points that are the problem, not the 1000 dimensions. So there is no need for dimensionality reduction as such, although it doesn't hurt.
Look at example 2 here to see how to pass a precomputed distance matrix: http://mrzv.org/software/dionysus2/tutorial/rips.html




Dmitriy Morozov
 

Just to follow up, Kowshik is right: I was talking about the fact that the size of these complexes depends on the number of points, and 20,000 may be too many. I should point out that Uli Bauer has software called Ripser that is very fast at computing persistence for Vietoris-Rips complexes, and it may be a good choice in your case. (I plan to integrate Ripser into Dionysus over the winter break, but for now it's not there.)

Two more points. In Dionysus 2, there should be no benefit to precomputing the distance matrix. If there is, I'd like to know about it. Reducing dimensionality may be a good idea, independently of everything said so far: it would help cope with noise. (And of course in lower dimension the distance computation is faster.)
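
To make the dependence on the number of points concrete, here is a small standard-library sketch that counts the worst-case number of simplices in a Rips complex truncated at triangles (what H0 and H1 need). The helper name `rips_simplex_upper_bound` is made up for this illustration. The bound is reached only when the distance cutoff is large enough to include every simplex; a smaller cutoff gives fewer.

```python
from math import comb

def rips_simplex_upper_bound(n_points, max_dim=2):
    """Worst-case count of simplices of dimension <= max_dim on n_points vertices."""
    # A k-simplex is a choice of k + 1 vertices.
    return sum(comb(n_points, k + 1) for k in range(max_dim + 1))

for n in (100, 1_000, 20_000):
    print(f"{n:>6} points: at most {rips_simplex_upper_bound(n):,} simplices")
```

The ambient dimension d does not appear anywhere in the count. For 20,000 points the bound is about 1.3 × 10^12 simplices, which is why subsampling or a tight cutoff matters far more than reducing d.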


Tommaso Salvatori
 

Ok, I think I have managed to do everything. 
Once I have the filtration, how can I compute and print the persistence barcode/diagram?

Thank you very much again

Kowshik Thopalli
 

Thank you, Dmitriy, for the input. I have never compared the running time with a precomputed distance matrix against letting Dionysus compute the distances itself.

Tommaso,
To compute the persistence diagram from the filtration and plot it, see http://mrzv.org/software/dionysus2/tutorial/plotting.html

In general, here are the basics. I strongly suggest you go through them and the other areas of the tutorial.
They are well documented.
