t-SNE
Guide to t-SNE: [https://distill.pub/2016/misread-tsne/]
openTSNE library: [https://opentsne.readthedocs.io/en/stable/examples/04_large_data_sets/04_large_data_sets.html]
Analyzing cluster structure
PCA preserves global structure better than t-SNE: the large-scale arrangement of big groups. t-SNE instead preserves local structure, i.e. small neighborhoods, because the similarities it optimizes are computed only over each point's nearest neighbors (at a scale set by the perplexity).
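A minimal sketch of the contrast, assuming scikit-learn is installed (the dataset and parameters here are just illustrative):

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE

# Synthetic data: 4 well-separated groups in 50 dimensions.
X, y = make_blobs(n_samples=200, n_features=50, centers=4, random_state=0)

# PCA: linear projection onto the directions of largest variance, so
# large-scale (global) distances between groups tend to survive.
X_pca = PCA(n_components=2).fit_transform(X)

# t-SNE: matches neighbor probabilities, so small neighborhoods stay
# tight, but inter-cluster distances in the 2D map are not meaningful.
X_tsne = TSNE(n_components=2, perplexity=30, random_state=0).fit_transform(X)

print(X_pca.shape, X_tsne.shape)  # both (200, 2)
```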
On why we don't use it for clustering:
"There are reasons why t-sne is not used as a clustering algorithm.
First, as you point out yourself, that t-sne does not generate any cluster assignments. Instead, it performs dimensionality reduction, embedding the data into a low dimensional space that is easy to visualize. You could, of course, use a standard clustering algorithm such as k-means on this embedding to get clusters. However, if the clusters exist in the data, you should not need to map it to 2D first.
Secondly, and this is quite crucial, t-sne may create embedding containing clusters that don't really exist in the real data. Also, it may disregard clusters that do exist in the real data. Depending on the randomness in the algorithm and chosen hyperparameters, you may get very different results. I recommend reading the article How to Use t-SNE Effectively discussing some of these unexpected scenarios. Another, related issue is reproducibility of t-sne results.
Finally, if there are real clusters in your data, you should be able to find them using standard, well understood clustering algorithms. Using methods that people understand gives way more credibility to your results and makes it easier to interpret them." [https://stats.stackexchange.com/questions/447236/can-t-sne-be-directly-used-as-a-clustering-algorithm]
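The quote's point about clustering on the embedding versus the original data can be sketched as follows (a toy comparison, assuming scikit-learn; the dataset and cluster count are arbitrary choices, and the t-SNE result will vary with the seed):

```python
from sklearn.datasets import load_digits
from sklearn.cluster import KMeans
from sklearn.manifold import TSNE
from sklearn.metrics import adjusted_rand_score

X, y = load_digits(return_X_y=True)
X, y = X[:500], y[:500]

# Cluster in the original 64-dimensional space.
labels_orig = KMeans(n_clusters=10, n_init=10, random_state=0).fit_predict(X)

# Cluster on a 2D t-SNE embedding: tempting, but the embedding can
# exaggerate or invent cluster structure, and it changes with the seed.
emb = TSNE(n_components=2, random_state=0).fit_transform(X)
labels_emb = KMeans(n_clusters=10, n_init=10, random_state=0).fit_predict(emb)

# Compare both partitions against the true digit labels.
print(adjusted_rand_score(y, labels_orig))
print(adjusted_rand_score(y, labels_emb))
```

Whichever score is higher on one run, the embedding-based one is the less trustworthy of the two, for the reproducibility reasons given in the quote.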
What we can do instead is use t-SNE to visualize the features captured by our model and visually inspect how well classes separate in the learned representation, e.g. for image classification.
To do that, we extract intermediate NN activations — n-dimensional vectors — and project them to 2D with t-SNE. For example, the last activations before the classification layer give insight into the network's learned representation.
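A small end-to-end sketch of this, assuming scikit-learn (a tiny MLP stands in for the network; with a deep-learning framework you would grab the penultimate layer's output via a hook instead):

```python
import numpy as np
from sklearn.datasets import load_digits
from sklearn.neural_network import MLPClassifier
from sklearn.manifold import TSNE

X, y = load_digits(return_X_y=True)
X, y = X[:500], y[:500]

# Small classifier; the 32-unit hidden layer is the "last activation
# before the classification layer" whose representation we inspect.
clf = MLPClassifier(hidden_layer_sizes=(32,), max_iter=300, random_state=0)
clf.fit(X, y)

# sklearn's MLP does not expose intermediate activations directly, so we
# recompute the hidden layer by hand from the learned weights (ReLU is
# the default activation).
hidden = np.maximum(0, X @ clf.coefs_[0] + clf.intercepts_[0])

# Project the 32-d learned representation down to 2D for visual inspection
# (e.g. scatter-plot emb colored by y).
emb = TSNE(n_components=2, random_state=0).fit_transform(hidden)
print(emb.shape)
```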
Running t-SNE or UMAP on GPU
To get the most out of our machines, it is usually much faster to run these algorithms on a GPU. One option is the RAPIDS AI library, running on Ubuntu 22.04 under WSL2 on Windows 10, following the guide written in Running jupyter or IDE on WSL2 .
RAPIDS contains cuML, which mirrors the scikit-learn API, reproducing most of its algorithms with CUDA implementations. This includes t-SNE and UMAP as well as PCA for dimensionality reduction. Random forests and XGBoost can also be trained on the GPU.
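Because cuML mirrors the scikit-learn API, CPU code ports to the GPU by swapping imports. A sketch with scikit-learn (the commented cuML import paths are taken from the RAPIDS docs and assume a working CUDA install):

```python
# GPU drop-in, assuming RAPIDS/cuML is installed:
#   from cuml.decomposition import PCA
#   from cuml.manifold import TSNE
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE
from sklearn.datasets import load_digits

X, _ = load_digits(return_X_y=True)
X = X[:300]

# Common pipeline: PCA pre-reduction, then t-SNE down to 2D.
X_50 = PCA(n_components=50).fit_transform(X)
X_2d = TSNE(n_components=2, random_state=0).fit_transform(X_50)
print(X_2d.shape)  # (300, 2)
```

The rest of the script (fitting, `fit_transform`, plotting the result) stays the same either way, which is the main appeal of the cuML approach.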


