<?xml version="1.0" encoding="UTF-8"?><rss xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:atom="http://www.w3.org/2005/Atom" version="2.0"><channel><title><![CDATA[Obsidian Vault]]></title><description><![CDATA[Obsidian digital garden]]></description><link>http://github.com/dylang/node-rss</link><image><url>site-lib/media/favicon.png</url><title>Obsidian Vault</title><link></link></image><generator>Webpage HTML Export plugin for Obsidian</generator><lastBuildDate>Tue, 31 Mar 2026 11:13:04 GMT</lastBuildDate><atom:link href="site-lib/rss.xml" rel="self" type="application/rss+xml"/><pubDate>Tue, 31 Mar 2026 11:11:52 GMT</pubDate><ttl>60</ttl><dc:creator></dc:creator><item><title><![CDATA[Setting up LM Studio]]></title><description><![CDATA[As of current versions, LM Studio does not have a built-in proxy settings menu within the application GUI. However, you can usually work around this with system environment variables or manual "sideloading." LM Studio, like many Electron-based apps, often inherits the proxy settings from your operating system's environment variables. For Windows: Search for "Edit the system environment variables" in the Start menu. Click Environment Variables. Under User variables, click New and add the following: Variable name: HTTP_PROXY. Variable value: http://your-proxy-address:port (e.g., http://proxy.company.com:8080). Repeat for HTTPS_PROXY. Restart LM Studio completely (close it from the system tray as well). For macOS / Linux: Launch LM Studio from your terminal with the proxy variables prepended:
export http_proxy=http://your-proxy-address:port
export https_proxy=http://your-proxy-address:port
/Applications/LM\ Studio.app/Contents/MacOS/LM\ Studio
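The launch-with-proxy step can also be scripted; a minimal Python sketch (the application path and the proxy address are placeholders, not real values):

```python
import os
import subprocess

# Copy the current environment and add the proxy variables
# (address and port below are placeholders).
env = dict(os.environ)
env["HTTP_PROXY"] = "http://your-proxy-address:8080"
env["HTTPS_PROXY"] = "http://your-proxy-address:8080"

# Example macOS path; adjust for your platform and install location.
app = "/Applications/LM Studio.app/Contents/MacOS/LM Studio"
# subprocess.Popen([app], env=env)  # uncomment to actually launch

print("HTTPS_PROXY" in env)  # True
```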
If the proxy variables don't work (common if your company uses a "Man-in-the-Middle" SSL inspection that LM Studio doesn't trust), you can download the models manually using your browser, which already has the proxy configured. Download the Model: Go to <a data-tooltip-position="top" aria-label="https://huggingface.co/" rel="noopener nofollow" class="external-link is-unresolved" href="https://huggingface.co/" target="_self">Hugging Face</a> and find the .gguf file you want. Locate LM Studio's Folder: Windows: C:\Users\&lt;YourName&gt;\.cache\lm-studio\models; Mac/Linux: ~/.cache/lm-studio/models. Place the File: Create a subfolder inside models named after the creator (e.g., TheBloke) and then another folder for the model name. Place your .gguf file there. Refresh: Open LM Studio and go to the "My Models" tab (the folder icon); it should now appear there without needing a connection. ]]></description><link>online-vault/tutoriels/setting-up-lm-studio.html</link><guid isPermaLink="false">Online Vault/Tutoriels/Setting up LM Studio.md</guid><pubDate>Tue, 31 Mar 2026 11:02:25 GMT</pubDate></item><item><title><![CDATA[Data Leakage]]></title><description><![CDATA[We have two sources of data leakage:<br><a data-href="Normalization data leak" href="online-vault/ml-concepts/normalization-data-leak.html" class="internal-link" target="_self" rel="noopener nofollow">Normalization data leak</a>: applied after train/test split, fit on train, transform test<br><a data-href="PCA data leak" href="online-vault/ml-concepts/pca-data-leak.html" class="internal-link" target="_self" rel="noopener nofollow">PCA data leak</a> ]]></description><link>online-vault/ml-concepts/data-leakage.html</link><guid isPermaLink="false">Online Vault/ML concepts/Data Leakage.md</guid><pubDate>Tue, 31 Mar 2026 10:58:59 GMT</pubDate></item><item><title><![CDATA[PCA data leak]]></title><description><![CDATA[Part of a bigger <a data-href="Data Leakage" href="online-vault/ml-concepts/data-leakage.html" class="internal-link" target="_self" rel="noopener nofollow">Data Leakage</a> problem. You should perform PCA AFTER the train/test split, not before. Here's why: when you fit PCA on the entire dataset before splitting:
The PCA transformation learns patterns from the test set
The principal components "know about" the test data distribution
Your model gets inflated performance metrics because the features were optimized using information from data it will later "see"
This is data leakage, even though the test samples themselves aren't in the training set. A much safer scenario is the following. If you have:
One PCA on 7 variables for one variable group
Another PCA on 7 variables for a different variable group
And you're fitting both after the train/test split, the leakage is minimal to negligible because:
✅ Each PCA is fit only on the training data
✅ The transformation is then applied to test data (transform only, no fit)
✅ You have two independent PCAs, which is fine as long as each is separately fitted on train data only.
Leakage magnitude: nearly zero if done correctly.
The same principle applies, in fact even more strictly, with cross-validation:

# ❌ WRONG: PCA before splitting
pca = PCA(n_components=2)
X_transformed = pca.fit_transform(X)  # fitted on the ENTIRE dataset
X_train, X_test = train_test_split(X_transformed, ...)

# ✅ CORRECT: PCA inside the cross-validation loop
for train_idx, val_idx in cross_validator.split(X, y):
    X_train_fold = X[train_idx]
    X_val_fold = X[val_idx]
    pca = PCA(n_components=2)
    X_train_transformed = pca.fit_transform(X_train_fold)
    X_val_transformed = pca.transform(X_val_fold)  # only transform, don't fit
With cross-validation, fit PCA on each fold's training data only, then transform the validation fold.

# 1. Split first
X_train, X_test, y_train, y_test = train_test_split(X, y, ...)

# 2. Fit separate scalers and PCAs on TRAINING data only
scaler_train = StandardScaler()
X_train_scaled = scaler_train.fit_transform(X_train)
pca_1 = PCA(n_components=2)
pca_1_train = pca_1.fit_transform(X_train_scaled[:, group_1_indices])

# 3. Apply the SAME transformers to test data (transform only)
X_test_scaled = scaler_train.transform(X_test)  # use train's scaler
pca_1_test = pca_1.transform(X_test_scaled[:, group_1_indices])  # use train's PCA
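The idiomatic way to rule out this class of leak is to wrap the scaler and PCA in a scikit-learn Pipeline, so both are re-fit on the training portion of every split. A minimal sketch on synthetic data (names and settings are illustrative, not from the note above):

```python
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Synthetic data: 200 samples, 7 features, binary target
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 7))
y = (X[:, 0] + X[:, 1] > 0).astype(int)

# The scaler and PCA are re-fit inside every CV training fold,
# so no test-fold statistics ever leak into the transformation.
pipe = Pipeline([
    ("scale", StandardScaler()),
    ("pca", PCA(n_components=2)),
    ("clf", LogisticRegression()),
])
scores = cross_val_score(pipe, X, y, cv=5)
print(scores.shape)  # one accuracy score per fold
```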
]]></description><link>online-vault/ml-concepts/pca-data-leak.html</link><guid isPermaLink="false">Online Vault/ML concepts/PCA data leak.md</guid><pubDate>Tue, 31 Mar 2026 10:58:36 GMT</pubDate></item><item><title><![CDATA[Ensemble of Ensembles]]></title><description><![CDATA[See the Mistral report: <a data-href="Ensemble Methods in Machine Learning Theoretical Foundations and Practical Applications in Habitat Mapping.pdf" href="carhab-spécifique/rapports-mistral/ensemble-methods-in-machine-learning-theoretical-foundations-and-practical-applications-in-habitat-mapping.html" class="internal-link" target="_self" rel="noopener nofollow">Ensemble Methods in Machine Learning Theoretical Foundations and Practical Applications in Habitat Mapping.pdf</a><br>Ensemble models (<a data-href="Ensemble methods" href="online-vault/ml-concepts/ensemble-methods.html" class="internal-link" target="_self" rel="noopener nofollow">Ensemble methods</a>) are <a data-href="Machine Learning" href="online-vault/ml-concepts/machine-learning.html" class="internal-link" target="_self" rel="noopener nofollow">Machine Learning</a> models composed of sub-models that vote, with the final decision taken by majority vote (or another voting scheme).<br>The fundamental <a data-tooltip-position="top" aria-label="Bias Variance Tradeoff" data-href="Bias Variance Tradeoff" href="online-vault/ml-concepts/bias-variance-tradeoff.html" class="internal-link" target="_self" rel="noopener nofollow">bias-variance</a> decomposition theorem (Geman et al.&nbsp;1992, Domingos 2000) states that a model's error can be decomposed into the sum of the squared bias and the variance (see <a data-tooltip-position="top" aria-label="https://fr.wikipedia.org/wiki/Dilemme_biais-variance" rel="noopener nofollow" class="external-link is-unresolved" href="https://fr.wikipedia.org/wiki/Dilemme_biais-variance" target="_self">wikipedia</a>):<br><img alt="Pasted image 20260330170858.png" src="online-vault/images/pasted-image-20260330170858.png" target="_self"><br>Individual models can have high bias (underfitting) or high variance (<a data-tooltip-position="top" aria-label="overfitting" data-href="overfitting" href="online-vault/ml-concepts/overfitting.html" class="internal-link" target="_self" rel="noopener nofollow">overfitting</a>). Ensembles combine several models to reduce both components. The choice of ensemble is subjective, depending on the data and the complexity of the problem.
A heterogeneous ensemble can be better suited to complex problems.<br><a data-href="Stacking" href="online-vault/ml-concepts/stacking.html" class="internal-link" target="_self" rel="noopener nofollow">Stacking</a> (or stacked generalization) is an advanced ensemble method in which a meta-model (or meta-learner) is trained to combine the predictions of several base models (base learners). Unlike bagging (e.g. <a data-href="Random Forest" href="online-vault/ml-concepts/models/random-forest.html" class="internal-link" target="_self" rel="noopener nofollow">Random Forest</a>) or boosting (e.g. <a data-href="XGBoost" href="online-vault/ml-concepts/models/xgboost.html" class="internal-link" target="_self" rel="noopener nofollow">XGBoost</a>), which use simple combination rules (majority vote, weighted average), stacking uses a higher-level model to learn the best way to fuse the base models' predictions.
Foundation: based on the law of large numbers: by averaging the predictions of similar models (e.g. decision trees), the variance of the error decreases.
Variance reduction: each model is trained on a subset of the data (bagging) or weighted by its errors (boosting). Theory: <br>Breiman's <a data-href="Random forest" href="online-vault/ml-concepts/models/random-forest.html" class="internal-link" target="_self" rel="noopener nofollow">Random forest</a> paper (1996) proved that bagging reduces variance without increasing bias.
<br>Friedman (2001) showed that <a data-href="Gradient Boosting" href="online-vault/ml-concepts/models/gradient-boosting.html" class="internal-link" target="_self" rel="noopener nofollow">Gradient Boosting</a> reduces bias by focusing on the errors of the previous models. Limitations: a bias shared by all the models (e.g. the bias of decision trees).
Less effective on heterogeneous or multi-source data. Foundation: based on the diversity of biases: each model captures different patterns (e.g. SVMs for wide margins, neural networks for complex relationships).
<br><a data-href="théorème de la diversité" href="online-vault/ml-concepts/théorème-de-la-diversité.html" class="internal-link" target="_self" rel="noopener nofollow">théorème de la diversité</a> (diversity theorem, Krogh &amp; Vedelsby, 1995):
The error of an ensemble is minimized when the models are diverse and weakly correlated. Theory: <br><a data-tooltip-position="top" aria-label="https://link.springer.com/content/pdf/10.1023/A:1007607513941.pdf" rel="noopener nofollow" class="external-link is-unresolved" href="https://link.springer.com/content/pdf/10.1023/A:1007607513941.pdf" target="_self">Dietterich (2000)</a> showed that diversity is crucial for outperforming individual models: "It was shown that apart from getting the individual members of the ensemble to generalize well, it is important for generalization that the individuals disagree as much as possible"
Heterogeneous ensembles exploit the complementarity of models to reduce both bias and variance. Advantages: better generalization on complex or multi-source data.
Increased robustness to noise and class imbalance. Note that Dietterich (2000) does not cover heterogeneous ensembles: his conclusions apply to (homogeneous) ensembles of decision trees. For heterogeneous ensembles, one should rely on: the complementarity of the models (e.g. SVM + RF + NN).
Diversity metrics (Q-statistic, Double Fault).
Calibration of the outputs to make them comparable. References such as Kuncheva &amp; Whitaker (2003) or Waske &amp; van der Linden (2008) are better suited to justifying the use of heterogeneous ensembles. Risks of heterogeneous ensembles: Output incompatibility: some models produce poorly calibrated probabilities (e.g. SVM vs. RF). Increased complexity: combining heterogeneous models can introduce noise or conflicts if the biases are not complementary. Lack of formal proofs: unlike homogeneous ensembles (where theorems such as Breiman's or Krogh &amp; Vedelsby's apply), heterogeneous ensembles rest more on empirical observations than on solid theoretical guarantees. What the literature actually says: for heterogeneous ensembles, works such as Kuncheva &amp; Whitaker (2003) or Brown et al. (2005) are more relevant.
These studies show that diversity is necessary but not sufficient: the models must also be complementary (e.g. one model strong on precision, another on recall). ]]></description><link>online-vault/ml-concepts/ensemble-of-ensembles.html</link><guid isPermaLink="false">Online Vault/ML concepts/Ensemble of Ensembles.md</guid><pubDate>Tue, 31 Mar 2026 10:56:36 GMT</pubDate><enclosure url="." length="0" type="false"/><content:encoded>&lt;figure&gt;&lt;img src=&quot;.&quot;&gt;&lt;/figure&gt;</content:encoded></item><item><title><![CDATA[Stacking]]></title><description><![CDATA[Stacking (or stacked generalization) is an advanced ensemble method in which a meta-model (or meta-learner) is trained to combine the predictions of several base models (base learners). Unlike bagging (e.g. Random Forest) or boosting (e.g. XGBoost), which use simple combination rules (majority vote, weighted average), stacking uses a higher-level model to learn the best way to fuse the base models' predictions. Training the base models: the data are split into k folds (cross-validation).
Each base model is trained on k−1 folds and predicts on the remaining fold.
The predictions on the held-out folds form a new dataset (meta-features). Training the meta-model: the meta-model is trained on these meta-features (the base models' predictions).
It learns to weight or combine these predictions to minimize the overall error. Final prediction: the base models make predictions on new data.
The meta-model combines these predictions to produce the final prediction. Base models: Random Forest (for tabular variables), SVM (for non-linear relationships), and a convolutional neural network (for satellite images).
Meta-model: logistic regression or XGBoost.
Result: better accuracy in classifying complex habitats (e.g. wetlands, fragmented forests). Theory: Wolpert (1992) demonstrated that stacking can reduce the generalization error by learning an optimal combination of the base models' predictions.
Proof sketch: the meta-model minimizes the mean squared error (MSE) or the cross-entropy, exploiting the complementary strengths of the base models. Bias and variance: stacking reduces bias by combining models with different biases (e.g. SVM vs. Random Forest).
It reduces variance by using a meta-model to smooth the base models' predictions. Diversity theorem (Krogh &amp; Vedelsby, 1995):
The error of an ensemble is bounded by the average error of the models and their diversity. Stacking maximizes the benefit of this diversity by learning an optimal combination.
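The procedure above maps directly onto scikit-learn's StackingClassifier, which generates the out-of-fold meta-features internally. A minimal sketch on synthetic data (the model choices and settings are illustrative):

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier, StackingClassifier
from sklearn.svm import SVC
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(42)
X = rng.normal(size=(300, 6))
y = (X[:, 0] * X[:, 1] + X[:, 2] > 0).astype(int)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Base learners produce out-of-fold predictions (cv=5) that become
# the meta-features; the logistic regression is the meta-model.
stack = StackingClassifier(
    estimators=[
        ("rf", RandomForestClassifier(n_estimators=50, random_state=0)),
        ("svm", SVC(probability=True, random_state=0)),
    ],
    final_estimator=LogisticRegression(),
    cv=5,
)
stack.fit(X_train, y_train)
proba = stack.predict_proba(X_test)
print(proba.shape)  # (n_test_samples, n_classes)
```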
]]></description><link>online-vault/ml-concepts/stacking.html</link><guid isPermaLink="false">Online Vault/ML concepts/Stacking.md</guid><pubDate>Tue, 31 Mar 2026 10:56:29 GMT</pubDate></item><item><title><![CDATA[théorème de la diversité]]></title><description><![CDATA[The "théorème de la diversité" (diversity theorem) comes from the foundational paper by Krogh and Vedelsby (1995), titled "Neural Network Ensembles, Cross Validation, and Active Learning"
Authors: A. Krogh and J. A. Vedelsby
Conference: Advances in Neural Information Processing Systems (NIPS 1995)
Link to the paper: <a data-tooltip-position="top" aria-label="https://papers.nips.cc/paper/1995/file/0f38a1f59e5b7e71b3b0a4e79129c914-Paper.pdf" rel="noopener nofollow" class="external-link is-unresolved" href="https://papers.nips.cc/paper/1995/file/0f38a1f59e5b7e71b3b0a4e79129c914-Paper.pdf" target="_self">NIPS 1995 Proceedings (p. 231-238)</a>
Krogh and Vedelsby showed that the mean squared error (MSE) of an ensemble of models equals the average error of the individual models minus a term called the "ambiguity", which measures the diversity among the models. Mathematically: E = Ē - A, where:
Ē: average error of the base models.
A: ambiguity, which quantifies the diversity of the models' predictions.
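The decomposition can be checked numerically: for an equally weighted regression ensemble, the squared error of the ensemble equals the mean squared error of the members minus the mean squared deviation of the members from the ensemble mean. A minimal numpy check (the predictions are synthetic, not from the paper):

```python
import numpy as np

rng = np.random.default_rng(1)
y = rng.normal(size=50)                  # targets
preds = y + rng.normal(size=(5, 50))     # 5 noisy member predictions

f_bar = preds.mean(axis=0)               # equally weighted ensemble prediction

ensemble_err = np.mean((f_bar - y) ** 2)    # E: ensemble error
avg_member_err = np.mean((preds - y) ** 2)  # E-bar: average member error
ambiguity = np.mean((preds - f_bar) ** 2)   # A: spread around the ensemble

# Krogh & Vedelsby identity: E = E-bar - A
print(np.isclose(ensemble_err, avg_member_err - ambiguity))  # True
```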
Interpretation:
The more diverse the models (that is, the less correlated their errors), the lower the ensemble error. This theorem justifies the use of heterogeneous ensembles and of methods such as bagging and boosting to improve performance.]]></description><link>online-vault/ml-concepts/théorème-de-la-diversité.html</link><guid isPermaLink="false">Online Vault/ML concepts/théorème de la diversité.md</guid><pubDate>Mon, 30 Mar 2026 15:41:49 GMT</pubDate></item><item><title><![CDATA[Regularization]]></title><description><![CDATA[<a data-tooltip-position="top" aria-label="https://www.ibm.com/think/topics/regularization#:~:text=Regularization%20is%20a%20set%20of,overfitting%20in%20machine%20learning%20models." rel="noopener nofollow" class="external-link is-unresolved" href="https://www.ibm.com/think/topics/regularization#:~:text=Regularization%20is%20a%20set%20of,overfitting%20in%20machine%20learning%20models." target="_self">IBM article Source</a>Regularization is a set of methods for reducing overfitting in machine learning models. Typically, regularization trades a marginal decrease in training accuracy for an increase in generalizability.<br>Regularization encompasses a range of techniques to correct for&nbsp;<a data-tooltip-position="top" aria-label="https://www.ibm.com/topics/overfitting" rel="noopener nofollow" class="external-link is-unresolved" href="https://www.ibm.com/topics/overfitting" target="_self">overfitting</a>&nbsp;in machine learning models. As such, regularization is a method for increasing a model’s generalizability—that is, its ability to produce accurate predictions on new datasets. Regularization provides this increased generalizability at the cost of increased training error. In other words, regularization methods typically lead to less accurate predictions on training data but more accurate predictions on test data. Regularization differs from optimization. 
Essentially, the former increases model generalizability while the latter increases model training accuracy. Both are important concepts in machine learning and data science. There are many forms of regularization; a complete guide would require book-length treatment. Nevertheless, this article provides an overview of the theory necessary to understand regularization’s purpose in machine learning as well as a survey of several popular regularization techniques.<br>See: <a data-href="Bias Variance Tradeoff" href="online-vault/ml-concepts/bias-variance-tradeoff.html" class="internal-link" target="_self" rel="noopener nofollow">Bias Variance Tradeoff</a><br>Regularization techniques are essential for preventing <a data-href="overfitting" href="online-vault/ml-concepts/overfitting.html" class="internal-link" target="_self" rel="noopener nofollow">overfitting</a> and improving the generalization of <a data-href="neural networks" href="online-vault/ml-concepts/neural-networks.html" class="internal-link" target="_self" rel="noopener nofollow">neural networks</a>. Here are some widely used regularization techniques that are easy to implement:<br>Dropout: Description: Randomly sets a fraction of input units to zero at each update during training time, which helps prevent overfitting.
Implementation: Add a dropout layer with a specified dropout rate (e.g., 0.5) after dense layers. <br>
<a data-href="L1 Regularization (Lasso)" href="online-vault/ml-concepts/l1-regularization-(lasso).html" class="internal-link" target="_self" rel="noopener nofollow">L1 Regularization (Lasso)</a>: Description: Adds a penalty equal to the absolute value of the magnitude of coefficients to the loss function.
Implementation: Add L1 regularization to the kernel or bias of dense layers. <br>
<a data-href="L2 Regularization (Ridge)" href=".html" class="internal-link" target="_self" rel="noopener nofollow">L2 Regularization (Ridge)</a>: Description: Adds a penalty equal to the square of the magnitude of coefficients to the loss function.
Implementation: Add L2 regularization to the kernel or bias of dense layers. <br>Early Stopping: Description: Stops training when the performance on a validation set starts to degrade, indicating overfitting.
Implementation: Use a callback in Keras to monitor validation loss and stop training when it stops improving. <br>
<a data-href="Batch Normalization" href=".html" class="internal-link" target="_self" rel="noopener nofollow">Batch Normalization</a>: Description: Normalizes the inputs of each layer to have a mean of zero and a variance of one, which helps stabilize and accelerate training.
<br>Implementation: Add a batch <a data-href="normalisation" href="normalisation.html" class="internal-link" target="_self" rel="noopener nofollow">normalisation</a> layer after dense layers. <br>Data Augmentation: Description: Generates new training samples by applying random transformations to the existing data, which helps the model generalize better.
Implementation: Use data augmentation techniques like rotation, scaling, and flipping for image data. <br>Weight Regularization: Description: Applies regularization directly to the weights of the network, encouraging smaller weights.
Implementation: Use kernel regularizers in dense layers. <br>Learning Rate Scheduling: Description: Adjusts the learning rate during training to improve convergence and generalization.
Implementation: Use learning rate schedulers or adaptive learning rate optimizers like Adam. <br>Ensemble Methods: Description: Combines predictions from multiple models to improve performance and reduce overfitting.
Implementation: Train multiple models with different architectures or initializations and average their predictions. <br>Noise Injection: Description: Adds random noise to the inputs or hidden layers during training to improve robustness.
Implementation: Add Gaussian noise layers to the network. ]]></description><link>online-vault/ml-concepts/regularization.html</link><guid isPermaLink="false">Online Vault/ML concepts/Regularization.md</guid><pubDate>Mon, 30 Mar 2026 15:10:32 GMT</pubDate></item><item><title><![CDATA[Bias Variance Tradeoff]]></title><description><![CDATA[<img alt="bias_variance_curve.png" src="online-vault/images/bias_variance_curve.png" target="_self">
This concession of increased training error for decreased testing error is known as bias-variance tradeoff. Bias-variance tradeoff is a well-known problem in machine learning. It’s necessary to first define “bias” and “variance.” To put it briefly:<br>- Bias measures the average difference between predicted values and true values. As bias increases, a model predicts less accurately on a training dataset. High bias refers to high error in training.<br>- Variance measures the difference between predictions across various realizations of a given model. As variance increases, a model predicts less accurately on unseen data. High variance refers to high error during testing and validation.<br><img alt="bias_variance_tradeoff.png" src="online-vault/images/bias_variance_tradeoff.png" target="_self">
Bias and variance thus inversely represent model accuracy on training and test sets respectively.&nbsp;Obviously, developers aim to reduce both model bias and variance. Simultaneous reduction in both is not always possible, resulting in the need for regularization. Regularization decreases model variance at the cost of increased bias.]]></description><link>online-vault/ml-concepts/bias-variance-tradeoff.html</link><guid isPermaLink="false">Online Vault/ML concepts/Bias Variance Tradeoff.md</guid><pubDate>Mon, 30 Mar 2026 15:10:15 GMT</pubDate><enclosure url="." length="0" type="false"/><content:encoded>&lt;figure&gt;&lt;img src=&quot;.&quot;&gt;&lt;/figure&gt;</content:encoded></item><item><title><![CDATA[Pasted image 20260330170858]]></title><description><![CDATA[<img src="online-vault/images/pasted-image-20260330170858.png" target="_self">]]></description><link>online-vault/images/pasted-image-20260330170858.html</link><guid isPermaLink="false">Online Vault/Images/Pasted image 20260330170858.png</guid><pubDate>Mon, 30 Mar 2026 15:08:58 GMT</pubDate><enclosure url="." length="0" type="false"/><content:encoded>&lt;figure&gt;&lt;img src=&quot;.&quot;&gt;&lt;/figure&gt;</content:encoded></item><item><title><![CDATA[Ensemble methods]]></title><link>online-vault/ml-concepts/ensemble-methods.html</link><guid isPermaLink="false">Online Vault/ML concepts/Ensemble methods.md</guid><pubDate>Mon, 30 Mar 2026 15:05:39 GMT</pubDate></item><item><title><![CDATA[XGBoost probability map]]></title><description><![CDATA[XGBoost can also produce probability maps (or probability estimates) for classification tasks, much like Random Forest. Here’s how it works and how it compares:
For Binary Classification:
XGBoost outputs a raw score (log-odds) for each sample, which is then transformed into a probability using the logistic (sigmoid) function:
P(y=1) = 1 / (1 + e^(-s)), where s is the raw score. This gives you the probability that the sample belongs to the positive class (e.g., class "1").
For Multi-Class Classification:
XGBoost uses the softmax function to convert raw scores into probabilities for each class, ensuring they sum to 1:
P(y=i) = e^(s_i) / Σ_{j=1}^{K} e^(s_j), where s_i is the raw score for class i and K is the number of classes.
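The softmax conversion itself is a one-liner; a minimal numpy sketch (the raw scores below are made up):

```python
import numpy as np

def softmax(scores):
    # Subtract the max for numerical stability; the result is unchanged.
    z = np.exp(scores - np.max(scores))
    return z / z.sum()

raw = np.array([1.2, 0.3, -0.8])   # raw per-class scores for one pixel
probs = softmax(raw)
print(probs)  # three probabilities summing to 1
```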
Input: Like Random Forest, XGBoost can take spatial features (e.g., satellite imagery bands, elevation, texture metrics) as input for each pixel or spatial unit.
Output: For each pixel, XGBoost predicts the probability of belonging to a specific class (e.g., "forest," "urban," "water").
Result: You can visualize these probabilities as a probability map, where each pixel’s value represents the confidence of the model in its prediction. Advantages: Often achieves higher accuracy than Random Forest, especially with structured/tabular data.
Handles large datasets efficiently (scalable with GPU support).
Can incorporate regularization (e.g., L1/L2) to avoid overfitting. Disadvantages: Less interpretable than Random Forest (harder to debug).
Requires careful tuning of hyperparameters (e.g., learning_rate, n_estimators). Suppose you’re mapping land cover classes (e.g., "forest," "grassland," "urban") using satellite imagery:
Features: Pixel values from spectral bands (e.g., NDVI, NIR, Red).
Training: XGBoost learns to predict the class probabilities for each pixel.
Output: A probability map for each class, where each pixel shows the likelihood of belonging to "forest," "grassland," etc. In Python (using xgboost or scikit-learn API), you can directly predict probabilities: predict_proba outputs a matrix of shape (n_samples, n_classes), where each row sums to 1. Use tools like Matplotlib, Rasterio, or QGIS to plot the probabilities spatially.
Example: A heatmap where color intensity represents the probability of "forest" for each pixel.
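Such a heatmap can be produced with matplotlib; a minimal sketch where a random grid stands in for the per-pixel probabilities (file name and grid size are arbitrary):

```python
import numpy as np
import matplotlib
matplotlib.use("Agg")  # headless backend, no display needed
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
forest_prob = rng.uniform(size=(50, 50))  # stand-in for per-pixel P("forest")

# Render the probability grid as a heatmap with a fixed [0, 1] color scale.
fig, ax = plt.subplots()
im = ax.imshow(forest_prob, cmap="viridis", vmin=0.0, vmax=1.0)
fig.colorbar(im, label='P("forest")')
fig.savefig("forest_probability_map.png")
```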
Using multi:softmax in XGBoost will still work for generating probability maps, but there are important differences compared to multi:softprob that you should be aware of. Here’s what you need to know:
Technically, yes, but not directly. When you train XGBoost with multi:softmax, it internally uses the softmax function to convert raw scores into probabilities during training. However, by default, the predict method returns only the class with the highest probability (not the probabilities themselves).
To get probabilities with multi:softmax, you can call predict_proba (if using the scikit-learn API) or request the raw margin scores in the native XGBoost API. The predict_proba method returns probabilities even if the objective is multi:softmax, because the scikit-learn API handles this internally.
If you’re using the native XGBoost API (e.g., xgb.train), you need to:
Set objective="multi:softmax".
Predict with output_margin=True and manually apply softmax to the raw scores. Convenience: multi:softprob is designed to output probabilities directly, so you don’t need to manually apply softmax.
Calibration: Probabilities from multi:softprob are often better calibrated (closer to true probabilities) because the objective explicitly optimizes for probability estimation.
Clarity: It makes your intent clear (probability estimation vs. class prediction). If you’re generating probability maps (e.g., for land cover classification), both multi:softmax and multi:softprob can work, but: multi:softprob is simpler and more reliable for probability outputs.
multi:softmax requires extra steps (e.g., manual softmax) if you’re not using the scikit-learn API. The quality of the probabilities may differ slightly due to differences in optimization. Use multi:softmax if you only care about the predicted class (not probabilities).
Use multi:softprob if you need well-calibrated probabilities (e.g., for uncertainty estimation or probability maps). Yes, multi:softmax can generate probabilities, but you’ll need to handle the softmax conversion manually unless you’re using the scikit-learn API.
For probability maps, multi:softprob is the better choice because it’s designed for this purpose and simplifies the workflow.
]]></description><link>online-vault/ml-concepts/xgboost-probability-map.html</link><guid isPermaLink="false">Online Vault/ML concepts/XGBoost probability map.md</guid><pubDate>Thu, 26 Mar 2026 15:40:38 GMT</pubDate></item><item><title><![CDATA[cartes de probabilité Random Forest]]></title><description><![CDATA[Generating a probability map from a Random Forest model relies on the model's ability to estimate, for each observation or pixel (in the case of spatial mapping), the probability of belonging to a given class. For XGBoost, see <a data-href="XGBoost probability map" href="online-vault/ml-concepts/xgboost-probability-map.html" class="internal-link" target="_self" rel="noopener nofollow">XGBoost probability map</a>. Random Forest is a supervised learning model that builds an ensemble of independent decision trees. Each tree votes for the most likely class, and the final class is determined by a majority vote.
For each tree: at prediction time, a new sample (for example, a pixel with its features) is passed down the tree to a leaf, which contains a probability distribution over the classes.
Aggregation of the probabilities: the model averages the probabilities predicted by all the trees to obtain a final probability per class.
For each observation (pixel, point, etc.), the Random Forest does not just predict the majority class; it can also provide the probability of belonging to each class. For example:
Class A: 70%
Class B: 20%
Class C: 10%
These probabilities are computed by aggregating the predictions of all the trees in the forest. In a mapping context (for example, a land-cover or risk map):
Each pixel of the map is represented by a set of explanatory variables (for example, vegetation indices, elevation, slope, etc.).
The Random Forest model is trained on reference data (already-labeled pixels) to learn to associate these variables with a class.
Une fois entraîné, le modèle peut prédire, pour chaque pixel de la zone d’intérêt, la probabilité d’appartenance à chaque classe. Sélection d’une classe cible : Vous choisissez la classe pour laquelle vous souhaitez visualiser la probabilité (par exemple, "forêt", "zone inondable", etc.).
Extraction des probabilités : Pour chaque pixel, vous récupérez la probabilité prédite par le modèle pour cette classe.
Visualisation : Ces probabilités sont ensuite représentées sur une carte sous forme de dégradé de couleurs (par exemple, du bleu pour une faible probabilité au rouge pour une forte probabilité). Une probabilité proche de 1 indique une forte confiance du modèle dans l’appartenance du pixel à la classe cible.
Une probabilité proche de 0 indique une faible probabilité.
Les zones avec des probabilités intermédiaires peuvent correspondre à des zones de transition ou d’incertitude. Transparence : Contrairement à une carte de classification binaire, une carte de probabilité montre les nuances et les incertitudes du modèle.
Flexibilité : Vous pouvez choisir un seuil de probabilité pour convertir la carte de probabilité en une carte binaire (par exemple, "si probabilité &gt; 0.7, alors classe A").
Analyse des erreurs : Les zones avec des probabilités faibles ou intermédiaires peuvent être ciblées pour une vérification sur le terrain ou une amélioration du modèle.
Imaginons une carte de risque d’inondation :
Le modèle prédit, pour chaque pixel, la probabilité qu’il soit inondé.
La carte de probabilité montre en rouge les zones où le risque est élevé (probabilité &gt; 0.8), en orange les zones intermédiaires (0.5–0.8), et en jaune les zones à faible risque (&lt; 0.5).
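In scikit-learn, these per-pixel class probabilities come from predict_proba. A minimal sketch on synthetic data (the two features stand in for, say, a vegetation index and elevation; the data is invented):

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
# toy "pixels": 200 samples with 2 features each
X = rng.normal(size=(200, 2))
y = (X[:, 0] + X[:, 1] > 0).astype(int)  # synthetic binary labels

clf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)

# probability of class 1 for each pixel: the raw material of a probability map
proba = clf.predict_proba(X)[:, 1]
```

Reshaping proba back to the raster's height and width gives the probability map to display.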
The Random Forest generates probability maps by aggregating the predictions of many decision trees, which makes it possible to visualize not only the predicted class but also the model's confidence in that prediction. This makes spatial analysis richer and more informative.]]></description><link>online-vault/ml-concepts/cartes-de-probabilité-random-forest.html</link><guid isPermaLink="false">Online Vault/ML concepts/cartes de probabilité Random Forest.md</guid><pubDate>Thu, 26 Mar 2026 15:35:11 GMT</pubDate></item><item><title><![CDATA[Isolation Forest]]></title><description><![CDATA[<img alt="python_anomaly_detection_isolation_forest.gif" src="online-vault/images/python_anomaly_detection_isolation_forest.gif" target="_self">Isolation Forest is an unsupervised learning algorithm designed to detect anomalies (outliers) in a dataset. Unlike classical methods (such as clustering or statistical models), it relies on the idea that anomalies are easier to isolate than normal points, because they lie in less dense regions of the feature space.
Isolation by random partitioning:
The algorithm builds an ensemble of isolation trees (iTrees) by randomly selecting a feature and a split value (between that feature's min and max) to divide the data. Anomalies are isolated in fewer steps (shallower tree depth) than normal points. Anomaly score:
For each observation, an anomaly score is computed from the average depth of the trees needed to isolate it.
The average path length from the root of the tree to the leaf node where a point is isolated serves as its anomaly score. Score close to 1: likely anomaly.
Score close to 0: normal point. Efficiency: fast even on large datasets (linear complexity).
Unsupervised: requires no labelled data.
Robust: performs well in high-dimensional spaces and with noisy data.
Interpretable: the isolation trees make the partitions easy to visualize. Fraud detection (abnormal bank transactions).
Monitoring of industrial systems (sensor failures).
Log analysis (suspicious behaviour in networks).
Data cleaning (identifying outliers). n_estimators: number of isolation trees (default 100).
max_samples: sample size used to build each tree (default 256).
contamination: expected proportion of anomalies in the data (default "auto"). Scale sensitivity: the data should be normalized if the features have very different scales.
Difficulty with local anomalies: less effective at detecting anomalies within dense subgroups.
Score interpretation: the threshold used to flag an observation as an "anomaly" can be subjective.
Sensitivity to correlated features: may cause unnecessary splits, reducing accuracy.
Limited for sequential data: not ideal for time series or dependent data.
In a dataset of bank transactions:
Normal transactions are grouped in dense regions.
An abnormally large or isolated transaction will be detected as an anomaly with a score close to 1. Scikit-learn (sklearn.ensemble.IsolationForest): the standard Python implementation.
Extensions: variants such as Extended Isolation Forest offer performance improvements.
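A minimal scikit-learn sketch on synthetic 2-D data (the dense cluster and the two far-away outliers are invented for illustration):

```python
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(42)
normal = rng.normal(loc=0.0, scale=1.0, size=(300, 2))  # dense cluster
outliers = np.array([[8.0, 8.0], [-9.0, 7.5]])          # far from the cluster
X = np.vstack([normal, outliers])

iso = IsolationForest(n_estimators=100, contamination="auto",
                      random_state=0).fit(X)
labels = iso.predict(X)  # +1 = normal, -1 = anomaly
```

The two isolated points require far fewer random splits to separate, so they receive the highest anomaly scores and are labelled -1.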
Summary: Isolation Forest is a powerful, scalable method for detecting anomalies without supervision, well suited to large, multidimensional datasets.]]></description><link>online-vault/ml-concepts/isolation-forest.html</link><guid isPermaLink="false">Online Vault/ML concepts/Isolation Forest.md</guid><pubDate>Thu, 26 Mar 2026 13:15:17 GMT</pubDate><enclosure url="." length="0" type="false"/><content:encoded>&lt;figure&gt;&lt;img src=&quot;.&quot;&gt;&lt;/figure&gt;</content:encoded></item><item><title><![CDATA[python_anomaly_detection_isolation_forest]]></title><description><![CDATA[<img src="online-vault/images/python_anomaly_detection_isolation_forest.gif" target="_self">]]></description><link>online-vault/images/python_anomaly_detection_isolation_forest.html</link><guid isPermaLink="false">Online Vault/Images/python_anomaly_detection_isolation_forest.gif</guid><pubDate>Thu, 26 Mar 2026 13:01:10 GMT</pubDate><enclosure url="." length="0" type="false"/><content:encoded>&lt;figure&gt;&lt;img src=&quot;.&quot;&gt;&lt;/figure&gt;</content:encoded></item><item><title><![CDATA[Streamlit Tests]]></title><description><![CDATA[<a data-tooltip-position="top" aria-label="https://www.streamlit.io/" rel="noopener nofollow" class="external-link is-unresolved" href="https://www.streamlit.io/" target="_self">Streamlit</a>&nbsp;is an open-source Python framework for data scientists and AI/ML engineers to deliver dynamic data apps with only a few lines of code. Build and deploy powerful data apps in minutes. Let's get started! Streamlit's architecture allows you to write apps the same way you write plain Python scripts. To unlock this, Streamlit apps have a unique data flow: any time something must be updated on the screen, Streamlit reruns your entire Python script from top to bottom. This can happen in two situations:
Whenever you modify your app's source code. Whenever a user interacts with widgets in the app. For example, when dragging a slider, entering text in an input box, or clicking a button. <br>Whenever a callback is passed to a widget via the&nbsp;on_change&nbsp;(or&nbsp;on_click) parameter, the callback will always run before the rest of your script. For details on the Callbacks API, please refer to our&nbsp;<a data-tooltip-position="top" aria-label="https://docs.streamlit.io/develop/api-reference/caching-and-state/st.session_state#use-callbacks-to-update-session-state" rel="noopener nofollow" class="external-link is-unresolved" href="https://docs.streamlit.io/develop/api-reference/caching-and-state/st.session_state#use-callbacks-to-update-session-state" target="_self">Session State API Reference Guide</a>. Turning a folder of scripts into a package:
Add an __init__.py file so the folder is treated as a package.
Import pipeline_tools.py from other files with a relative import: from .pipeline_tools import * (note: no .py suffix in the import statement).
Run the app as a Python module: python -m streamlit run your_script.py
<br>With&nbsp;<a data-tooltip-position="top" aria-label="https://docs.streamlit.io/develop/api-reference/charts/st.map" rel="noopener nofollow" class="external-link is-unresolved" href="https://docs.streamlit.io/develop/api-reference/charts/st.map" target="_self"><code>st.map()</code></a>&nbsp;you can display data points on a map. Let's use NumPy to generate some sample data and plot it on a map of San Francisco.
import streamlit as st
import numpy as np
import pandas as pd

map_data = pd.DataFrame(
    np.random.randn(1000, 2) / [50, 50] + [37.76, -122.4],
    columns=['lat', 'lon'])

st.map(map_data)
<br>When adding long-running computations to an app, you can use&nbsp;<a data-tooltip-position="top" aria-label="https://docs.streamlit.io/develop/api-reference/status/st.progress" rel="noopener nofollow" class="external-link is-unresolved" href="https://docs.streamlit.io/develop/api-reference/status/st.progress" target="_self"><code>st.progress()</code></a>&nbsp;to display status in real time. First, let's import time. We're going to use the&nbsp;time.sleep()&nbsp;method to simulate a long-running computation:
import streamlit as st
import time

'Starting a long computation...'

# Add a placeholder
latest_iteration = st.empty()
bar = st.progress(0)

for i in range(100):
    # Update the progress bar with each iteration.
    latest_iteration.text(f'Iteration {i+1}')
    bar.progress(i + 1)
    time.sleep(0.1)

'...and now we\'re done!'
Caching allows your app to stay performant even when loading data from the web, manipulating large datasets, or performing expensive computations. The basic idea behind caching is to store the results of expensive function calls and return the cached result when the same inputs occur again. This avoids repeated execution of a function with the same input values. To cache a function in Streamlit, you need to apply a caching decorator to it. You have two choices:<br>
<img alt="caching_streamlit.png" src="online-vault/images/caching_streamlit.png" target="_self">
@st.cache_data
def long_running_function(param1, param2):
    return …
In the above example,&nbsp;long_running_function&nbsp;is decorated with&nbsp;@st.cache_data. As a result, Streamlit notes the following:
The name of the function ("long_running_function").
The value of the inputs (param1,&nbsp;param2).
The code within the function.
Before running the code within&nbsp;long_running_function, Streamlit checks its cache for a previously saved result. If it finds a cached result for the given function and input values, it will return that cached result and not rerun the function's code. Otherwise, Streamlit executes the function, saves the result in its cache, and proceeds with the script run. During development, the cache updates automatically as the function code changes, ensuring that the latest changes are reflected in the cache.<br>As apps grow large, it becomes useful to organize them into <a data-tooltip-position="top" aria-label="https://docs.streamlit.io/develop/concepts/multipage-apps" rel="noopener nofollow" class="external-link is-unresolved" href="https://docs.streamlit.io/develop/concepts/multipage-apps" target="_self">multiple pages</a>. This makes the app easier to manage as a developer and easier to navigate as a user. Streamlit provides a powerful way to create multipage apps using&nbsp;<a data-tooltip-position="top" aria-label="https://docs.streamlit.io/develop/api-reference/navigation/st.page" rel="noopener nofollow" class="external-link is-unresolved" href="https://docs.streamlit.io/develop/api-reference/navigation/st.page" target="_self"><code>st.Page</code></a>&nbsp;and&nbsp;<a data-tooltip-position="top" aria-label="https://docs.streamlit.io/develop/api-reference/navigation/st.navigation" rel="noopener nofollow" class="external-link is-unresolved" href="https://docs.streamlit.io/develop/api-reference/navigation/st.navigation" target="_self"><code>st.navigation</code></a>. Just create your pages and connect them with navigation as follows:
Create an entry point script that defines and connects your pages
Create separate Python files for each page's content
<br>Use&nbsp;<a data-tooltip-position="top" aria-label="https://docs.streamlit.io/develop/api-reference/navigation/st.page" rel="noopener nofollow" class="external-link is-unresolved" href="https://docs.streamlit.io/develop/api-reference/navigation/st.page" target="_self"><code>st.Page</code></a>&nbsp;to define your pages and&nbsp;<a data-tooltip-position="top" aria-label="https://docs.streamlit.io/develop/api-reference/navigation/st.navigation" rel="noopener nofollow" class="external-link is-unresolved" href="https://docs.streamlit.io/develop/api-reference/navigation/st.navigation" target="_self"><code>st.navigation</code></a>&nbsp;to connect them.
]]></description><link>online-vault/software-engineering/streamlit-tests.html</link><guid isPermaLink="false">Online Vault/Software Engineering/Streamlit Tests.md</guid><pubDate>Mon, 23 Mar 2026 16:08:28 GMT</pubDate><enclosure url="." length="0" type="false"/><content:encoded>&lt;figure&gt;&lt;img src=&quot;.&quot;&gt;&lt;/figure&gt;</content:encoded></item><item><title><![CDATA[caching_streamlit]]></title><description><![CDATA[<img src="online-vault/images/caching_streamlit.png" target="_self">]]></description><link>online-vault/images/caching_streamlit.html</link><guid isPermaLink="false">Online Vault/Images/caching_streamlit.png</guid><pubDate>Mon, 23 Mar 2026 16:00:30 GMT</pubDate><enclosure url="." length="0" type="false"/><content:encoded>&lt;figure&gt;&lt;img src=&quot;.&quot;&gt;&lt;/figure&gt;</content:encoded></item><item><title><![CDATA[Curse of dimensionality]]></title><description><![CDATA[<img alt="curse_dimensionality.png" src="online-vault/images/curse_dimensionality.png" target="_self">]]></description><link>online-vault/ml-concepts/curse-of-dimensionality.html</link><guid isPermaLink="false">Online Vault/ML concepts/Curse of dimensionality.md</guid><pubDate>Mon, 23 Mar 2026 15:41:13 GMT</pubDate><enclosure url="." length="0" type="false"/><content:encoded>&lt;figure&gt;&lt;img src=&quot;.&quot;&gt;&lt;/figure&gt;</content:encoded></item><item><title><![CDATA[curse_dimensionality]]></title><description><![CDATA[<img src="online-vault/images/curse_dimensionality.png" target="_self">]]></description><link>online-vault/images/curse_dimensionality.html</link><guid isPermaLink="false">Online Vault/Images/curse_dimensionality.png</guid><pubDate>Mon, 23 Mar 2026 15:40:57 GMT</pubDate><enclosure url="." length="0" type="false"/><content:encoded>&lt;figure&gt;&lt;img src=&quot;.&quot;&gt;&lt;/figure&gt;</content:encoded></item><item><title><![CDATA[REGEX basics]]></title><description><![CDATA[
.* = match any characters (0 or more)
^ = start of string
$ = end of string
\d+ = one or more digits
[abc] = any character in brackets
Examples:
".*id_pol.*" - matches any filename containing "id_pol" anywhere
"^id_pol.*" - matches filenames starting with "id_pol"
".*labels$" - matches filenames ending with "labels"
".*id_pol_\d+.*" - matches filenames containing "id_pol_" followed by one or more digits
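The patterns above can be checked with Python's re module (the filenames below are made-up examples):

```python
import re

names = ["stats_id_pol_12_labels", "id_pol_summary", "other_file"]

# re.search finds the pattern anywhere; re.match anchors at the start
matches_any = [n for n in names if re.search(r"id_pol", n)]
starts_with = [n for n in names if re.match(r"^id_pol", n)]
with_number = [n for n in names if re.search(r"id_pol_\d+", n)]
```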
]]></description><link>online-vault/software-engineering/regex-basics.html</link><guid isPermaLink="false">Online Vault/Software Engineering/REGEX basics.md</guid><pubDate>Fri, 20 Mar 2026 13:27:28 GMT</pubDate></item><item><title><![CDATA[Normalization data leak]]></title><description><![CDATA[This is a subtle but important distinction with real consequences:You normalize the entire train set once, then do CV on that pre-normalized train data.The Problem:
Each CV fold's training data influences the scaler, but the scaler was already fit on the entire train set
Within each fold, you're using statistics from data outside that fold
This is a mild form of data leakage — each fold's validation set sees scaler parameters influenced by other folds' data
Impact magnitude: Small to moderate. The leakage is less severe than fitting on train+test combined, but it's still present. The alternative: fit a fresh scaler on each fold's training data only, then transform that fold's validation data. Why it's better:
Each fold is completely independent
Validation fold truly hasn't touched the scaler fitting process
Mimics real-world deployment where future data is unseen during scaler fitting
Impact magnitude: Zero leakage. Most honest performance estimate. For most datasets: The difference is small to negligible (1-3% performance variation)
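The per-fold approach is what a scikit-learn Pipeline inside cross-validation gives you for free: the scaler is re-fit on each fold's training split only, so the validation fold never influences the normalization statistics. A sketch on synthetic data:

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5)) * [1, 10, 100, 1000, 10000]  # very different scales
y = (X[:, 0] + X[:, 1] / 10 > 0).astype(int)

# StandardScaler is fit inside each CV fold, on that fold's training data only
pipe = make_pipeline(StandardScaler(), LogisticRegression())
scores = cross_val_score(pipe, X, y, cv=5)
```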
Your current method already does the hard work: you split before normalizing
The remaining leakage is just "within-train" fold contamination, which is minor
When it becomes significant:
Small datasets (&lt; 1000 samples): The leakage impact increases because folds are tiny and statistics are more unstable
High-dimensional data: Small differences in normalization can propagate more
Highly imbalanced classes: Fold-to-fold variance in statistics is higher
Sensitive models: Tree-based models (RF, XGB) are robust; linear models (LR, SVM) are more sensitive to normalization differences
Your approach is practically acceptable for most real-world scenarios, but it's theoretically suboptimal. If:
Your dataset is reasonably sized (&gt; 5000 samples)
You're using robust models
You just want quick results
→ Your method is fine; the impact is negligible. If you want maximum rigor or have a small dataset, normalizing inside the CV loop is the correct approach, though the practical improvement will likely be modest.]]></description><link>online-vault/ml-concepts/normalization-data-leak.html</link><guid isPermaLink="false">Online Vault/ML concepts/Normalization data leak.md</guid><pubDate>Fri, 20 Mar 2026 10:40:29 GMT</pubDate></item><item><title><![CDATA[KDTree]]></title><description><![CDATA[A <a data-href="KDTree" href="online-vault/ml-concepts/kdtree.html" class="internal-link" target="_self" rel="noopener nofollow">KDTree</a> (k-dimensional tree) is a data structure for answering the question "given a point in N-dimensional space, which of these stored points is nearest to it?" very efficiently — in O(log n) instead of brute-force O(n).<br><a data-tooltip-position="top" aria-label="https://www.youtube.com/watch?v=Glp7THUpGow" rel="noopener nofollow" class="external-link is-unresolved" href="https://www.youtube.com/watch?v=Glp7THUpGow" target="_self">video</a> visual explanation]]></description><link>online-vault/ml-concepts/kdtree.html</link><guid isPermaLink="false">Online Vault/ML concepts/KDTree.md</guid><pubDate>Wed, 18 Mar 2026 16:46:29 GMT</pubDate></item><item><title><![CDATA[Alias Ubuntu Pycharm]]></title><description><![CDATA[To replace launching PyCharm via /opt/pycharm-*/bin/pycharm with the simple command pycharm in your Ubuntu terminal under WSL2, proceed as follows. Open your WSL2 terminal and follow these steps. List the installed versions: ls /opt/pycharm-*/
Note the exact path (for example: /opt/pycharm-2023.3/bin/pycharm.sh). Open the .bashrc or .zshrc file (depending on your shell): nano ~/.bashrc
Add this line at the end of the file (replacing the path with yours): alias pycharm='/opt/pycharm-*/bin/pycharm'
Save and quit (Ctrl+O, Enter, Ctrl+X). Then run: source ~/.bashrc
From now on, just type pycharm in the terminal to launch PyCharm.]]></description><link>online-vault/software-engineering/alias-ubuntu-pycharm.html</link><guid isPermaLink="false">Online Vault/Software Engineering/Alias Ubuntu Pycharm.md</guid><pubDate>Fri, 13 Mar 2026 11:18:41 GMT</pubDate></item><item><title><![CDATA[SHAP feature grouping]]></title><description><![CDATA[When you want to group dependent features for SHAP analysis—especially when you already know which features logically belong together—you can implement this idea in two main ways:
Concept: The PartitionExplainer allows you to define groups of features that are treated as a single "super-feature" during Shapley value computation. This ensures that the contribution of the entire group is evaluated together, rather than splitting importance arbitrarily among dependent features.
How to Apply: Define Groups: Create a list of feature groups, where each group contains the indices or names of features that should be treated as a unit.
Initialize Explainer: Pass your model and background data to shap.PartitionExplainer, together with a masker whose feature clustering encodes your groups (features in the same cluster are masked and attributed together).
Interpret Results: The Shapley values will now reflect the combined contribution of each group, making the output more stable and interpretable for dependent features. Use Case: Ideal for spatial data (e.g., grouping neighboring pixels or grid cells) or derived features (e.g., grouping all features computed from the same source variable). Concept: Instead of using SHAP’s built-in tools, you can pre-process your data to create new "meta-features" that represent the groups. This can be done by aggregating, averaging, or otherwise combining the dependent features before passing them to the model and SHAP.
How to Apply: Create Meta-Features: For each group, compute a summary statistic (e.g., mean, max, sum) or use dimensionality reduction (e.g., PCA) to represent the group as a single feature.
Train Model: Train your model using these meta-features instead of the original dependent features.
Apply SHAP: Use any SHAP explainer (e.g., TreeExplainer, KernelExplainer) on the reduced feature set. Use Case: Useful when you want to simplify the interpretation or when the dependency structure is complex and better captured by a single representative feature. Trade-offs: Grouping features may hide fine-grained insights within the group. Balance between interpretability and granularity.
Validation: After grouping, validate that the model’s performance and the SHAP explanations still align with your expectations and domain knowledge.
Visualization: When using PartitionExplainer, SHAP’s visualization tools (e.g., summary_plot, force_plot) will automatically respect the groups, making it easier to communicate results.
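Besides the two options above, a lightweight post-hoc variant is often used in practice: because SHAP values are additive per prediction, you can run any explainer on the original features and then simply sum the per-feature attributions within each known group. A sketch with a made-up attribution matrix and hypothetical group names:

```python
import numpy as np

# shap_values: (n_samples, n_features) attribution matrix,
# e.g. the output of a TreeExplainer (values here are invented)
shap_values = np.array([[0.2, 0.1, -0.3, 0.4],
                        [0.0, 0.5,  0.1, -0.2]])

# known groups of dependent features, by column index (hypothetical names)
groups = {"spectral": [0, 1], "terrain": [2, 3]}

# per-sample group attribution = sum of member attributions
group_attrib = {name: shap_values[:, idx].sum(axis=1)
                for name, idx in groups.items()}
```

The group totals still sum to the same model output deviation as the original per-feature values, so the additivity property is preserved.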
In practice, if you already know your feature groups, PartitionExplainer is the most straightforward and theoretically sound approach. If you need more flexibility or want to reduce dimensionality, custom grouping via feature engineering is a robust alternative.]]></description><link>online-vault/ml-concepts/shap-feature-grouping.html</link><guid isPermaLink="false">Online Vault/ML concepts/SHAP feature grouping.md</guid><pubDate>Thu, 05 Mar 2026 15:33:35 GMT</pubDate></item><item><title><![CDATA[explainable AI]]></title><link>online-vault/ml-concepts/explainable-ai.html</link><guid isPermaLink="false">Online Vault/ML concepts/explainable AI.md</guid><pubDate>Thu, 05 Mar 2026 15:30:59 GMT</pubDate></item><item><title><![CDATA[SHAP]]></title><description><![CDATA[<a data-tooltip-position="top" aria-label="https://shap.readthedocs.io/en/latest/example_notebooks/overviews/An%20introduction%20to%20explainable%20AI%20with%20Shapley%20values.html" rel="noopener nofollow" class="external-link is-unresolved" href="https://shap.readthedocs.io/en/latest/example_notebooks/overviews/An%20introduction%20to%20explainable%20AI%20with%20Shapley%20values.html" target="_self">SHAP</a> is a python library for <a data-href="explainable AI" href="online-vault/ml-concepts/explainable-ai.html" class="internal-link" target="_self" rel="noopener nofollow">explainable AI</a> with <a data-href="Shapley values" href=".html" class="internal-link" target="_self" rel="noopener nofollow">Shapley values</a> :&nbsp;
Shapley values are a widely used approach from cooperative game theory that come with desirable properties. Shapley values quantify the contribution of each feature to a model’s output, providing a fair and consistent way to explain individual predictions.<br>Warning : Assumes feature independence, which may not hold in practice. see <a data-href="Shapley values for CarHab interpretability" href="carhab-spécifique/shapley-values-for-carhab-interpretability.html" class="internal-link" target="_self" rel="noopener nofollow">Shapley values for CarHab interpretability</a> for more details on this topic.<br>Shapley values originate from <a data-href="cooperative game theory" href=".html" class="internal-link" target="_self" rel="noopener nofollow">cooperative game theory</a>, introduced by Lloyd Shapley in 1951. They provide a method to fairly distribute the total gain generated by a coalition of players among the individual players.<br>In the context of <a data-href="machine learning" href="online-vault/ml-concepts/machine-learning.html" class="internal-link" target="_self" rel="noopener nofollow">machine learning</a>, features of a model can be thought of as players, and the prediction of the model is the total gain to be distributed among these features.In game theory, a coalition is a subset of players, and the value function describes the total gain from the coalition. The Shapley value for a player is the average marginal contribution of that player across all possible coalitions. This ensures a fair distribution based on the individual contributions of each player.<br>In machine learning, Shapley values can be used to explain the output of complex models by attributing the model's prediction to each feature's contribution. 
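The "average marginal contribution across all possible coalitions" definition can be computed exactly for a tiny toy game (the two-player value function below is invented for illustration):

```python
from itertools import permutations

def shapley_values(players, value):
    """Exact Shapley values: average marginal contribution over all orderings."""
    contrib = {p: 0.0 for p in players}
    orders = list(permutations(players))
    for order in orders:
        coalition = frozenset()
        for p in order:
            # marginal contribution of p when joining this coalition
            contrib[p] += value(coalition | {p}) - value(coalition)
            coalition = coalition | {p}
    return {p: c / len(orders) for p, c in contrib.items()}

# toy 2-player game: v({A})=10, v({B})=20, v({A,B})=40 (a synergy of 10)
def v(coalition):
    table = {frozenset(): 0, frozenset("A"): 10,
             frozenset("B"): 20, frozenset("AB"): 40}
    return table[frozenset(coalition)]

phi = shapley_values(["A", "B"], v)  # the synergy is split evenly
```

Here A gets 15 and B gets 25: each keeps its solo value plus half the synergy, and the two shares sum to the full payoff of 40.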
The <a data-tooltip-position="top" aria-label="https://shap.readthedocs.io/en/latest/index.html" rel="noopener nofollow" class="external-link is-unresolved" href="https://shap.readthedocs.io/en/latest/index.html" target="_self">SHAP library</a> (SHapley Additive exPlanations) is a popular tool for computing Shapley values for various machine learning models.
Global Interpretability: Understanding the overall importance of features across the entire dataset.
Local Interpretability: Explaining individual predictions by identifying which features contributed most to a specific prediction.
Shapley values help in understanding which features are most important for a given prediction, making complex models more interpretable. While Shapley values are powerful for model interpretability, they come with certain limitations and assumptions:
Independence Assumption: Shapley values assume that features are independent, which may not hold in practice.
Computational Complexity: Calculating Shapley values can be computationally expensive, especially for models with many features.
Approximation Methods: Different methods like KernelSHAP, TreeSHAP, and DeepSHAP are used to approximate Shapley values, each with its own assumptions and limitations.
Existence and Uniqueness: Shapley values may not be unique or may not exist for certain types of games or models.
By understanding these aspects, practitioners can effectively use Shapley values to interpret their machine learning models while being aware of the potential limitations.]]></description><link>online-vault/ml-concepts/shap.html</link><guid isPermaLink="false">Online Vault/ML concepts/SHAP.md</guid><pubDate>Thu, 05 Mar 2026 15:30:09 GMT</pubDate></item><item><title><![CDATA[L1 Norm]]></title><description><![CDATA[For the disagreement maps interpretationWhen computing disagreement between two maps, L1 and L2 represent different distance metrics that measure how different the values are. Here's the key difference:
<img alt="composite_disagreement_plot_Cote_dOr_21_V2_Biotopes_Modele20210126_vs_Cote_Dor_bayesSearch_xgb_polygonwise_classWeighted_round0_biotopes_L1.png" src="online-vault/images/composite_disagreement_plot_cote_dor_21_v2_biotopes_modele20210126_vs_cote_dor_bayessearch_xgb_polygonwise_classweighted_round0_biotopes_l1.png" target="_self">L1 = Σ |difference_i| Sums the absolute differences directly across all parameters
Each parameter's disagreement is added linearly
Less sensitive to outliers because large differences don't get magnified
Result is in the same units as the original differences
Better for detecting consistent, distributed disagreement across parameters
Example: If you have 3 parameters with differences (1, 2, 3):
L1 = |1| + |2| + |3| = 6
<br><img alt="composite_disagreement_plot_Cote_dOr_21_V2_Biotopes_Modele20210126_vs_Cote_Dor_bayesSearch_xgb_polygonwise_classWeighted_round0_biotopes_L2.png" src="online-vault/images/composite_disagreement_plot_cote_dor_21_v2_biotopes_modele20210126_vs_cote_dor_bayessearch_xgb_polygonwise_classweighted_round0_biotopes_l2.png" target="_self">L2 = √(Σ difference_i²) Squares differences before summing, then takes the square root
Large disagreements get heavily weighted (squared term)
More sensitive to outliers — a few large differences dominate
Result is the geometric distance in multi-dimensional space
Better for detecting localized, extreme disagreement Example: Same 3 parameters with differences (1, 2, 3):
L2 = √(1² + 2² + 3²) = √14 ≈ 3.74
In compute_composite_disagreement():
L1 method: Each pixel shows the sum of all parameter differences → highlights regions where many parameters disagree
L2 method: Each pixel shows the Euclidean norm → emphasizes regions where at least one parameter disagrees significantly
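In NumPy, the two composite maps differ only in how the per-parameter differences are aggregated per pixel (a sketch of the norm logic only; compute_composite_disagreement is the project's own function, and the difference values are invented):

```python
import numpy as np

# per-parameter difference maps, shape (n_params, height, width)
diff = np.array([[[1.0, 0.0]],
                 [[2.0, 0.0]],
                 [[3.0, 4.0]]])  # 3 parameters over a 1x2 "map"

l1 = np.abs(diff).sum(axis=0)          # cumulative disagreement per pixel
l2 = np.sqrt((diff ** 2).sum(axis=0))  # Euclidean norm, emphasizes outliers
```

For the first pixel this reproduces the worked example above: L1 = 6 while L2 = √14 ≈ 3.74.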
Choose L1 if you want to see overall cumulative disagreement; choose L2 if you want to highlight hotspots of severe disagreement in specific parameters.]]></description><link>online-vault/ml-concepts/l1-norm.html</link><guid isPermaLink="false">Online Vault/ML concepts/L1 Norm.md</guid><pubDate>Tue, 03 Mar 2026 14:21:09 GMT</pubDate><enclosure url="." length="0" type="false"/><content:encoded>&lt;figure&gt;&lt;img src=&quot;.&quot;&gt;&lt;/figure&gt;</content:encoded></item><item><title><![CDATA[composite_disagreement_plot_Cote_dOr_21_V2_Biotopes_Modele20210126_vs_Cote_Dor_bayesSearch_xgb_polygonwise_classWeighted_round0_biotopes_L2]]></title><description><![CDATA[<img src="online-vault/images/composite_disagreement_plot_cote_dor_21_v2_biotopes_modele20210126_vs_cote_dor_bayessearch_xgb_polygonwise_classweighted_round0_biotopes_l2.png" target="_self">]]></description><link>online-vault/images/composite_disagreement_plot_cote_dor_21_v2_biotopes_modele20210126_vs_cote_dor_bayessearch_xgb_polygonwise_classweighted_round0_biotopes_l2.html</link><guid isPermaLink="false">Online Vault/Images/composite_disagreement_plot_Cote_dOr_21_V2_Biotopes_Modele20210126_vs_Cote_Dor_bayesSearch_xgb_polygonwise_classWeighted_round0_biotopes_L2.png</guid><pubDate>Tue, 03 Mar 2026 14:21:07 GMT</pubDate><enclosure url="." 
length="0" type="false"/><content:encoded>&lt;figure&gt;&lt;img src=&quot;.&quot;&gt;&lt;/figure&gt;</content:encoded></item><item><title><![CDATA[composite_disagreement_plot_Cote_dOr_21_V2_Biotopes_Modele20210126_vs_Cote_Dor_bayesSearch_xgb_polygonwise_classWeighted_round0_biotopes_L1]]></title><description><![CDATA[<img src="online-vault/images/composite_disagreement_plot_cote_dor_21_v2_biotopes_modele20210126_vs_cote_dor_bayessearch_xgb_polygonwise_classweighted_round0_biotopes_l1.png" target="_self">]]></description><link>online-vault/images/composite_disagreement_plot_cote_dor_21_v2_biotopes_modele20210126_vs_cote_dor_bayessearch_xgb_polygonwise_classweighted_round0_biotopes_l1.html</link><guid isPermaLink="false">Online Vault/Images/composite_disagreement_plot_Cote_dOr_21_V2_Biotopes_Modele20210126_vs_Cote_Dor_bayesSearch_xgb_polygonwise_classWeighted_round0_biotopes_L1.png</guid><pubDate>Tue, 03 Mar 2026 14:21:01 GMT</pubDate><enclosure url="." length="0" type="false"/><content:encoded>&lt;figure&gt;&lt;img src=&quot;.&quot;&gt;&lt;/figure&gt;</content:encoded></item><item><title><![CDATA[L1 Regularization (Lasso)]]></title><description><![CDATA[See also <a data-href="L1 Norm" href="online-vault/ml-concepts/l1-norm.html" class="internal-link" target="_self" rel="noopener nofollow">L1 Norm</a>]]></description><link>online-vault/ml-concepts/l1-regularization-(lasso).html</link><guid isPermaLink="false">Online Vault/ML concepts/L1 Regularization (Lasso).md</guid><pubDate>Tue, 03 Mar 2026 14:17:50 GMT</pubDate></item><item><title><![CDATA[Installer une LLM en local pour un humain local]]></title><description><![CDATA[This guide is aimed at a non-specialist audience who wants to get started with a ChatGPT-style model for free, without depending on the OpenAI web interfaces or on the tech giants (Google, Microsoft, ...),
mainly out of concern for data privacy.The goal is to get you up and running with open-source software that uses free open-source models, all without any technical skills or a specialized computer. Of course, since computing power is a major factor in how fast a model runs, the model size must be adjusted to your machine's capabilities.
A glossary of the terms and acronyms used can be found at the end of this page.Minimum:
A computer with at least 12 GB of RAM
A CPU less than 5 years old
~30 GB of free space on the hard drive Recommended:
An NVIDIA graphics card (GPU) less than 5 years old, with 8 GB of VRAM and the <a data-tooltip-position="top" aria-label="https://developer.nvidia.com/cuda-gpus" rel="noopener nofollow" class="external-link is-unresolved" href="https://developer.nvidia.com/cuda-gpus" target="_self">CUDA</a> architecture.If you do not have an NVIDIA GPU, skip this step; the model can run on the CPU alone without any problem.Otherwise,<br>
Install the <a data-tooltip-position="top" aria-label="https://developer.nvidia.com/cuda-toolkit" rel="noopener nofollow" class="external-link is-unresolved" href="https://developer.nvidia.com/cuda-toolkit" target="_self">CUDA Toolkit</a> so that the GPU can be used through CUDA, following these steps:<br><img alt="cuda_toolkit.png" src="online-vault/tutoriels/cuda_toolkit.png" target="_self">
Follow the instructions of the .exe installer after entering an administrator password. Click Next until you reach this step, then tick the box:<br>
<img alt="visualstudio_cuda.png" src="online-vault/tutoriels/visualstudio_cuda.png" target="_self">
Follow the instructions until the installation is complete.
That's it: you can now use your graphics card's capabilities for deep learning models!
<br>Download <a rel="noopener nofollow" class="external-link is-unresolved" href="https://jan.ai/" target="_self">https://jan.ai/</a>, choosing your operating system (auto-detected by default):<br>
<img alt="jan_download.png" src="online-vault/tutoriels/jan_download.png" target="_self">Then run the .exe installer:<br>
<img alt="install_Jan.png" src="online-vault/tutoriels/install_jan.png" target="_self">Once it finishes, you land on the Jan home page, with a hub of models available to download onto your machine:<br>
<img alt="hub ja.png" src="online-vault/tutoriels/hub-ja.png" target="_self">
The application automatically analyzes your system's memory capacity in order to recommend models. Those marked in green as Recommended are the ones that can run on your machine.Two types of models are accessible here:
Open-source models that can be downloaded locally (onto your machine), for example Gemma 7B Q4 here (an open-source Google model); these come with the blue Download button.
Closed models, which can be accessed with an API key to interact with them as if you were on their website. Unlike a local model, this requires an internet connection, and your chat data is also sent to the model provider. For example, you can use an OpenAI API key to chat with ChatGPT-4 inside the Jan interface. They are shown with a Use button even though they are not installed.
PS: the models in the example above such as Llama 3 8B Q4 show the Use button because I have already installed them on my machine.You can click the arrow to unfold a model's card, which describes its specialty:<br><img alt="fiches_models.png" src="online-vault/tutoriels/fiches_models.png" target="_self">
Here we can see that the Aya 23 model was designed and trained to be multilingual, whereas Llama 3 is also multilingual but aims to be a generalist and a good coder.8B: 8 Billion = 8 billion parameters =&gt; the model's size in terms of parameters.<br>
Q4: quantized to 4 bits =&gt; the precision of the original model was reduced from 32-bit parameters (float32) to 4 bits (INT4), which makes it 8 times lighter. This process is called <a data-tooltip-position="top" aria-label="https://huggingface.co/docs/optimum/en/concept_guides/quantization" rel="noopener nofollow" class="external-link is-unresolved" href="https://huggingface.co/docs/optimum/en/concept_guides/quantization" target="_self">Quantization</a>. It is one of the model-compression techniques that make it possible to run powerful models on computers with limited memory, with a small loss in quality (which varies with the method).Once a model is chosen, click Download to start the download. If an error appears, it is most likely the proxy blocking internet access.Go to the settings button ⚙ at the bottom left of the interface and click Advanced Settings:<br><img alt="proxy_params.png" src="online-vault/tutoriels/proxy_params.png" target="_self">In the "HTTPS Proxy" field, enter your proxy address (here, UJM):<br>
<a rel="noopener nofollow" class="external-link is-unresolved" href="http://cache.univ-st-etienne.fr:XXXX" target="_self">http://cache.univ-st-etienne.fr:XXXX</a>, replacing XXXX with the port number.Once the proxy is configured, downloads will work!On this same page, if you have one, you can choose the graphics card to use under GPU Acceleration, which must be enabled with the blue toggle on the right; then pick the GPU from the drop-down menu (here, NVIDIA RTX 4000 ...).You can also choose where Jan downloads models and stores conversations, via Jan Data Folder.<br><img alt="chat_jan.png" src="online-vault/tutoriels/chat_jan.png" target="_self">This Chat page is organized into three main parts:
On the left, the list of past conversations.
In the center, the conversation between the user and the language model.
On the right, the model's inference parameters.
<br><img alt="paramètres de modèle.png" src="online-vault/tutoriels/paramètres-de-modèle.png" target="_self">The Model tab lets us adjust the model's inference settings under Inference Parameters. Each parameter is explained by hovering over the information icon ℹ.There are also the model's settings under Model Parameters, as well as further settings under Engine Parameters:<br>
<img alt="engine_params_jan.png" src="online-vault/tutoriels/engine_params_jan.png" target="_self">These two parameters strongly influence how quickly the model responds, in particular:
Context Length: the number of previous tokens "seen" at the moment the next token of the sentence is generated.
The larger the Context Length, the more memory is used and the slower the response, but the better informed the model is about what was said previously.🎉 Congratulations, you have finished this tutorial! You can now run any open-source model suited to your machine! 🎉<br>To continue with this guide and do more advanced but easy-to-use <a data-href="Retrieval-Augmented Generation" href="online-vault/ml-concepts/retrieval-augmented-generation.html" class="internal-link" target="_self" rel="noopener nofollow">Retrieval-Augmented Generation</a> with local models using GPT4All, follow the guide <a data-href="RAG simple, local et Open-Source avec GPT4All" href="online-vault/tutoriels/rag-simple,-local-et-open-source-avec-gpt4all.html" class="internal-link" target="_self" rel="noopener nofollow">RAG simple, local et Open-Source avec GPT4All</a><br>To use LM Studio and DeepSeek: follow the guide <a data-href="Installer DeepSeek R1 Distill en local" href="online-vault/tutoriels/installer-deepseek-r1-distill-en-local.html" class="internal-link" target="_self" rel="noopener nofollow">Installer DeepSeek R1 Distill en local</a>LLM:
Large Language Model (e.g. ChatGPT)locally (en local):
Installed on the physical machine rather than on a remote server (cloud / web interface)GPU / Graphics card:
Graphics Processing Unit =&gt; a compute unit dedicated to matrix computation, used among other things to train and run LLM models.CPU / Processor:
The primary compute unit of your computer, slower than the GPU for running an LLM but sufficient for small models.CUDA (Compute Unified Device Architecture):
A parallel-computing platform developed by NVIDIA that lets programmers harness the parallel processing power of GPUs. It provides a set of tools and APIs to take advantage of the compute capabilities of GPUs, thereby accelerating complex computing tasks. CUDA enables efficient use of GPUs for parallel processing, making it a powerful tool in high-performance computing.API (Application Programming Interface):
A defined set of routines, protocols, and tools that let different applications communicate and interact with one another. It is an interface that specifies how software should interact to exchange data and use the features of another program or system.
APIs simplify software development by providing a set of rules and tools that developers can use to access the features of a system or platform.API key:
An API key is a unique sequence of letters and digits used to identify and authenticate a particular application or service. It is a character string that lets your application access an API (application programming interface) securely. API keys are often used to verify your application's identity and allow you to use the features or data provided by the API.tokens: In machine learning and natural language processing, tokens are basic units used to represent text or data.
They can be words, characters, or even subwords, depending on the tokenization method used.
Tokens are used to build numerical representations of language, which can then be used to train language models or to perform tasks such as text classification or language generation.Other platform:<br>
LM Studio : <a data-href="Setting up LM Studio" href="online-vault/tutoriels/setting-up-lm-studio.html" class="internal-link" target="_self" rel="noopener nofollow">Setting up LM Studio</a>]]></description><link>online-vault/tutoriels/installer-une-llm-en-local-pour-un-humain-local.html</link><guid isPermaLink="false">Online Vault/Tutoriels/Installer une LLM en local pour un humain local.md</guid><pubDate>Thu, 05 Feb 2026 15:37:28 GMT</pubDate><enclosure url="." length="0" type="false"/><content:encoded>&lt;figure&gt;&lt;img src=&quot;.&quot;&gt;&lt;/figure&gt;</content:encoded></item><item><title><![CDATA[Spatial Cross-validation]]></title><description><![CDATA[<img alt="spatial_CV.png" src="online-vault/spatial-data-science/spatial_cv.png" target="_self"><br>
(<a data-tooltip-position="top" aria-label="https://mlr.mlr-org.com/articles/tutorial/handling_of_spatial_data.html" rel="noopener nofollow" class="external-link is-unresolved" href="https://mlr.mlr-org.com/articles/tutorial/handling_of_spatial_data.html" target="_self">source</a>)<br>
<img alt="spatial_cv_schema.png" src="online-vault/spatial-data-science/spatial_cv_schema.png" target="_self"> (<a data-tooltip-position="top" aria-label="https://www.researchgate.net/figure/Concept-of-random-and-spatial-cross-validation-CV-A-total-dataset-here-9-different_fig3_335318909" rel="noopener nofollow" class="external-link is-unresolved" href="https://www.researchgate.net/figure/Concept-of-random-and-spatial-cross-validation-CV-A-total-dataset-here-9-different_fig3_335318909" target="_self">source</a>)<br>
Spatial Cross-validation is a subset of the <a data-href="Cross-validation" href=".html" class="internal-link" target="_self" rel="noopener nofollow">Cross-validation</a> method, used to avoid <a data-href="Data Leakage" href="online-vault/ml-concepts/data-leakage.html" class="internal-link" target="_self" rel="noopener nofollow">Data Leakage</a>: the phenomenon where the training dataset shares examples with the test/validation datasets, which can lead to <a data-href="overfitting" href="online-vault/ml-concepts/overfitting.html" class="internal-link" target="_self" rel="noopener nofollow">overfitting</a> of the model.In the spatial case, samples that lie close to each other are spatially correlated, which can inflate the model's apparent performance while hindering generalization.In more detail, the expected predictive performance of a model on unseen observations can be estimated using independent samples. For this, models are most often cross-validated, meaning that available samples are used alternately to either train or validate the model. However, the fact that an observation was not used for training a model does not necessarily imply that this observation is truly independent from the training data. A dependence between observations can already arise through their spatial proximity, since usually&nbsp;nearby things are more related than distant things.<br>from: <a data-tooltip-position="top" aria-label="https://www.nature.com/articles/s41467-020-18321-y#data-availability" rel="noopener nofollow" class="external-link is-unresolved" href="https://www.nature.com/articles/s41467-020-18321-y#data-availability" target="_self">Spatial validation reveals poor predictive performance of large-scale ecological mapping models</a><br>
<img alt="spatial_kfold.png" src="online-vault/images/spatial_kfold.png" target="_self">
Figure: Workflow of model cross-validation strategies. Schematic illustration of the three strategies used for model cross-validation (random&nbsp;K-fold CV, spatial&nbsp;K-fold CV, buffered leave-one-out CV). In&nbsp;K-fold CVs, observations within the same fold are represented with a similar symbol and color. In the B-LOO CV, the test observation is represented as a red dot, training observations as grey squares, and observations within the exclusion buffer (black circle) are crossed.The first strategy corresponded to a common&nbsp;K-fold cross-validation whereby observations were randomly split into&nbsp;K&nbsp;sets (random&nbsp;K-fold CV), ignoring any structure of spatial dependence in the data. Model training was then performed iteratively on&nbsp;K-1 sets, each time withholding a different set for testing. The vector of so-called independent AGB predictions was then used to generate CV statistics, namely, the squared Pearson’s correlation between observed AGB values and AGB predictions (noted&nbsp;R²) and the root mean squared prediction error (RMSPE). Here, we used&nbsp;K = 10, a common choice made by modelers.<br>The second strategy, i.e., spatial&nbsp;K-fold CV, differs from the random&nbsp;K-fold CV in the way observations are split into spatially structured sets. Here, the objective is to group observations into spatially homogeneous clusters of larger size than the range of autocorrelation in the data to achieve independence between CV folds. 
Spatial clusters were generated using a hierarchical cluster analysis (see : <a data-href="Decision Tree vs Hierarchical clustering" href="online-vault/ml-concepts/decision-tree-vs-hierarchical-clustering.html" class="internal-link" target="_self" rel="noopener nofollow">Decision Tree vs Hierarchical clustering</a>) (complete linkage method) of the distance matrix of pixel geographical coordinates and a clustering height (i.e., the maximum distance between pixels within each cluster) of&nbsp;H = 150 km, i.e., a slightly longer distance than the range of autocorrelation of forest AGB (Fig.&nbsp;<a data-tooltip-position="top" aria-label="https://www.nature.com/articles/s41467-020-18321-y#Fig2" rel="noopener nofollow" class="external-link is-unresolved" href="https://www.nature.com/articles/s41467-020-18321-y#Fig2" target="_self">2a</a>).The third strategy, i.e., the buffered leave-one-out cross-validation (B-LOO CV), is inspired by the leave-one-out cross-validation scheme, in that a single observation is withheld for model testing per model run. In the case of B-LOO CV, however, observations within a distance&nbsp;r&nbsp;from the test observation are excluded from the model training set. Training and testing the model for a range of&nbsp;r&nbsp;values allows investigation of the influence of spatial proximity between test and training observations on model prediction error.<br>
Here, we considered 16&nbsp;r&nbsp;values (from 0 to 150 km by 10 km); hence, the model was calibrated and tested 16 times per test observation. To generate B-LOO CV statistics presented in Fig.&nbsp;<a data-tooltip-position="top" aria-label="https://www.nature.com/articles/s41467-020-18321-y#Fig5" rel="noopener nofollow" class="external-link is-unresolved" href="https://www.nature.com/articles/s41467-020-18321-y#Fig5" target="_self">5a, b</a>, we (1) generated AGB predictions for 100 randomly selected test observations (i.e., 1600 model runs), allowing computation of the model’s&nbsp;R²&nbsp;and RMSPE for each&nbsp;r&nbsp;value, and (2) repeated this procedure ten times (i.e., 16,000 model runs) to provide the average and standard deviation of CV statistics over the 10 repetitions. It is worth noting that we integrated a safeguard against predictive extrapolation within the iterative B-LOO CV procedure.
Because the geographical and environmental spaces are closely linked, removing training observations in the spatial neighborhood of a test observation may remove the environmental (and optical) conditions found at that test location from the model’s calibration domain. The model’s prediction at that test location would thus amount to a case of predictive extrapolation (i.e., forcing the model to predict outside the calibration domain), leading to an inflation of model error that we did not intend to consider here. For each randomly selected test observation, we thus first removed neighboring observations at the largest&nbsp;r&nbsp;value and verified that optical and environmental conditions at the test location still fell within the range of values found in the model’s calibration domain. If not, we discarded the test observation and randomly selected a new observation.<br><a data-tooltip-position="top" aria-label="https://www.sciencedirect.com/science/article/pii/S0304380021002489" rel="noopener nofollow" class="external-link is-unresolved" href="https://www.sciencedirect.com/science/article/pii/S0304380021002489" target="_self">Spatial cross-validation is not the right way to evaluate map accuracy</a><br><img alt="random_sampling_strats.png" src="online-vault/images/random_sampling_strats.png" target="_self"><br>Fig. 
1.&nbsp;Overview of the study area and results of the evaluation of validation strategies.&nbsp;a&nbsp;Study area in the&nbsp;<a data-tooltip-position="top" aria-label="https://www.sciencedirect.com/topics/earth-and-planetary-sciences/amazon" rel="noopener nofollow" class="external-link is-unresolved" title="Learn more about Amazon from ScienceDirect's AI-generated Topic Pages" href="https://www.sciencedirect.com/topics/earth-and-planetary-sciences/amazon" target="_self">Amazon</a>&nbsp;basin with values of the above-ground biomass, according to the Baccini map&nbsp;(<a data-tooltip-position="top" aria-label="https://www.sciencedirect.com/science/article/pii/S0304380021002489#b1" rel="noopener nofollow" class="external-link is-unresolved" href="https://www.sciencedirect.com/science/article/pii/S0304380021002489#b1" target="_self">Baccini et al., 2012</a>).&nbsp;b–d&nbsp;Error in estimates of the population RMSE (in Mgha−1) for calibration samples collected by systematic random (b), simple random (c) and two-stage cluster random (d) sampling. Note that the horizontal grey line at 0 in&nbsp;b–d&nbsp;effectively refers to the population RMSE, because deviations from the population RMSE are plotted. The sampling locations shown in the maps in&nbsp;b–d&nbsp;are one realization out of 500.Probability sampling is a technique in which the researcher chooses samples from a larger population using a method based on probability theory.
Simple Random Sampling:&nbsp;This method involves randomly selecting a sample from the population without any bias. It’s the most basic and straightforward form of probability sampling.
Stratified random Sampling:&nbsp;This method involves dividing the population into subgroups or strata and selecting a random sample from each stratum. This technique is useful when the population is heterogeneous and you want to ensure that the sample is representative of different subgroups.
Cluster Sampling:&nbsp;This method involves dividing the population into groups or clusters and then randomly selecting some of those clusters. This technique is useful when the population is spread out over a large geographical area and it is not possible or practical to survey everyone.
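As a minimal NumPy sketch (toy population; the array names and sizes are illustrative, not from the source), the three probability-sampling strategies above differ only in what is drawn at random:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy population: 90 units tagged with one of 9 regions
# (the region plays the role of a stratum or a cluster).
region = np.repeat(np.arange(9), 10)
units = np.arange(region.size)

# Simple random sampling: draw units uniformly, without replacement.
simple = rng.choice(units, size=9, replace=False)

# Stratified random sampling: draw one unit from every region (stratum).
stratified = np.array([rng.choice(units[region == r]) for r in range(9)])

# Cluster sampling: draw whole regions, then keep every unit inside them.
chosen_regions = rng.choice(np.arange(9), size=3, replace=False)
cluster = units[np.isin(region, chosen_regions)]
```

Stratified sampling guarantees every region is represented, while cluster sampling surveys only the drawn regions exhaustively, which is cheaper when the population is spread over a large area.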
Polygon-based spatial separation also constrains which resampling methods can be used. SMOTE, for instance, creates synthetic samples by interpolating between nearest neighbors: if it pairs a minority pixel from Polygon A with a neighbor from Polygon B,
it creates a synthetic point that blends characteristics from both polygons.
During cross-validation with polygon-wise splits, such a synthetic point carries information from the test polygon, causing data leakage.
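Polygon-wise splits can be enforced with scikit-learn's GroupKFold, which keeps all pixels of a polygon in the same fold; a minimal sketch on synthetic pixels (variable names are illustrative; any oversampling such as SMOTE would then have to run inside each training fold only):

```python
import numpy as np
from sklearn.model_selection import GroupKFold

rng = np.random.default_rng(0)
X = rng.normal(size=(120, 4))              # 120 pixels, 4 spectral features
y = rng.integers(0, 2, size=120)           # binary class labels
polygon_id = np.repeat(np.arange(12), 10)  # 12 polygons x 10 pixels each

# GroupKFold never places pixels of the same polygon in both train and test.
cv = GroupKFold(n_splits=4)
for train_idx, test_idx in cv.split(X, y, groups=polygon_id):
    train_polys = set(polygon_id[train_idx])
    test_polys = set(polygon_id[test_idx])
    assert train_polys.isdisjoint(test_polys)  # no polygon on both sides
```

Each test fold here holds 3 whole polygons (30 pixels), so no pixel-level neighbor of a test sample can leak into training through a shared polygon.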
]]></description><link>online-vault/spatial-data-science/spatial-cross-validation.html</link><guid isPermaLink="false">Online Vault/Spatial Data Science/Spatial Cross-validation.md</guid><pubDate>Fri, 16 Jan 2026 14:22:59 GMT</pubDate><enclosure url="." length="0" type="false"/><content:encoded>&lt;figure&gt;&lt;img src=&quot;.&quot;&gt;&lt;/figure&gt;</content:encoded></item><item><title><![CDATA[Model Quantization]]></title><description><![CDATA[Quantization is a technique to reduce the computational and memory costs of running inference by representing the weights and activations with low-precision data types like 8-bit integer (int8) instead of the usual 32-bit floating point (float32).Reducing the number of bits means the resulting model requires less memory storage, consumes less energy (in theory), and operations like matrix multiplication can be performed much faster with integer arithmetic. It also allows to run models on embedded devices, which sometimes only support integer data types.K-quantization options are labeled "S", "M", and "L" and stand for&nbsp;small,&nbsp;medium, and&nbsp;large&nbsp;model sizes, respectively. Option "0" represents&nbsp;baseline quantization&nbsp;without extra calibration. In terms of quality and speed:&nbsp;0&nbsp;(lowest quality, fastest speed) &lt;&nbsp;S&nbsp;&lt;&nbsp;M&nbsp;&lt;&nbsp;L&nbsp;(highest quality, slowest speed). <a data-tooltip-position="top" aria-label="https://github.com/ggml-org/llama.cpp/pull/1684" rel="noopener nofollow" class="external-link is-unresolved" href="https://github.com/ggml-org/llama.cpp/pull/1684" target="_self">details</a> The basic idea behind quantization is quite easy: going from high-precision representation (usually the regular 32-bit floating-point) for weights and activations to a lower precision data type. The most common lower precision data types are:
float16, accumulation data type&nbsp;float16
bfloat16, accumulation data type&nbsp;float32
int16, accumulation data type&nbsp;int32
int8, accumulation data type&nbsp;int32
The accumulation data type specifies the type of the result of accumulating (adding, multiplying, etc) values of the data type in question. For example, let’s consider two&nbsp;int8&nbsp;values&nbsp;A = 127,&nbsp;B = 127, and let’s define&nbsp;C&nbsp;as the sum of&nbsp;A&nbsp;and&nbsp;B: C = A + B = 254. Here the result is much bigger than the biggest representable value in&nbsp;int8, which is&nbsp;127. Hence the need for a larger precision data type to avoid a huge precision loss that would make the whole quantization process useless.The two most common quantization cases are&nbsp;float32 -&gt; float16&nbsp;and&nbsp;float32 -&gt; int8.Performing quantization to go from&nbsp;float32&nbsp;to&nbsp;float16&nbsp;is quite straightforward since both data types follow the same representation scheme. The questions to ask yourself when quantizing an operation to&nbsp;float16&nbsp;are:
Does my operation have a&nbsp;float16&nbsp;implementation?
<br>Does my hardware support&nbsp;float16? For instance, Intel CPUs&nbsp;<a data-tooltip-position="top" aria-label="https://scicomp.stackexchange.com/a/35193" rel="noopener nofollow" class="external-link is-unresolved" href="https://scicomp.stackexchange.com/a/35193" target="_self">have been supporting&nbsp;<code>float16</code>&nbsp;as a storage type, but computation is done after converting to&nbsp;<code>float32</code></a>. Full support will come in Cooper Lake and Sapphire Rapids.
Is my operation sensitive to lower precision? For instance the value of epsilon in&nbsp;LayerNorm&nbsp;is usually very small (~&nbsp;1e-12), but the smallest representable value in&nbsp;float16&nbsp;is ~&nbsp;6e-5, this can cause&nbsp;NaN&nbsp;issues. The same applies for big values.
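The accumulation example above (two int8 values A = 127 and B = 127) can be checked directly in NumPy; the variable names are illustrative:

```python
import numpy as np

A = np.int8(127)
B = np.int8(127)

# Accumulating in int8 wraps around: 254 does not fit in [-128, 127],
# so the sum comes out as -2 in two's complement.
with np.errstate(over="ignore"):
    c_int8 = A + B
# Accumulating in a wider type (int32) preserves the exact sum.
c_int32 = np.int32(A) + np.int32(B)
```

This is exactly why an accumulation data type wider than the storage type is needed for int8 quantization.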
Performing quantization to go from&nbsp;float32&nbsp;to&nbsp;int8&nbsp;is trickier. Only 256 values can be represented in&nbsp;int8, while&nbsp;float32&nbsp;can represent a very wide range of values. The idea is to find the best way to project our range&nbsp;[a, b]&nbsp;of&nbsp;float32&nbsp;values to the&nbsp;int8&nbsp;space.Let’s consider a float&nbsp;x&nbsp;in&nbsp;[a, b], then we can write the following quantization scheme, also called the&nbsp;affine quantization scheme: x = S * (x_q - Z), where:
x_q&nbsp;is the quantized&nbsp;int8&nbsp;value associated to&nbsp;x
S&nbsp;and&nbsp;Z&nbsp;are the quantization parameters S&nbsp;is the scale, and is a positive&nbsp;float32
Z&nbsp;is called the zero-point, it is the&nbsp;int8&nbsp;value corresponding to the value&nbsp;0&nbsp;in the&nbsp;float32&nbsp;realm. This is important to be able to represent exactly the value&nbsp;0&nbsp;because it is used everywhere throughout machine learning models. The quantized value&nbsp;x_q&nbsp;of&nbsp;x&nbsp;in&nbsp;[a, b]&nbsp;can be computed as follows: x_q = round(x/S + Z). And&nbsp;float32&nbsp;values outside of the&nbsp;[a, b]&nbsp;range are clipped to the closest representable value, so for any floating-point number&nbsp;x: x_q = clip(round(x/S + Z), round(a/S + Z), round(b/S + Z)). Usually&nbsp;round(a/S + Z)&nbsp;corresponds to the smallest representable value in the considered data type, and&nbsp;round(b/S + Z)&nbsp;to the biggest one. But this can vary, for instance when using a&nbsp;symmetric quantization scheme&nbsp;as you will see in the next paragraph.]]></description><link>online-vault/ml-concepts/model-quantization.html</link><guid isPermaLink="false">Online Vault/ML concepts/Model Quantization.md</guid><pubDate>Wed, 03 Dec 2025 10:44:01 GMT</pubDate></item><item><title><![CDATA[spatial_cv_schema]]></title><description><![CDATA[<img src="online-vault/spatial-data-science/spatial_cv_schema.png" target="_self">]]></description><link>online-vault/spatial-data-science/spatial_cv_schema.html</link><guid isPermaLink="false">Online Vault/Spatial Data Science/spatial_cv_schema.png</guid><pubDate>Mon, 01 Dec 2025 14:30:07 GMT</pubDate><enclosure url="." 
length="0" type="false"/><content:encoded>&lt;figure&gt;&lt;img src=&quot;.&quot;&gt;&lt;/figure&gt;</content:encoded></item><item><title><![CDATA[Principal Component Analysis]]></title><description><![CDATA[<a data-tooltip-position="top" aria-label="https://crunchingthedata.com/when-to-use-principal-component-analysis/" rel="noopener nofollow" class="external-link is-unresolved" href="https://crunchingthedata.com/when-to-use-principal-component-analysis/" target="_self">Source</a> , <a data-tooltip-position="top" aria-label="http://www.nlpca.org/pca_principal_component_analysis.html" rel="noopener nofollow" class="external-link is-unresolved" href="http://www.nlpca.org/pca_principal_component_analysis.html" target="_self">image source</a><br><img alt="PCA.png" src="online-vault/images/pca.png" target="_self"><br>What kind of dataset should PCA be used on? PCA is an <a data-href="unsupervised learning" href="online-vault/ml-concepts/unsupervised-learning.html" class="internal-link" target="_self" rel="noopener nofollow">unsupervised learning</a> algorithm which means that it does not require there to be a specific outcome variable you are trying to predict in your dataset. Instead, PCA is used when you have a set of features and you want to reduce the dimensionality of your feature set. This simply means that you want to condense as much of the information in your input features as possible into a smaller set of transformed features. In particular, PCA is intended to be used when you have a set of numeric features you want to condense.What are the main advantages and disadvantages of PCA? Here are some advantages and disadvantages you should keep in mind when deciding whether to use PCA.
Guaranteed to produce uncorrelated features. No matter how highly correlated the input features that go into your PCA model are, the transformed features that come out of the model are guaranteed to be uncorrelated. This is a big advantage as correlated features tend to cause problems for a lot of machine learning algorithms.
Relatively fast. Another advantage of PCA is that it is relatively fast compared to other dimensionality reduction techniques. PCA makes use of simple linear algebra computations that are easy for computers to handle. That means it is a good option when you have a large dataset with many observations.
Not sensitive to choice of seed. Another advantage of PCA is that it is not sensitive to the choice of seed or any other initialization conditions. PCA is a deterministic algorithm, which means that it will always produce the same result when applied to the same dataset.
No hyperparameters. Another advantage of PCA is that there are no hyperparameters that need to be tuned. This means that you do not have to go through the additional step of hyperparameter tuning when applying PCA to your data.
Popular and well studied. PCA is one of the most common dimensionality reduction techniques out there, which means that many data scientists are familiar with it. This means that it will be easier for collaborators to contribute to projects that use PCA than it would be for them to contribute to projects that use more obscure algorithms. Assumes relationships between features are linear.&nbsp;One of the main disadvantages of PCA is that it makes the assumption that the relationships between the different features in the input data are linear. This means that it may not perform well in situations where the relationships between features are non-linear
Does not necessarily preserve local structure of data. PCA does not necessarily preserve the local structure of your data. This means that observations that are close together in the original features space will not necessarily be close together in the transformed features space. This can be a problem if you want to apply something like clustering or visualization techniques to the data.
Need to rescale features. Another disadvantage of PCA is that it is sensitive to scale. That means that you may need to rescale your features before you apply PCA to them.
Sensitive to outliers. Another disadvantage of PCA is that it is sensitive to outliers. If there are outliers in your dataset, they may have an oversized effect on the model. You will end up with transformed features that are more representative of a few outlying points than the bulk of the data.
Cannot handle missing values. Traditional PCA cannot handle missing data, so you may have to impute or otherwise preprocess missing values first. Some extensions of PCA can handle missing values, but they may not be available in common machine learning libraries.
Only suitable for continuous data. PCA is designed for continuous variables. If your dataset mixes continuous and categorical variables, consider other dimensionality reduction methods.
Does not perform well when input features are not correlated. If no information is shared between features, there is nothing for the algorithm to compress into fewer features.
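Several of the caveats above (rescaling, missing values) are routinely handled by preprocessing before PCA. A minimal sketch, assuming scikit-learn is available (the data here is synthetic):

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5)) * [1.0, 10.0, 100.0, 1.0, 1.0]  # wildly different scales
X[rng.random(X.shape) < 0.05] = np.nan                        # ~5% missing values

# Impute missing values, standardize each feature, then project to 2 components.
pipe = make_pipeline(SimpleImputer(strategy="mean"),
                     StandardScaler(),
                     PCA(n_components=2))
Z = pipe.fit_transform(X)
print(Z.shape)  # (200, 2)
```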
When should you use principal component analysis rather than another dimensionality reduction technique? Here are some situations where PCA is a good choice.
Many correlated features. If your dataset contains many correlated features and you want to apply an algorithm that performs poorly on correlated inputs, this is a great use case for PCA. Apply PCA to the set of correlated features and replace them with the transformed features produced by the PCA model. The transformed features are uncorrelated with one another no matter how highly correlated the input features were.
Quick and easy dimension reduction. PCA is a great choice when you need a quick and easy dimension reduction technique for a prototype or proof of concept. The model is deterministic and has no hyperparameters to tune, so you apply it to your data once and you are done.
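The decorrelation claim can be checked empirically: PCA's transformed features are uncorrelated (in-sample) however correlated the inputs are. A minimal sketch with synthetic data, assuming scikit-learn and NumPy:

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
x = rng.normal(size=500)
# Three strongly correlated input features built from the same signal.
X = np.column_stack([x,
                     2 * x + 0.1 * rng.normal(size=500),
                     -x + 0.1 * rng.normal(size=500)])

Z = PCA(n_components=3).fit_transform(X)

# Off-diagonal correlations of the transformed features are ~0.
corr = np.corrcoef(Z, rowvar=False)
off_diag = corr - np.diag(np.diag(corr))
print(np.abs(off_diag).max())  # close to 0
```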
When should you avoid using principal component analysis? Here are some situations where another technique is likely the better choice.
<br>Features are not linearly related. Principal component analysis performs best when it is applied to a dataset where all of the features are linearly related. If you do not think that the features in your dataset are linearly related, you may be better off using a dimensionality reduction technique that makes fewer assumptions about the data. For example,&nbsp;<a data-tooltip-position="top" aria-label="https://crunchingthedata.com/when-to-use-t-sne/" rel="noopener nofollow" class="external-link is-unresolved" href="https://crunchingthedata.com/when-to-use-t-sne/" target="_self">t-sne</a>&nbsp;is an example of a non-parametric algorithm that makes fewer assumptions about the structure of the data.
<br>Visualizing data. If the primary reason you want to reduce the number of dimensions in your data is so that you can visualize your data, you are generally better off using an algorithm like&nbsp;<a data-tooltip-position="top" aria-label="https://crunchingthedata.com/when-to-use-t-sne/" rel="noopener nofollow" class="external-link is-unresolved" href="https://crunchingthedata.com/when-to-use-t-sne/" target="_self">t-sne</a>&nbsp;that preserves local relationships in the data. Algorithms that preserve local relationships try to ensure that observations that are close together in the input feature space are also close together in the transformed feature space, which is what you want if you are trying to visualize data. PCA focuses more on preserving global trends in the data and less on preserving local relationships between specific points.
Need interpretable features. Most dimension reduction techniques produce features that do not have a straightforward interpretation. If you need all of the features in your dataset to be directly interpretable, you may be better off using feature selection techniques instead of traditional dimensionality reduction techniques to reduce the size of your dataset.
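For comparison, here is a rough sketch of both reductions side by side on scikit-learn's bundled digits data (only a subsample, to keep t-SNE fast); this is illustrative, not a benchmark:

```python
import numpy as np
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE

X, y = load_digits(return_X_y=True)
X, y = X[:300], y[:300]  # subsample so t-SNE runs quickly

# PCA: fast, deterministic, preserves global variance structure.
Z_pca = PCA(n_components=2).fit_transform(X)

# t-SNE: slower and stochastic, but preserves local neighborhoods,
# which typically yields visually tighter class clusters in 2D.
Z_tsne = TSNE(n_components=2, random_state=0).fit_transform(X)

print(Z_pca.shape, Z_tsne.shape)  # (300, 2) (300, 2)
```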
1. Explained Variance (explained_variance_)
Description:&nbsp;This attribute is an array of shape&nbsp;(n_components,)&nbsp;that stores the amount of variance explained by each of the selected principal components (PCs).
Access:&nbsp;You can access it using the following code:
Python
pca_model = PCA(...)  # Create your PCA model
explained_variance = pca_model.explained_variance_
Interpretation:&nbsp;Each element in the&nbsp;explained_variance&nbsp;array represents the variance captured by a corresponding PC. Higher values indicate that the PC captures more variability in the original data.
2. Explained Variance Ratio (explained_variance_ratio_)
Description:&nbsp;This attribute is an array of shape&nbsp;(n_components,)&nbsp;that stores the fraction of the total variance explained by each of the selected PCs.
Access:&nbsp;You can access it using the following code:
Python
pca_model = PCA(...)  # Create your PCA model
explained_variance_ratio = pca_model.explained_variance_ratio_
Interpretation:&nbsp;Each element in the&nbsp;explained_variance_ratio&nbsp;array represents the fraction of the total variance captured by a corresponding PC. These values lie between 0 and 1, and the sum over all ratios equals 1 when every component is kept.
3. Connection to Compression Rate
Both explained_variance_ and explained_variance_ratio_ help you assess the compression achieved by PCA. To understand how:
Compression in PCA:&nbsp;PCA aims to capture the most significant variations in the data using a smaller number of dimensions (PCs). This essentially compresses the data by discarding directions that contribute less to the overall variance.
Using&nbsp;explained_variance_:&nbsp;By summing the values in the&nbsp;explained_variance_&nbsp;array, you get the total variance captured by the selected PCs. Dividing this sum by the total variance of the original data (computed with&nbsp;n_samples - 1&nbsp;degrees of freedom, the same convention scikit-learn uses for these attributes) gives you the&nbsp;proportion of variance retained&nbsp;after applying PCA.
Using&nbsp;explained_variance_ratio_:&nbsp;Directly sum the elements in the&nbsp;explained_variance_ratio_&nbsp;array to get the&nbsp;cumulative fraction of variance explained&nbsp;by the chosen PCs. This value reflects the combined impact of the chosen PCs on the data's variability.
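As a concrete sketch of that bookkeeping, using scikit-learn's bundled iris data:

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA

X, _ = load_iris(return_X_y=True)  # 150 samples, 4 features
pca = PCA().fit(X)                 # keep all 4 components

# Cumulative fraction of total variance retained by the first k components.
cumulative = np.cumsum(pca.explained_variance_ratio_)
print(cumulative)  # the first component alone retains ~92% on iris

# Same quantity computed from explained_variance_ and the data itself
# (scikit-learn uses the n_samples - 1 convention for these variances):
total_var = X.var(axis=0, ddof=1).sum()
retained = pca.explained_variance_.sum() / total_var
print(retained)  # ~1.0 here, since every component was kept
```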
]]></description><link>online-vault/ml-concepts/dimensionality-reduction/principal-component-analysis.html</link><guid isPermaLink="false">Online Vault/ML concepts/Dimensionality Reduction/Principal Component Analysis.md</guid><pubDate>Mon, 01 Dec 2025 14:15:27 GMT</pubDate><enclosure url="." length="0" type="false"/><content:encoded>&lt;figure&gt;&lt;img src=&quot;.&quot;&gt;&lt;/figure&gt;</content:encoded></item><item><title><![CDATA[Diffusion Beats Autoregressive in Data-Constrained Settings]]></title><description><![CDATA[<a data-tooltip-position="top" aria-label="https://arxiv.org/html/2507.15857v6" rel="noopener nofollow" class="external-link is-unresolved" href="https://arxiv.org/html/2507.15857v6" target="_self">paper link</a>A Pareto frontier is a concept from multi-objective optimization. In the context of the paper's "Figure 1," it represents the set of models that are the most efficient. A model is on the Pareto frontier if you can't improve one objective (e.g., lower validation loss) without worsening another objective (e.g., increasing training FLOPs).Essentially, the Pareto frontier shows the best possible trade-off between model performance (validation loss) and computational cost (training FLOPs). Any model that falls below the frontier is sub-optimal because a better model exists that either has a lower loss for the same compute or the same loss for less compute.The Chinchilla-optimal compute point refers to the optimal balance between a language model's size and the amount of data it's trained on, as determined by the "Chinchilla" scaling laws from the paper by Hoffmann et al. (2022).This point indicates the most efficient allocation of a given computational budget. The original Chinchilla paper found that prior models were under-trained, meaning they used too little data for their size. The Chinchilla-optimal point identifies the ideal training duration (and thus, compute) to achieve the best performance for a model of a specific size. 
In the context of this paper, this is the point where the autoregressive (AR) model is initially at its peak performance, before it begins to overfit with repeated data.
<br>
Diffusion models surpass autoregressive models given sufficient compute.&nbsp;Across a wide range of unique token budgets, we observe a consistent trend: autoregressive models initially outperform diffusion models at low compute, but quickly saturate. Beyond a critical compute threshold, diffusion models continue improving and ultimately achieve better performance (Section&nbsp;<a data-tooltip-position="top" aria-label="https://arxiv.org/html/2507.15857v6#S4.SS1" rel="noopener nofollow" class="external-link is-unresolved" title="4.1 Does Diffusion Beat AR in Data-Constrained Settings? ‣ 4 Experiments ‣ Diffusion Beats Autoregressive in Data-Constrained Settings" href="https://arxiv.org/html/2507.15857v6#S4.SS1" target="_self">4.1</a>) <br>
2.&nbsp;
Diffusion models benefit far more from repeated data.&nbsp;Prior work&nbsp;(<a data-tooltip-position="top" aria-label="https://arxiv.org/html/2507.15857v6#bib.bib21" rel="noopener nofollow" class="external-link is-unresolved" href="https://arxiv.org/html/2507.15857v6#bib.bib21" target="_self">muennighoff2023scaling</a>)&nbsp;showed that repeating the dataset up to 4 epochs is nearly as effective as using fresh data for autoregressive models. In contrast, we find that diffusion models can be trained on repeated data for up to&nbsp;100 epochs, with repeated data remaining almost as effective as fresh data (Section&nbsp;<a data-tooltip-position="top" aria-label="https://arxiv.org/html/2507.15857v6#S4.SS2" rel="noopener nofollow" class="external-link is-unresolved" title="4.2 Fitting Data-Constrained Scaling Laws ‣ 4 Experiments ‣ Diffusion Beats Autoregressive in Data-Constrained Settings" href="https://arxiv.org/html/2507.15857v6#S4.SS2" target="_self">4.2</a>). <br>
3.&nbsp;
Diffusion models have a much higher effective epoch count.&nbsp;Muennighoff&nbsp;et al.&nbsp;<a data-tooltip-position="top" aria-label="https://arxiv.org/html/2507.15857v6#bib.bib21" rel="noopener nofollow" class="external-link is-unresolved" href="https://arxiv.org/html/2507.15857v6#bib.bib21" target="_self">muennighoff2023scaling</a>&nbsp;fit scaling laws for AR models in data-constrained settings and define&nbsp;RD∗&nbsp;as a learned constant that characterizes the number of epochs after which training more epochs results in significantly diminished returns. For autoregressive models, they estimate&nbsp;RD∗≈15. In contrast, we find&nbsp;RD∗≈500&nbsp;for diffusion models, suggesting they can benefit from repeated data over far more epochs without major degradation (Section&nbsp;<a data-tooltip-position="top" aria-label="https://arxiv.org/html/2507.15857v6#S4.SS2" rel="noopener nofollow" class="external-link is-unresolved" title="4.2 Fitting Data-Constrained Scaling Laws ‣ 4 Experiments ‣ Diffusion Beats Autoregressive in Data-Constrained Settings" href="https://arxiv.org/html/2507.15857v6#S4.SS2" target="_self">4.2</a>). <br>
4.&nbsp;
Critical compute point follows a power law with dataset size.&nbsp;We find that the amount of compute required for diffusion models to outperform autoregressive models—the critical compute point—scales as a power law with the number of unique tokens. This yields a closed-form expression that predicts when diffusion becomes the favorable modeling choice for any given dataset size (Section&nbsp;<a data-tooltip-position="top" aria-label="https://arxiv.org/html/2507.15857v6#S4.SS3" rel="noopener nofollow" class="external-link is-unresolved" title="4.3 When to Use Diffusion over AR? ‣ 4 Experiments ‣ Diffusion Beats Autoregressive in Data-Constrained Settings" href="https://arxiv.org/html/2507.15857v6#S4.SS3" target="_self">4.3</a>). <br>
5.&nbsp;
Diffusion models yield better downstream performance.&nbsp;We find that the above benefits extend beyond validation loss: the best diffusion model trained in data-constrained settings consistently outperforms the best autoregressive model on a range of downstream language tasks (Section&nbsp;<a data-tooltip-position="top" aria-label="https://arxiv.org/html/2507.15857v6#S4.SS4" rel="noopener nofollow" class="external-link is-unresolved" title="4.4 Downstream Results ‣ 4 Experiments ‣ Diffusion Beats Autoregressive in Data-Constrained Settings" href="https://arxiv.org/html/2507.15857v6#S4.SS4" target="_self">4.4</a>). <br>
6.&nbsp;
Exposure to different token orderings helps explain diffusion’s data efficiency.&nbsp;By adding explicit data augmentations to AR training, we find that diffusion models’ advantage arises from their exposure to a diverse set of token orderings. Essentially, the randomized masking in diffusion’s objective serves as implicit data augmentation, allowing it to generalize beyond the fixed left-to-right factorization of AR models. (Section&nbsp;<a data-tooltip-position="top" aria-label="https://arxiv.org/html/2507.15857v6#S4.SS5" rel="noopener nofollow" class="external-link is-unresolved" title="4.5 Why do Diffusion models outperform AR models in data-constrained settings? ‣ 4 Experiments ‣ Diffusion Beats Autoregressive in Data-Constrained Settings" href="https://arxiv.org/html/2507.15857v6#S4.SS5" target="_self">4.5</a>) ]]></description><link>online-vault/papers/diffusion-beats-autoregressive-in-data-constrained-settings.html</link><guid isPermaLink="false">Online Vault/Papers/Diffusion Beats Autoregressive in Data-Constrained Settings.md</guid><pubDate>Thu, 21 Aug 2025 13:04:43 GMT</pubDate></item><item><title><![CDATA[Bayesian Optimisation]]></title><description><![CDATA[<img alt="bayesian_opt.png" src="online-vault/ml-concepts/bayesian_opt.png" target="_self">]]></description><link>online-vault/ml-concepts/bayesian-optimisation.html</link><guid isPermaLink="false">Online Vault/ML concepts/Bayesian Optimisation.md</guid><pubDate>Thu, 17 Jul 2025 13:04:34 GMT</pubDate><enclosure url="." length="0" type="false"/><content:encoded>&lt;figure&gt;&lt;img src=&quot;.&quot;&gt;&lt;/figure&gt;</content:encoded></item><item><title><![CDATA[bayesian_opt]]></title><description><![CDATA[<img src="online-vault/ml-concepts/bayesian_opt.png" target="_self">]]></description><link>online-vault/ml-concepts/bayesian_opt.html</link><guid isPermaLink="false">Online Vault/ML concepts/bayesian_opt.png</guid><pubDate>Thu, 17 Jul 2025 13:04:26 GMT</pubDate><enclosure url="." 
length="0" type="false"/><content:encoded>&lt;figure&gt;&lt;img src=&quot;.&quot;&gt;&lt;/figure&gt;</content:encoded></item><item><title><![CDATA[spatial_CV]]></title><description><![CDATA[<img src="online-vault/spatial-data-science/spatial_cv.png" target="_self">]]></description><link>online-vault/spatial-data-science/spatial_cv.html</link><guid isPermaLink="false">Online Vault/Spatial Data Science/spatial_CV.png</guid><pubDate>Thu, 05 Jun 2025 08:38:03 GMT</pubDate><enclosure url="." length="0" type="false"/><content:encoded>&lt;figure&gt;&lt;img src=&quot;.&quot;&gt;&lt;/figure&gt;</content:encoded></item><item><title><![CDATA[Power diagram]]></title><description><![CDATA[<img alt="*A power diagram of four circles*" src="online-vault/ml-concepts/power_diagram.png" target="_self">A power diagram of four circles<br>In&nbsp;<a data-tooltip-position="top" aria-label="https://en.wikipedia.org/wiki/Computational_geometry" rel="noopener nofollow" class="external-link is-unresolved" title="Computational geometry" href="https://en.wikipedia.org/wiki/Computational_geometry" target="_self">computational geometry</a>, a&nbsp;power diagram, also called a&nbsp;Laguerre–Voronoi diagram,&nbsp;Dirichlet cell complex,&nbsp;radical Voronoi tesselation&nbsp;or a&nbsp;sectional Dirichlet tesselation, is a partition of the&nbsp;<a data-tooltip-position="top" aria-label="https://en.wikipedia.org/wiki/Euclidean_plane" rel="noopener nofollow" class="external-link is-unresolved" title="Euclidean plane" href="https://en.wikipedia.org/wiki/Euclidean_plane" target="_self">Euclidean plane</a>&nbsp;into&nbsp;<a data-tooltip-position="top" aria-label="https://en.wikipedia.org/wiki/Polygon" rel="noopener nofollow" class="external-link is-unresolved" title="Polygon" href="https://en.wikipedia.org/wiki/Polygon" target="_self">polygonal</a>&nbsp;cells defined from a set of circles. 
The cell for a given circle&nbsp;C&nbsp;consists of all the points for which the&nbsp;<a data-tooltip-position="top" aria-label="https://en.wikipedia.org/wiki/Power_of_a_point" rel="noopener nofollow" class="external-link is-unresolved" title="Power of a point" href="https://en.wikipedia.org/wiki/Power_of_a_point" target="_self">power distance</a>&nbsp;to&nbsp;C&nbsp;is smaller than the power distance to the other circles. The power diagram is a form of generalized&nbsp;<a data-tooltip-position="top" aria-label="https://en.wikipedia.org/wiki/Voronoi_diagram" rel="noopener nofollow" class="external-link is-unresolved" title="Voronoi diagram" href="https://en.wikipedia.org/wiki/Voronoi_diagram" target="_self">Voronoi diagram</a>, and coincides with the Voronoi diagram of the circle centers in the case that all the circles have equal radii.<br>from : <a data-tooltip-position="top" aria-label="https://www.youtube.com/watch?v=7Q2JhZxNPow" rel="noopener nofollow" class="external-link is-unresolved" href="https://www.youtube.com/watch?v=7Q2JhZxNPow" target="_self">Mad Max: Affine Spline Insights into Deep Learning</a><br><img alt="MASO_partition.png" src="online-vault/ml-concepts/maso_partition.png" target="_self">]]></description><link>online-vault/ml-concepts/power-diagram.html</link><guid isPermaLink="false">Online Vault/ML concepts/Power diagram.md</guid><pubDate>Wed, 30 Apr 2025 12:23:37 GMT</pubDate><enclosure url="." length="0" type="false"/><content:encoded>&lt;figure&gt;&lt;img src=&quot;.&quot;&gt;&lt;/figure&gt;</content:encoded></item><item><title><![CDATA[affine spline]]></title><description><![CDATA[<img alt="spline_img.png" src="online-vault/ml-concepts/spline_img.png" target="_self"><br>
figure : Single knots at two interior points establish a spline of three cubic polynomials meeting with C²&nbsp;<a data-tooltip-position="top" aria-label="https://en.wikipedia.org/wiki/Parametric_continuity" rel="noopener nofollow" class="external-link is-unresolved" title="Parametric continuity" href="https://en.wikipedia.org/wiki/Parametric_continuity" target="_self">parametric continuity</a>. Triple knots at both ends of the interval ensure that the curve interpolates the end points.
An affine spline is a piecewise-defined function that generalizes the concept of traditional splines by allowing each segment to be an affine transformation of a base function. Here's a broad yet mathematically accurate definition: an affine spline is a function f defined on an interval [a, b] that is partitioned into subintervals [t_i, t_(i+1)] for i = 0, ..., n-1, where a = t_0 < t_1 < ... < t_n = b. On each subinterval [t_i, t_(i+1)], the function is given by f(x) = A_i g(x) + b_i, where:
g is a base function, often a polynomial or another type of function.
A_i is a linear transformation (which can be a matrix in higher dimensions).
b_i is a translation vector.
The key properties of an affine spline are:
Piecewise Definition: The function is defined piecewise over the subintervals.
Affine Transformations: Each segment of the spline is an affine transformation of a base function.
Continuity: The spline is typically continuous at the knots t_i, meaning the adjacent pieces agree there: A_(i-1) g(t_i) + b_(i-1) = A_i g(t_i) + b_i for i = 1, ..., n-1.
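A small NumPy sketch of the closely related fact used below: a one-hidden-layer ReLU network is affine on each activation region, so the affine piece A x + c can be read off from the weights and the region's activation mask (the weights here are random, purely for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
W1, b1 = rng.normal(size=(8, 2)), rng.normal(size=8)  # hidden layer weights
w2, b2 = rng.normal(size=8), rng.normal()             # output layer weights

def net(x):
    # One hidden layer with ReLU, scalar output.
    return w2 @ np.maximum(W1 @ x + b1, 0.0) + b2

def pattern(x):
    # The ReLU on/off mask identifies which affine piece x falls in.
    return W1 @ x + b1 > 0

x0 = np.array([0.3, -0.2])
mask = pattern(x0)

# Within x0's activation region, zeroing the "off" hidden units makes
# the whole composition a single affine map A x + c.
A = (w2 * mask) @ W1
c = (w2 * mask) @ b1 + b2

assert np.isclose(net(x0), A @ x0 + c)
print("net is affine on x0's region")
```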
Affine splines are useful in various applications, including computer graphics, animation, and data interpolation, where flexibility in the shape and transformation of the spline segments is required.<br>For more details see the <a data-tooltip-position="top" aria-label="https://en.wikipedia.org/wiki/Spline_(mathematics)" rel="noopener nofollow" class="external-link is-unresolved" href="https://en.wikipedia.org/wiki/Spline_(mathematics)" target="_self">wikipedia page</a><br><a data-href="neural networks" href="online-vault/ml-concepts/neural-networks.html" class="internal-link" target="_self" rel="noopener nofollow">neural networks</a> using linear activation units such as ReLu, can be interpreted as partitioning the input space into affine spline regions, also called Maximum Affine Spline Operator (MASO). One way to visualize them is some sort of voronoi diagram in a general sense called a <a data-href="Power diagram" href="online-vault/ml-concepts/power-diagram.html" class="internal-link" target="_self" rel="noopener nofollow">Power diagram</a>.<br>from : <a data-tooltip-position="top" aria-label="https://www.youtube.com/watch?v=7Q2JhZxNPow" rel="noopener nofollow" class="external-link is-unresolved" href="https://www.youtube.com/watch?v=7Q2JhZxNPow" target="_self">Mad Max: Affine Spline Insights into Deep Learning</a><br><img alt="MASO_partition.png" src="online-vault/ml-concepts/maso_partition.png" target="_self">]]></description><link>online-vault/ml-concepts/affine-spline.html</link><guid isPermaLink="false">Online Vault/ML concepts/affine spline.md</guid><pubDate>Wed, 30 Apr 2025 12:14:57 GMT</pubDate><enclosure url="." 
length="0" type="false"/><content:encoded>&lt;figure&gt;&lt;img src=&quot;.&quot;&gt;&lt;/figure&gt;</content:encoded></item><item><title><![CDATA[MASO_partition]]></title><description><![CDATA[<img src="online-vault/ml-concepts/maso_partition.png" target="_self">]]></description><link>online-vault/ml-concepts/maso_partition.html</link><guid isPermaLink="false">Online Vault/ML concepts/MASO_partition.png</guid><pubDate>Wed, 30 Apr 2025 12:09:59 GMT</pubDate><enclosure url="." length="0" type="false"/><content:encoded>&lt;figure&gt;&lt;img src=&quot;.&quot;&gt;&lt;/figure&gt;</content:encoded></item><item><title><![CDATA[power_diagram]]></title><description><![CDATA[<img src="online-vault/ml-concepts/power_diagram.png" target="_self">]]></description><link>online-vault/ml-concepts/power_diagram.html</link><guid isPermaLink="false">Online Vault/ML concepts/power_diagram.png</guid><pubDate>Wed, 30 Apr 2025 12:05:59 GMT</pubDate><enclosure url="." length="0" type="false"/><content:encoded>&lt;figure&gt;&lt;img src=&quot;.&quot;&gt;&lt;/figure&gt;</content:encoded></item><item><title><![CDATA[deep learning]]></title><description><![CDATA[<img alt="deep_net_folding_input_space.png" src="online-vault/ml-concepts/deep_net_folding_input_space.png" target="_self"><br><a data-tooltip-position="top" aria-label="https://medium.com/@vishalvignesh/udl-book-review-the-new-deep-learning-textbook-youll-want-to-finish-69e1557b018d" rel="noopener nofollow" class="external-link is-unresolved" href="https://medium.com/@vishalvignesh/udl-book-review-the-new-deep-learning-textbook-youll-want-to-finish-69e1557b018d" target="_self">article on Understand Deep Learning book</a><br>
<a data-tooltip-position="top" aria-label="https://udlbook.github.io/udlbook/" rel="noopener nofollow" class="external-link is-unresolved" href="https://udlbook.github.io/udlbook/" target="_self">UDL book</a>]]></description><link>online-vault/ml-concepts/deep-learning.html</link><guid isPermaLink="false">Online Vault/ML concepts/deep learning.md</guid><pubDate>Mon, 28 Apr 2025 15:15:43 GMT</pubDate><enclosure url="." length="0" type="false"/><content:encoded>&lt;figure&gt;&lt;img src=&quot;.&quot;&gt;&lt;/figure&gt;</content:encoded></item><item><title><![CDATA[deep_net_folding_input_space]]></title><description><![CDATA[<img src="online-vault/ml-concepts/deep_net_folding_input_space.png" target="_self">]]></description><link>online-vault/ml-concepts/deep_net_folding_input_space.html</link><guid isPermaLink="false">Online Vault/ML concepts/deep_net_folding_input_space.png</guid><pubDate>Mon, 28 Apr 2025 15:14:14 GMT</pubDate><enclosure url="." length="0" type="false"/><content:encoded>&lt;figure&gt;&lt;img src=&quot;.&quot;&gt;&lt;/figure&gt;</content:encoded></item><item><title><![CDATA[Neural Networks are Elastic Origami]]></title><description><![CDATA[mSource : <a data-tooltip-position="top" aria-label="https://www.youtube.com/watch?v=l3O2J3LMxqI&amp;ab_channel=MachineLearningStreetTalk" rel="noopener nofollow" class="external-link is-unresolved" href="https://www.youtube.com/watch?v=l3O2J3LMxqI&amp;ab_channel=MachineLearningStreetTalk" target="_self"># Neural Networks Are Elastic Origami!</a><br><img alt="Spline theory.png" src="online-vault/papers/spline-theory.png" target="_self"><br>
Neural networks are an <a data-href="affine spline" href="online-vault/ml-concepts/affine-spline.html" class="internal-link" target="_self" rel="noopener nofollow">affine spline</a>: they partition the whole input space into regions, one group per classification class, like a honeycomb structure over the <a data-href="vector space" href=".html" class="internal-link" target="_self" rel="noopener nofollow">vector space</a>. To be precise, the input space is divided into <a data-href="convex polytopes" href="online-vault/ml-concepts/convex-polytopes.html" class="internal-link" target="_self" rel="noopener nofollow">convex polytopes</a>.
Spline Theory
Spline theory is a mathematical concept used to create smooth, piecewise-defined functions called splines. These functions are constructed from polynomial segments that are connected at points called knots, ensuring continuity and differentiability across the entire function. Splines are particularly useful for interpolation and approximation of data, providing a flexible and efficient way to model complex shapes and surfaces by adjusting the number and position of knots.
In the context of neural networks, spline theory is applied to enhance model flexibility and performance. Spline-based activation functions can be used to introduce non-linearity in neural networks, allowing them to learn more complex patterns in data. Additionally, spline interpolation can be employed to preprocess data or to create smooth decision boundaries in classification tasks.
Emergent behavior in neural nets arises because learning from data in one part of the space affects the rest of the space, even where there is no data, allowing the network to extrapolate to unseen inputs.<br><a data-tooltip-position="top" aria-label="https://arxiv.org/pdf/2402.15555" rel="noopener nofollow" class="external-link is-unresolved" href="https://arxiv.org/pdf/2402.15555" target="_self">Deep Networks Always Grok and Here is Why</a>
Grokking
Grokking is a phenomenon observed in machine learning where a model's performance on held-out data continues to improve long after it has achieved near-perfect accuracy on the training set, yielding better generalization. Rather than simply memorizing the training examples, the model gradually refines its internal representations to capture the underlying structure of the data. Grokking typically appears when training continues well past apparent convergence, allowing the model to develop more robust and generalizable features; it highlights that models can keep improving their generalization long after the training metrics have saturated.
<br><img alt="grokking.png" src="online-vault/papers/grokking.png" target="_self"><br><img alt="local_complexity.png" src="online-vault/papers/local_complexity.png" target="_self">
In the context of grokking, local complexity refers to the intricate patterns and structures within the data that a model learns to recognize over extended training periods. As a model "groks" the data, it refines its understanding of these local complexities, improving its ability to generalize beyond the training set. This involves capturing subtle, non-obvious relationships and features that are not immediately apparent but contribute to the model's overall performance. By focusing on local complexity, the model can better distinguish between relevant patterns and noise, leading to more robust and accurate predictions.
Spurious correlation refers to a statistical relationship between two variables that appears to be significant but is actually coincidental or caused by a third, unrelated factor. This type of correlation does not imply causation and can lead to misleading conclusions if not carefully analyzed. Spurious correlations often arise due to random chance, data noise, or the influence of external variables that affect both datasets independently. Recognizing and addressing spurious correlations is crucial in data analysis and scientific research to ensure that conclusions are based on genuine relationships rather than coincidental associations.
Curriculum learning is a training strategy in machine learning inspired by the way humans learn, where a model is trained on tasks of increasing difficulty over time. Instead of presenting all data at once, the model is first exposed to simpler examples or subtasks, gradually progressing to more complex ones. This approach helps the model to build a strong foundation of basic concepts before tackling more challenging problems, potentially leading to better generalization and faster convergence.
By structuring the learning process in this manner, curriculum learning aims to improve the model's ability to learn complex patterns and enhance its overall performance on the target task.In machine learning, eigenspace and eigenvalues are fundamental concepts in linear algebra that play crucial roles in various algorithms and techniques. Eigenspace refers to the space spanned by the eigenvectors of a matrix, which are the directions that remain unchanged in transformation except for scaling by their corresponding eigenvalues. Eigenvalues indicate the magnitude of these scalings, providing insights into the importance or variance captured by each eigenvector. These concepts are pivotal in dimensionality reduction techniques like Principal Component Analysis (PCA), where data is projected onto a lower-dimensional eigenspace to retain the most significant variations.<br><a data-tooltip-position="top" aria-label="Linear Probe" data-href="Linear Probe" href="online-vault/ml-concepts/linear-probe.html" class="internal-link" target="_self" rel="noopener nofollow">linear probing</a>, is a technique used to evaluate the quality of representations learned by neural networks, particularly in transfer learning and self-supervised learning contexts. A linear probe involves training a simple linear classifier on top of the frozen representations (features) extracted from a pre-trained model. The performance of this classifier serves as an indicator of how well the original model's features generalize to new tasks without further fine-tuning. This method helps assess the effectiveness of the learned representations in capturing relevant information for downstream tasks.]]></description><link>online-vault/papers/neural-networks-are-elastic-origami.html</link><guid isPermaLink="false">Online Vault/Papers/Neural Networks are Elastic Origami.md</guid><pubDate>Mon, 28 Apr 2025 15:13:33 GMT</pubDate><enclosure url="." 
length="0" type="false"/><content:encoded>&lt;figure&gt;&lt;img src=&quot;.&quot;&gt;&lt;/figure&gt;</content:encoded></item><item><title><![CDATA[Pasted image 20250428171327]]></title><description><![CDATA[<img src="online-vault/papers/pasted-image-20250428171327.png" target="_self">]]></description><link>online-vault/papers/pasted-image-20250428171327.html</link><guid isPermaLink="false">Online Vault/Papers/Pasted image 20250428171327.png</guid><pubDate>Mon, 28 Apr 2025 15:13:27 GMT</pubDate><enclosure url="." length="0" type="false"/><content:encoded>&lt;figure&gt;&lt;img src=&quot;.&quot;&gt;&lt;/figure&gt;</content:encoded></item><item><title><![CDATA[spline_img]]></title><description><![CDATA[<img src="online-vault/ml-concepts/spline_img.png" target="_self">]]></description><link>online-vault/ml-concepts/spline_img.html</link><guid isPermaLink="false">Online Vault/ML concepts/spline_img.png</guid><pubDate>Mon, 28 Apr 2025 14:13:51 GMT</pubDate><enclosure url="." length="0" type="false"/><content:encoded>&lt;figure&gt;&lt;img src=&quot;.&quot;&gt;&lt;/figure&gt;</content:encoded></item><item><title><![CDATA[convex polytopes]]></title><description><![CDATA[<img alt="convex_polytope.png" src="online-vault/ml-concepts/convex_polytope.png" target="_self"><br>A&nbsp;convex polytope&nbsp;is a special case of a&nbsp;<a data-tooltip-position="top" aria-label="https://en.wikipedia.org/wiki/Polytope" rel="noopener nofollow" class="external-link is-unresolved" title="Polytope" href="https://en.wikipedia.org/wiki/Polytope" target="_self">polytope</a>, having the additional property that it is also a&nbsp;<a data-tooltip-position="top" aria-label="https://en.wikipedia.org/wiki/Convex_set" rel="noopener nofollow" class="external-link is-unresolved" title="Convex set" href="https://en.wikipedia.org/wiki/Convex_set" target="_self">convex set</a>&nbsp;contained in the&nbsp;-dimensional Euclidean space. 
Most texts<a data-tooltip-position="top" aria-label="https://en.wikipedia.org/wiki/Convex_polytope#cite_note-grun-1" rel="noopener nofollow" class="external-link is-unresolved" href="https://en.wikipedia.org/wiki/Convex_polytope#cite_note-grun-1" target="_self">1</a><a data-tooltip-position="top" aria-label="https://en.wikipedia.org/wiki/Convex_polytope#cite_note-zieg-2" rel="noopener nofollow" class="external-link is-unresolved" href="https://en.wikipedia.org/wiki/Convex_polytope#cite_note-zieg-2" target="_self">2</a>&nbsp;use the term "polytope" for a&nbsp;<a data-tooltip-position="top" aria-label="https://en.wikipedia.org/wiki/Bounded_set" rel="noopener nofollow" class="external-link is-unresolved" title="Bounded set" href="https://en.wikipedia.org/wiki/Bounded_set" target="_self">bounded</a>&nbsp;convex polytope, and the word "polyhedron" for the more general, possibly unbounded object. Others<a data-tooltip-position="top" aria-label="https://en.wikipedia.org/wiki/Convex_polytope#cite_note-Jeter-3" rel="noopener nofollow" class="external-link is-unresolved" href="https://en.wikipedia.org/wiki/Convex_polytope#cite_note-Jeter-3" target="_self">3</a>&nbsp;(including this article) allow polytopes to be unbounded. The terms "bounded/unbounded convex polytope" will be used below whenever the boundedness is critical to the discussed issue. 
Yet other texts identify a convex polytope with its boundary.<br>Convex polytopes play an important role both in various branches of&nbsp;<a data-tooltip-position="top" aria-label="https://en.wikipedia.org/wiki/Mathematics" rel="noopener nofollow" class="external-link is-unresolved" title="Mathematics" href="https://en.wikipedia.org/wiki/Mathematics" target="_self">mathematics</a>&nbsp;and in applied areas, most notably in&nbsp;<a data-tooltip-position="top" aria-label="https://en.wikipedia.org/wiki/Linear_programming" rel="noopener nofollow" class="external-link is-unresolved" title="Linear programming" href="https://en.wikipedia.org/wiki/Linear_programming" target="_self">linear programming</a>.<br>In the influential textbooks of Grünbaum<a data-tooltip-position="top" aria-label="https://en.wikipedia.org/wiki/Convex_polytope#cite_note-grun-1" rel="noopener nofollow" class="external-link is-unresolved" href="https://en.wikipedia.org/wiki/Convex_polytope#cite_note-grun-1" target="_self">1</a>&nbsp;and Ziegler<a data-tooltip-position="top" aria-label="https://en.wikipedia.org/wiki/Convex_polytope#cite_note-zieg-2" rel="noopener nofollow" class="external-link is-unresolved" href="https://en.wikipedia.org/wiki/Convex_polytope#cite_note-zieg-2" target="_self">2</a>&nbsp;on the subject, as well as in many other texts in&nbsp;<a data-tooltip-position="top" aria-label="https://en.wikipedia.org/wiki/Discrete_geometry" rel="noopener nofollow" class="external-link is-unresolved" title="Discrete geometry" href="https://en.wikipedia.org/wiki/Discrete_geometry" target="_self">discrete geometry</a>, convex polytopes are often simply called "polytopes". Grünbaum points out that this is solely to avoid the endless repetition of the word "convex", and that the discussion should throughout be understood as applying only to the convex variety (p. 
51).<br>A polytope is called&nbsp;full-dimensional&nbsp;if it is an n-dimensional object in n-dimensional Euclidean space.<br>Broadly speaking,&nbsp;convex polytopes&nbsp;are sets of points in&nbsp;<a data-tooltip-position="top" aria-label="https://polytope.miraheze.org/wiki/Euclidean_space" rel="noopener nofollow" class="external-link is-unresolved" title="Euclidean space" href="https://polytope.miraheze.org/wiki/Euclidean_space" target="_self">Euclidean space</a>&nbsp;that define a volume without curved elements, dimples, or self-intersections. Any line segment whose endpoints are contained in the set should be completely contained within the set. Basic examples include the&nbsp;<a data-tooltip-position="top" aria-label="https://polytope.miraheze.org/wiki/Triangle" rel="noopener nofollow" class="external-link is-unresolved" title="Triangle" href="https://polytope.miraheze.org/wiki/Triangle" target="_self">triangle</a>, the&nbsp;<a data-tooltip-position="top" aria-label="https://polytope.miraheze.org/wiki/Square" rel="noopener nofollow" class="external-link is-unresolved" title="Square" href="https://polytope.miraheze.org/wiki/Square" target="_self">square</a>, or the&nbsp;<a data-tooltip-position="top" aria-label="https://polytope.miraheze.org/wiki/Cube" rel="noopener nofollow" class="external-link is-unresolved" title="Cube" href="https://polytope.miraheze.org/wiki/Cube" target="_self">cube</a>. 
Counterexamples include the&nbsp;<a data-tooltip-position="top" aria-label="https://polytope.miraheze.org/wiki/Sphere" rel="noopener nofollow" class="external-link is-unresolved" title="Sphere" href="https://polytope.miraheze.org/wiki/Sphere" target="_self">sphere</a>, the&nbsp;<a data-tooltip-position="top" aria-label="https://polytope.miraheze.org/wiki/Star" rel="noopener nofollow" class="external-link is-unresolved" title="Star" href="https://polytope.miraheze.org/wiki/Star" target="_self">star</a>, or&nbsp;<a data-tooltip-position="top" aria-label="https://polytope.miraheze.org/wiki/Toroidal_polyhedron" rel="noopener nofollow" class="external-link is-unresolved" title="Toroidal polyhedron" href="https://polytope.miraheze.org/wiki/Toroidal_polyhedron" target="_self">toroidal polyhedra</a>.<br>More technically, convex polytopes are a subclass of&nbsp;<a data-tooltip-position="top" aria-label="https://polytope.miraheze.org/wiki/Convex_set" rel="noopener nofollow" class="external-link is-unresolved" title="Convex set" href="https://polytope.miraheze.org/wiki/Convex_set" target="_self">convex sets</a>. 
They can be defined either through having finitely many&nbsp;<a data-tooltip-position="top" aria-label="https://polytope.miraheze.org/wiki/Vertices" rel="noopener nofollow" class="external-link is-unresolved" title="Vertices" href="https://polytope.miraheze.org/wiki/Vertices" target="_self">vertices</a>, or finitely many&nbsp;<a data-tooltip-position="top" aria-label="https://polytope.miraheze.org/wiki/Facet" rel="noopener nofollow" class="external-link is-unresolved" title="Facet" href="https://polytope.miraheze.org/wiki/Facet" target="_self">facets</a>&nbsp;– a foundational result implies these definitions are equivalent.<br>Convex polytopes are generally much better studied than their&nbsp;<a data-tooltip-position="top" aria-label="https://polytope.miraheze.org/wiki/Abstract" rel="noopener nofollow" class="external-link is-unresolved" title="Abstract" href="https://polytope.miraheze.org/wiki/Abstract" target="_self">abstract</a>&nbsp;counterparts, as they have many important applications in science and engineering, particularly&nbsp;<a data-tooltip-position="top" aria-label="https://en.wikipedia.org/wiki/Convex_optimization" rel="noopener nofollow" class="external-link is-unresolved" title="w:Convex optimization" href="https://en.wikipedia.org/wiki/Convex_optimization" target="_self">convex optimization</a>. Many sources simply use the word "polytope" to mean "convex polytope".<br>Neural networks built with a piecewise-linear activation function such as ReLU form an <a data-href="affine spline" href="online-vault/ml-concepts/affine-spline.html" class="internal-link" target="_self" rel="noopener nofollow">affine spline</a> over the input space, partitioning it into regions, each of which is a convex polytope. 
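A minimal sketch of this connection (the layer weights below are chosen for illustration and are not drawn from any cited paper): each distinct on/off pattern of a ReLU layer's units selects one region of the input space, and each region is an intersection of half-planes, i.e. a convex polytope.

```python
from itertools import product

# Hyperplanes (weights, bias) of a tiny 3-neuron ReLU layer on 2-D input.
# Illustrative values: the lines x = 0, y = 0, and x + y = 1.
neurons = [((1.0, 0.0), 0.0),
           ((0.0, 1.0), 0.0),
           ((1.0, 1.0), -1.0)]

def activation_pattern(x, y):
    """Which neurons are 'on' (pre-activation > 0) at this input point."""
    return tuple(w[0] * x + w[1] * y + b > 0 for (w, b) in neurons)

# Sweep a grid; every distinct on/off pattern corresponds to one linear
# region of the network, bounded by the neurons' hyperplanes.
grid = [i * 0.25 for i in range(-8, 9)]
patterns = {activation_pattern(x, y) for x, y in product(grid, grid)}
print(len(patterns))  # 3 lines in general position -> 7 regions
```

On each region the network computes a single affine map, which is why the overall function is an affine spline.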
More details can be found in the <a data-href="Neural Networks are Elastic Origami" href="online-vault/papers/neural-networks-are-elastic-origami.html" class="internal-link" target="_self" rel="noopener nofollow">Neural Networks are Elastic Origami</a> article.]]></description><link>online-vault/ml-concepts/convex-polytopes.html</link><guid isPermaLink="false">Online Vault/ML concepts/convex polytopes.md</guid><pubDate>Mon, 28 Apr 2025 14:09:27 GMT</pubDate><enclosure url="." length="0" type="false"/><content:encoded>&lt;figure&gt;&lt;img src=&quot;.&quot;&gt;&lt;/figure&gt;</content:encoded></item><item><title><![CDATA[convex_polytope]]></title><description><![CDATA[<img src="online-vault/ml-concepts/convex_polytope.png" target="_self">]]></description><link>online-vault/ml-concepts/convex_polytope.html</link><guid isPermaLink="false">Online Vault/ML concepts/convex_polytope.png</guid><pubDate>Mon, 28 Apr 2025 13:56:35 GMT</pubDate><enclosure url="." length="0" type="false"/><content:encoded>&lt;figure&gt;&lt;img src=&quot;.&quot;&gt;&lt;/figure&gt;</content:encoded></item><item><title><![CDATA[The Tunnel Effect - Building Data Representations in Deep Neural Networks]]></title><description><![CDATA[<a data-tooltip-position="top" aria-label="https://proceedings.neurips.cc/paper_files/paper/2023/file/f249db9ab5975586f36df46f8958c008-Paper-Conference.pdf" rel="noopener nofollow" class="external-link is-unresolved" href="https://proceedings.neurips.cc/paper_files/paper/2023/file/f249db9ab5975586f36df46f8958c008-Paper-Conference.pdf" target="_self">paper link</a>
<br>The central idea is the "Tunnel Effect": sufficiently deep and overparameterized <a data-href="neural networks" href="online-vault/ml-concepts/neural-networks.html" class="internal-link" target="_self" rel="noopener nofollow">neural networks</a> trained on supervised tasks (like image classification) naturally split into two distinct parts: an "extractor" and a "tunnel".<br>Extractor: These are the initial layers of the network. Their primary role is to build increasingly complex and linearly-separable representations of the input data. A large portion of the network's final performance on the training task is achieved by the layers within the extractor.<br>Tunnel: These are the subsequent, deeper layers following the extractor. This part primarily compresses the representations created by the extractor, significantly reducing their dimensionality through <a data-href="dimensionality reduction" href="online-vault/ml-concepts/dimensionality-reduction.html" class="internal-link" target="_self" rel="noopener nofollow">dimensionality reduction</a> (measured, for instance, by <a data-tooltip-position="top" aria-label="Feature matrix rank" data-href="Feature matrix rank" href="online-vault/ml-concepts/feature-matrix-rank.html" class="internal-link" target="_self" rel="noopener nofollow">numerical rank</a>). These tunnel layers contribute minimally to improving the final performance on the in-distribution task.<br><img alt="tunnel_effect.png" src="online-vault/ml-concepts/tunnel_effect.png" target="_self">Figure 3 reveals that for VGG-19 the intra-class representation variation decreases throughout the tunnel, meaning that representation clusters contract towards their centers. 
At the same time, the average distance between the centers of the clusters grows (inter-class variance).<br><img alt="tunnel_compressed_representations.png" src="online-vault/papers/tunnel_compressed_representations.png" target="_self"><br>Emergence: This extractor-tunnel structure emerges early in the training process and remains stable throughout the rest of training. The split is observable not just in the representations but also in the parameter space (tunnel layers change less during training).<br>Dependence on Capacity and Complexity: The length of the tunnel increases with the network's capacity (depth or width). Deeper networks tend to have longer tunnels, suggesting the network allocates a fixed capacity for the extractor part for a given task. The tunnel length is inversely related to task complexity; tasks with fewer classes result in longer tunnels. The number of training samples doesn't seem to impact tunnel length as much as the number of classes.<br>Out-of-Distribution (OOD) Generalization: The compression happening in the tunnel significantly degrades the model's performance on OOD data. Performance on OOD tasks often peaks at the layer just before the tunnel begins and declines thereafter. This drop in OOD performance is strongly correlated with the decrease in the representations' numerical rank within the tunnel.<br>In essence, the paper argues that in deep networks, beyond a certain point (the end of the extractor), additional layers primarily serve to compress representations (forming the tunnel), which helps slightly refine in-distribution performance but harms OOD generalization and has mixed implications for continual learning.<br>Let us now relate the "Tunnel Effect" described by Masarczyk et al. to the "Grokking" phenomenon, focusing on the apparent contradiction regarding feature rank and out-of-distribution (OOD) generalization.<br>1. 
Defining the Phenomena: Tunnel Effect (Masarczyk et al.): In sufficiently deep/overparameterized networks trained for supervised classification, the layers split spatially into an 'extractor' (builds linearly separable features) and a 'tunnel' (compresses these features). This compression within the tunnel involves a decrease in the numerical rank of representations. Critically, this paper finds that the tunnel and the associated rank decrease degrade OOD generalization performance. Performance on OOD tasks often peaks before the tunnel begins. <br>
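The numerical rank mentioned here can be measured with a short SVD-based sketch; the threshold below uses the common relative-tolerance convention (σ_max · max(n, d) · ε, as in numpy's matrix-rank default), and the matrix sizes and tolerance are illustrative choices, not taken from the paper.

```python
import numpy as np

def numerical_rank(W, eps=1e-4):
    """Count singular values above the relative threshold
    sigma_max * max(n, d) * eps (the usual matrix-rank convention)."""
    s = np.linalg.svd(W, compute_uv=False)
    threshold = s.max() * max(W.shape) * eps
    return int((s > threshold).sum())

rng = np.random.default_rng(0)
# A "tunnel-compressed" representation: 256 samples confined to 5 directions.
low_rank = rng.normal(size=(256, 5)) @ rng.normal(size=(5, 64))
# An uncompressed representation of the same shape.
full_rank = rng.normal(size=(256, 64))
print(numerical_rank(low_rank), numerical_rank(full_rank))  # 5 64
```

Tracking this number layer by layer is how the extractor/tunnel boundary is located: the rank stays high through the extractor and collapses inside the tunnel.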
<a data-href="Grokking" href="online-vault/ml-concepts/grokking.html" class="internal-link" target="_self" rel="noopener nofollow">Grokking</a>: This is a temporal phenomenon observed during training, often on algorithmic tasks or specific datasets. The model first achieves near-perfect training accuracy (memorization) while validation/test accuracy remains low. Then, after a potentially long period of continued training with little apparent progress, the validation/test accuracy suddenly jumps to near-perfect levels (generalization). Some studies suggest this transition to generalization coincides with the model finding a simpler, structured, or lower-rank solution. In this context, the eventual decrease in complexity/<a data-href="Feature matrix rank" href="online-vault/ml-concepts/feature-matrix-rank.html" class="internal-link" target="_self" rel="noopener nofollow">Feature matrix rank</a> is associated with improved generalization. The phenomena are distinct and the context matters. Here's why the relationship between rank and OOD generalization might appear different:
Spatial vs. Temporal: The Tunnel Effect is primarily a spatial phenomenon describing the function of different layers within a fully trained or partially trained network. Grokking is a temporal phenomenon describing a phase transition over training time. The rank decrease in the tunnel might signify discarding features useful for OOD but unnecessary for the primary task, while the simplification in grokking might signify finding the core, generalizable structure after having explored more complex, memorizing solutions.
Nature of Compression/Simplification: In the Tunnel Effect, the compression happens after linearly separable features are formed by the extractor. This later-stage compression might aggressively optimize for the training distribution, discarding nuances needed for OOD tasks. It's akin to over-optimization on the training data's specific features.
In Grokking, the simplification represents a shift from a complex, high-dimensional memorization solution to a simpler, underlying rule-based solution. This simplification is the mechanism of generalization in that context. Type of Rank/Complexity Measured: The Tunnel Effect paper specifically measures the numerical rank of layer activations (representations). Some Grokking literature discusses the rank of weight matrices or other complexity measures. These are not necessarily the same thing, and their dynamics might differ.
Task and Dataset Differences: The Tunnel Effect was demonstrated on standard image classification tasks (CIFAR, CINIC). Grokking is often studied on algorithmic tasks (like modular arithmetic) or smaller datasets where the dynamics of memorization vs. generalization are more distinct and separable. The nature of generalization and the features required might differ significantly.
Role of Regularization: Regularization (like weight decay) seems crucial for enabling the transition to the generalizing solution in Grokking; without it, models might stay stuck in the memorization phase. While the Tunnel Effect experiments used standard weight decay, the paper doesn't frame the tunnel itself as primarily driven by the search for low-weight-norm solutions in the same way Grokking literature sometimes does.
While both phenomena involve changes in representation rank or complexity, they describe different processes. The Tunnel Effect describes how deeper layers in standard classification networks compress features, potentially harming OOD performance by discarding useful variance. Grokking describes a delayed phase transition during training where the network eventually discovers a simpler, underlying structure necessary for generalization, which might involve some form of rank reduction or simplification. The contrasting correlation with OOD generalization stems from these fundamental differences in mechanism, context, and potentially the specific metrics used.]]></description><link>online-vault/papers/the-tunnel-effect-building-data-representations-in-deep-neural-networks.html</link><guid isPermaLink="false">Online Vault/Papers/The Tunnel Effect - Building Data Representations in Deep Neural Networks.md</guid><pubDate>Thu, 24 Apr 2025 14:53:35 GMT</pubDate><enclosure url="." length="0" type="false"/><content:encoded>&lt;figure&gt;&lt;img src=&quot;.&quot;&gt;&lt;/figure&gt;</content:encoded></item><item><title><![CDATA[tunnel_compressed_representations]]></title><description><![CDATA[<img src="online-vault/papers/tunnel_compressed_representations.png" target="_self">]]></description><link>online-vault/papers/tunnel_compressed_representations.html</link><guid isPermaLink="false">Online Vault/Papers/tunnel_compressed_representations.png</guid><pubDate>Thu, 24 Apr 2025 14:49:21 GMT</pubDate><enclosure url="." 
length="0" type="false"/><content:encoded>&lt;figure&gt;&lt;img src=&quot;.&quot;&gt;&lt;/figure&gt;</content:encoded></item><item><title><![CDATA[Feature matrix rank]]></title><description><![CDATA[The rank of a feature matrix in the context of <a data-href="neural networks" href="online-vault/ml-concepts/neural-networks.html" class="internal-link" target="_self" rel="noopener nofollow">neural networks</a> refers to the maximum number of linearly independent rows or columns in the matrix. It is a measure of the dimensionality of the space spanned by the features and indicates how much information is being captured and processed by the network at each layer.<br>the paper <a data-tooltip-position="top" aria-label="https://www.semanticscholar.org/reader/cfb94b0a74b7245920382aa1a9a8fdb12ac33191" rel="noopener nofollow" class="external-link is-unresolved" href="https://www.semanticscholar.org/reader/cfb94b0a74b7245920382aa1a9a8fdb12ac33191" target="_self">Deep Grokking: Would Deep Neural Networks Generalize Better?</a> suggests that the decreasing of feature ranks and the transition to <a data-href="Grokking" href="online-vault/ml-concepts/grokking.html" class="internal-link" target="_self" rel="noopener nofollow">Grokking</a> are linked. There is a <a data-href="double descent" href=".html" class="internal-link" target="_self" rel="noopener nofollow">double descent</a> phenomenon in the feature ranks during grokking, suggesting it indicates the model's capacity for generalisation.In more detail :
"We also estimate the layer-wise numerical rank of feature representations to reflect to what extent the learnt internal features can be compressed. Specifically, we compute the singular values of the covariance matrix of the output activation W ∈ R^(n×d) from an internal layer l. We then measure its numerical rank as the number of singular values above a certain threshold given a relative tolerance ε: σ_max · max(n, d) · ε"<br>The rank of a matrix can be determined using various methods, including:
<br>
<a data-href="Singular Value Decomposition" href=".html" class="internal-link" target="_self" rel="noopener nofollow">Singular Value Decomposition</a> (SVD): This is a common method where the matrix is decomposed into three other matrices. The number of non-zero singular values in the diagonal matrix gives the rank of the original matrix.<br>Row or Column Reduction: By performing Gaussian elimination or other row/column operations, you can transform the matrix into a form where the rank can be directly observed (e.g., row echelon form).<br>Determinant of Submatrices: For smaller matrices, the rank can be determined by finding the largest square submatrix with a non-zero determinant.<br>The rank of a feature matrix is closely related to the complexity of the representations learned by a neural network:<br>High Rank: A high-rank feature matrix indicates that the network is capturing a large number of independent features or patterns in the data. This can be beneficial for learning complex representations but may also increase the risk of overfitting, especially if the data is not sufficiently diverse.<br>Low Rank: A low-rank feature matrix suggests that the network is capturing fewer underlying factors. This can be a sign of simpler, more generalizable representations, but if the rank is too low, the network might not be capturing enough information to make accurate predictions.<br>Rank and Generalization: In the context of the grokking phenomenon, a decrease in feature rank is observed as the network transitions from overfitting to generalization. This implies that the network is learning to represent the data more efficiently, using fewer but more meaningful features
. Understanding the rank of feature matrices can provide insights into how neural networks process and represent information, helping to diagnose issues related to overfitting, underfitting, and generalization.]]></description><link>online-vault/ml-concepts/feature-matrix-rank.html</link><guid isPermaLink="false">Online Vault/ML concepts/Feature matrix rank.md</guid><pubDate>Thu, 24 Apr 2025 14:43:43 GMT</pubDate></item><item><title><![CDATA[dimensionality reduction]]></title><link>online-vault/ml-concepts/dimensionality-reduction.html</link><guid isPermaLink="false">Online Vault/ML concepts/dimensionality reduction.md</guid><pubDate>Thu, 24 Apr 2025 14:42:42 GMT</pubDate></item><item><title><![CDATA[tunnel_effect]]></title><description><![CDATA[<img src="online-vault/ml-concepts/tunnel_effect.png" target="_self">]]></description><link>online-vault/ml-concepts/tunnel_effect.html</link><guid isPermaLink="false">Online Vault/ML concepts/tunnel_effect.png</guid><pubDate>Thu, 24 Apr 2025 14:11:43 GMT</pubDate><enclosure url="." length="0" type="false"/><content:encoded>&lt;figure&gt;&lt;img src=&quot;.&quot;&gt;&lt;/figure&gt;</content:encoded></item><item><title><![CDATA[Linear Probe]]></title><description><![CDATA[Linear probing in the context of deep <a data-href="neural networks" href="online-vault/ml-concepts/neural-networks.html" class="internal-link" target="_self" rel="noopener nofollow">neural networks</a> refers to a technique used to analyze and interpret the representations learned by these networks. <br><img alt="linear_probing_example.png" src="online-vault/ml-concepts/linear_probing_example.png" target="_self">
It involves training a simple linear classifier, known as a probe, on top of the frozen representations (features) extracted from a pre-trained neural network. The goal is to assess how well these representations capture certain properties or tasks without further fine-tuning the original model.<br>paper: <a data-tooltip-position="top" aria-label="https://arxiv.org/pdf/1610.01644" rel="noopener nofollow" class="external-link is-unresolved" href="https://arxiv.org/pdf/1610.01644" target="_self">Understanding intermediate layers using linear classifier probes</a><br>Purpose: Linear probing is primarily used to understand what information is encoded in the intermediate layers of a neural network. By training a linear classifier on these representations, researchers can evaluate whether the features learned by the network are useful for a specific task or property.<br>Methodology: The process involves freezing the weights of the pre-trained network and using its outputs (or intermediate layer activations) as input features for a linear classifier. This classifier is then trained to predict a target property or label. The performance of this linear classifier serves as an indicator of the quality and informativeness of the learned representations.<br>Interpretability: Linear probes are preferred over more complex classifiers because they are simpler and more interpretable. If a linear classifier can achieve high performance, it suggests that the representations contain readily accessible information relevant to the task. Conversely, if a linear classifier performs poorly, it may indicate that the representations do not capture the necessary information or that the information is not linearly separable.<br>Limitations: While linear probing provides insights into the representations learned by neural networks, it has limitations. For example, a linear probe might not capture complex, non-linear relationships that could be important for certain tasks. 
Additionally, the performance of the probe can be influenced by factors such as the choice of the probe architecture and the dataset used for training. Overall, linear probing is a valuable tool for analyzing and interpreting deep neural networks, helping researchers understand what these models learn and how they represent information]]></description><link>online-vault/ml-concepts/linear-probe.html</link><guid isPermaLink="false">Online Vault/ML concepts/Linear Probe.md</guid><pubDate>Tue, 22 Apr 2025 13:07:20 GMT</pubDate><enclosure url="." length="0" type="false"/><content:encoded>&lt;figure&gt;&lt;img src=&quot;.&quot;&gt;&lt;/figure&gt;</content:encoded></item><item><title><![CDATA[linear_probing_example]]></title><description><![CDATA[<img src="online-vault/ml-concepts/linear_probing_example.png" target="_self">]]></description><link>online-vault/ml-concepts/linear_probing_example.html</link><guid isPermaLink="false">Online Vault/ML concepts/linear_probing_example.png</guid><pubDate>Thu, 17 Apr 2025 13:16:02 GMT</pubDate><enclosure url="." 
length="0" type="false"/><content:encoded>&lt;figure&gt;&lt;img src=&quot;.&quot;&gt;&lt;/figure&gt;</content:encoded></item><item><title><![CDATA[overfitting]]></title><description><![CDATA[A <a data-href="Machine Learning" href="online-vault/ml-concepts/machine-learning.html" class="internal-link" target="_self" rel="noopener nofollow">Machine Learning</a> model overfits when it learn by memorization all the possible examples in the training set and fits them in an overly complex manner that does not generalize well to the test set or beyond.<br><img alt="overfitting_illustratee.png" src="online-vault/ml-concepts/overfitting_illustratee.png" target="_self"><br>In the particular case of <a data-href="neural networks" href="online-vault/ml-concepts/neural-networks.html" class="internal-link" target="_self" rel="noopener nofollow">neural networks</a>, a phenomenon called <a data-href="Grokking" href="online-vault/ml-concepts/grokking.html" class="internal-link" target="_self" rel="noopener nofollow">Grokking</a> can occur beyond the overfitting step, where the model learns a more compact and sparse solution to the ]]></description><link>online-vault/ml-concepts/overfitting.html</link><guid isPermaLink="false">Online Vault/ML concepts/overfitting.md</guid><pubDate>Tue, 15 Apr 2025 14:26:17 GMT</pubDate><enclosure url="." length="0" type="false"/><content:encoded>&lt;figure&gt;&lt;img src=&quot;.&quot;&gt;&lt;/figure&gt;</content:encoded></item><item><title><![CDATA[overfitting_illustratee]]></title><description><![CDATA[<img src="online-vault/ml-concepts/overfitting_illustratee.png" target="_self">]]></description><link>online-vault/ml-concepts/overfitting_illustratee.html</link><guid isPermaLink="false">Online Vault/ML concepts/overfitting_illustratee.png</guid><pubDate>Tue, 15 Apr 2025 12:42:19 GMT</pubDate><enclosure url="." 
length="0" type="false"/><content:encoded>&lt;figure&gt;&lt;img src=&quot;.&quot;&gt;&lt;/figure&gt;</content:encoded></item><item><title><![CDATA[neural networks]]></title><description><![CDATA[<img alt="Relu_network.png" src="online-vault/images/relu_network.png" target="_self">
A neural network, in essence, is a universal function approximator. <br>In the picture above, the blue function is approximated by a 6-neuron network, each neuron responsible for a segment of the fitted curve, using a <a data-href="Rectified linear unit" href=".html" class="internal-link" target="_self" rel="noopener nofollow">Rectified linear unit</a> function, denoted ReLU, that is detailed in the <a data-href="#Activation functions" href="online-vault/ml-concepts/neural-networks.html#Activation_functions_0" class="internal-link" target="_self" rel="noopener nofollow">Activation functions</a> section.<br>In other words, given a set of samples of data from an unknown <a data-href="Probability Distribution" href="online-vault/images/probability-distribution.html" class="internal-link" target="_self" rel="noopener nofollow">Probability Distribution</a>, we can approximate this unknown distribution given enough samples. This becomes useful in many fields of science, where given measurements, we wish to predict some continuous future behavior (regression) or discrete category (classification).<br>From <a data-href="Statistics" href=".html" class="internal-link" target="_self" rel="noopener nofollow">Statistics</a>, we assume that any measure of a real-world phenomenon is a sample from a <a data-href="Probability Distribution" href="online-vault/images/probability-distribution.html" class="internal-link" target="_self" rel="noopener nofollow">Probability Distribution</a>, so we can consider that neural networks are an Anything-to-Anything model, as long as we can numerically quantify and measure it.<br>There are many methods to achieve exactly that, from the simple linear to polynomial regressions, to <a data-tooltip-position="top" aria-label="decision tree" data-href="decision tree" href="online-vault/ml-concepts/models/decision-tree.html" class="internal-link" target="_self" rel="noopener nofollow">decision trees</a>, <a data-href="Random forest" 
href="online-vault/ml-concepts/models/random-forest.html" class="internal-link" target="_self" rel="noopener nofollow">Random forest</a> and most of <a data-href="Machine Learning" href="online-vault/ml-concepts/machine-learning.html" class="internal-link" target="_self" rel="noopener nofollow">Machine Learning</a>.<br>The <a data-tooltip-position="top" aria-label="https://en.wikipedia.org/wiki/Universal_approximation_theorem" rel="noopener nofollow" class="external-link is-unresolved" href="https://en.wikipedia.org/wiki/Universal_approximation_theorem" target="_self">Universal approximation theorem</a> states that any function can be approximated by an arbitrarily large neural network (<a data-tooltip-position="top" aria-label="https://www.sciencedirect.com/science/article/pii/0893608089900208?via%3Dihub" rel="noopener nofollow" class="external-link is-unresolved" href="https://www.sciencedirect.com/science/article/pii/0893608089900208?via%3Dihub" target="_self">proven</a> in the general sense):<br>Multilayer feedforward networks are universal approximators
This paper rigorously establishes that standard multilayer feedforward networks with as few as one hidden layer using arbitrary squashing functions are capable of approximating any Borel measurable function from one finite dimensional space to another to any desired degree of accuracy, provided sufficiently many hidden units are available. In this sense, multilayer feedforward networks are a class of universal approximators.
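To make the theorem concrete, here is a tiny hand-built sketch (the target function x² and the knot spacing are arbitrary illustrative choices): a one-hidden-layer network with ten ReLU units, whose output weights encode the slope changes of a piecewise-linear interpolant, approximates the curve, and finer knots drive the error down.

```python
def relu(z):
    return max(0.0, z)

f = lambda x: x * x                                    # target function on [0, 1]
knots = [k / 10 for k in range(10)]                    # hinge locations 0.0 .. 0.9
slopes = [(f(k + 0.1) - f(k)) / 0.1 for k in knots]    # chord slope on each piece
# One hidden ReLU per knot; its output weight is the *change* in slope there.
coeffs = [slopes[0]] + [slopes[i] - slopes[i - 1] for i in range(1, 10)]

def network(x):
    """One-hidden-layer ReLU net: f(0) + sum_k c_k * relu(x - x_k)."""
    return f(0.0) + sum(c * relu(x - k) for c, k in zip(coeffs, knots))

worst = max(abs(network(x / 1000) - f(x / 1000)) for x in range(1001))
print(worst)  # max interpolation error, about 2.5e-3
```

Doubling the number of knots quarters the error, which is the "sufficiently many hidden units" clause of the theorem in action.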
In simple terms, a Borel measurable function is like a rule that ensures when you measure something in the output, you can also measure the corresponding part in the input using the same kind of measurement tools (Borel sets). This is important in fields like probability and statistics to ensure that functions behave nicely with the kinds of sets we can measure.
<br>In practice, we limit the functions we fit to differentiable functions, so that we can compute their gradients for the famous <a data-href="Gradient Descent" href=".html" class="internal-link" target="_self" rel="noopener nofollow">Gradient Descent</a> optimization algorithm.<br><img alt="Lego_network.png" src="online-vault/images/lego_network.png" target="_self">
Neural networks are like a logical lego lasagna: modular like lego bricks by design, and stacked layer by layer like a lasagna. When broken down to the modular level, they are pretty simple! To understand more complex architectures, you only need to learn the new lego blocks; the rest remains mostly the same!<br>Hardware optimization
This also allows for hardware optimization, because most of the computation is parallelizable on special hardware such as <a data-href="Graphical Processing Units" href=".html" class="internal-link" target="_self" rel="noopener nofollow">Graphical Processing Units</a>, or GPUs, or even AI-specialized hardware such as TPUs. For example, the activation of each neuron in a layer can be computed independently, and if we train on the data in batches, we can compute the neuron activations for each data point in the batch in parallel.
For example: given a batch of 64 samples and a 64-neuron layer, all 64 × 64 = 4096 neuron activations can be computed in parallel.
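The parallelism described above boils down to a single matrix multiplication, which is exactly the operation GPUs accelerate. A minimal NumPy sketch (the shapes and random weights are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)

batch = rng.normal(size=(64, 10))  # 64 samples with 10 features each
W = rng.normal(size=(10, 64))      # weight matrix of a 64-neuron layer
b = np.zeros(64)                   # one bias per neuron

# A single matrix product computes the pre-activation of every neuron
# for every sample at once; each of the 64 x 64 results is independent
# of the others, which is what makes the computation parallelizable.
activations = np.maximum(0, batch @ W + b)  # ReLU, applied element-wise

print(activations.shape)  # (64, 64)
```

Each row holds one sample's 64 activations; on a GPU the same expression is dispatched as one parallel kernel instead of 4096 sequential operations.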
<br>A perceptron is a fundamental concept in the field of <a data-href="artificial intelligence" href=".html" class="internal-link" target="_self" rel="noopener nofollow">artificial intelligence</a> and <a data-href="machine learning" href="online-vault/ml-concepts/machine-learning.html" class="internal-link" target="_self" rel="noopener nofollow">machine learning</a>, serving as the basic building block of neural networks. It is a simple model of a neuron, the basic unit of the brain: it takes inputs, applies weights, and produces an output based on an activation function. It was a pioneering concept in the field of artificial intelligence, paving the way for more advanced neural networks.<br>
<img alt="Perceptron_neuron.png" src="online-vault/images/perceptron_neuron.png" target="_self">
How it works:
Inputs: These are the data points or features that the perceptron considers.
Weights: Each input has an associated weight, which determines the importance of that input, noted w1, w2, ..., wn here.
Bias: An additional parameter b that helps to fit the model better by shifting the activation function.
Activation Function: This function decides whether the perceptron "fires" (produces an output) based on the weighted sum of inputs. Historically, a step function was used, which outputs 1 if the sum exceeds a threshold and 0 otherwise.
<br>In modern networks, the most commonly used activation is the <a data-href="Rectified linear unit" href=".html" class="internal-link" target="_self" rel="noopener nofollow">Rectified linear unit</a> function, or ReLU:<br>
<img alt="ReLu.png" src="online-vault/images/relu.png" target="_self">
So for a given neuron with inputs x1, ..., xn, weights w1, ..., wn, bias b and a base activation function f (here ReLU, f(z) = max(0, z)),
we have: y = f(w1·x1 + ... + wn·xn + b).<br>Without going into details, there is a whole family of activation functions used in practice depending on the problem, and choosing one is an important design choice when building a neural network.<br><img alt="activation_functions_family.png" src="online-vault/images/activation_functions_family.png" target="_self">The simplest architecture (and the first to exist) is the following, connecting multiple perceptrons in layers, hence the name:<br><img alt="MLP.png" src="online-vault/images/mlp.png" target="_self">A Multi-Layer Perceptron (MLP) is a type of artificial neural network that consists of multiple layers of interconnected nodes, or "neurons," inspired by the structure of the human brain. Structure: Input Layer: The first layer that receives input data. Each node represents a feature of the data.
Hidden Layers: One or more layers between the input and output layers. These layers perform computations and extract features from the input data.
Output Layer: The final layer that produces the output of the network. Each node represents a possible output or class. How It Works: Data flows from the input layer through the hidden layers to the output layer.
Each node in a layer is connected to every node in the next layer.
Each connection has a weight that the network learns during training.
Nodes apply an activation function to the weighted sum of their inputs to introduce non-linearity, allowing the network to learn complex patterns. Learning Process: <br>The MLP learns by adjusting the weights of the connections using the <a data-href="backpropagation algorithm" href=".html" class="internal-link" target="_self" rel="noopener nofollow">backpropagation algorithm</a>.
During training, the network compares its predictions to the actual outputs and updates the weights to minimize the error.
This process is repeated over many iterations to improve the network's accuracy. Applications: MLPs are used for various tasks, including classification, regression, and pattern recognition.
They are versatile and can be applied to problems in image and speech recognition, natural language processing, and more. <br>In summary, a Multi-Layer Perceptron is a neural network with multiple layers of interconnected nodes that learns to recognize patterns in data through a process of weight adjustment driven by the <a data-href="backpropagation algorithm" href=".html" class="internal-link" target="_self" rel="noopener nofollow">backpropagation algorithm</a>. It is a foundational model in the field of deep learning.<br><img alt="how_to_train_nn.png" src="online-vault/images/how_to_train_nn.png" target="_self"><br>The main algorithm behind training neural networks is the <a data-tooltip-position="top" aria-label="backpropagation" data-href="backpropagation" href="online-vault/ml-concepts/backpropagation.html" class="internal-link" target="_self" rel="noopener nofollow">backpropagation algorithm</a>, which will be detailed further in its own article.<br>
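The perceptron and MLP described above can be written in a few lines. A minimal NumPy sketch of the forward pass (the example weights, biases and shapes are illustrative, not from the article):

```python
import numpy as np

def relu(z):
    # Rectified linear unit: max(0, z), applied element-wise
    return np.maximum(0, z)

def perceptron(x, w, b):
    # Weighted sum of the inputs plus the bias, passed through the activation
    return relu(np.dot(w, x) + b)

def mlp_forward(x, layers):
    # Each layer is a (weights, biases) pair; the output of one layer
    # is the input of the next, stacked like a lasagna.
    for W, b in layers:
        x = relu(W @ x + b)
    return x

# A single neuron with two inputs: relu(0.5*1.0 - 1.0*2.0 + 0.25) = relu(-1.25)
x = np.array([1.0, 2.0])
print(perceptron(x, np.array([0.5, -1.0]), 0.25))  # 0.0
```

Training (choosing the weights) is what backpropagation handles, as discussed next; the forward pass itself is just this chain of weighted sums and activations.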
As a video is worth a thousand words, the best visual intuition for backpropagation comes from 3Blue1Brown: <a data-tooltip-position="top" aria-label="https://www.youtube.com/watch?v=Ilg3gGewQ5U" rel="noopener nofollow" class="external-link is-unresolved" href="https://www.youtube.com/watch?v=Ilg3gGewQ5U" target="_self">Backpropagation, intuitively | DL3</a>. In short, the backpropagation algorithm is a fundamental method used to train artificial neural networks, including Multi-Layer Perceptrons (MLPs). <br>It works by minimizing the error between the network's predicted outputs and the actual target values through a process called <a data-href="Gradient Descent" href=".html" class="internal-link" target="_self" rel="noopener nofollow">Gradient Descent</a>. Gradient descent belongs to a class of algorithms dedicated to solving optimization problems, where we seek the minimum of a given function as quickly as possible. Two great videos covering this topic to build intuition:<br>
<a data-tooltip-position="top" aria-label="https://www.youtube.com/watch?v=TkwXa7Cvfr8" rel="noopener nofollow" class="external-link is-unresolved" href="https://www.youtube.com/watch?v=TkwXa7Cvfr8" target="_self">Watching Neural Networks Learn</a><br>
<a data-tooltip-position="top" aria-label="https://www.youtube.com/watch?v=Anc2_mnb3V8" rel="noopener nofollow" class="external-link is-unresolved" href="https://www.youtube.com/watch?v=Anc2_mnb3V8" target="_self">Gradient Descent vs Evolution | How Neural Networks Learn</a><br>During training, data is fed forward through the network to generate predictions. The error is then calculated using a <a data-href="loss function" href=".html" class="internal-link" target="_self" rel="noopener nofollow">loss function</a>, which compares the prediction to the provided ground truth, and this error is propagated backward through the network. It is this error, or loss, that is optimized through Gradient Descent.<br>The algorithm computes the gradient of the loss function with respect to each weight by applying the <a data-href="chain rule" href=".html" class="internal-link" target="_self" rel="noopener nofollow">chain rule</a> of calculus, determining how much each weight contributed to the error. The weights are then updated in the opposite direction of the gradient to reduce the error, iteratively improving the network's performance. This process is repeated over many epochs (full passes over the dataset), allowing the network to learn and refine its internal representations of the data. Voilà! You now have a fully trained neural network! 😎 There are many more steps to building a model fit for a complex problem that lie outside the model itself, such as:<br>
<a data-href="hyperparameter tuning" href=".html" class="internal-link" target="_self" rel="noopener nofollow">hyperparameter tuning</a><br>
<a data-href="Neural network architecture design" href=".html" class="internal-link" target="_self" rel="noopener nofollow">Neural network architecture design</a> such as <a data-href="Multi-task Learning" href="online-vault/ml-concepts/architecture-design/multi-task-learning.html" class="internal-link" target="_self" rel="noopener nofollow">Multi-task Learning</a> or <a data-href="Multi-output Regression Neural Network" href="online-vault/ml-concepts/architecture-design/multi-output-regression-neural-network.html" class="internal-link" target="_self" rel="noopener nofollow">Multi-output Regression Neural Network</a><br>
<a data-href="Hardware optimization" href=".html" class="internal-link" target="_self" rel="noopener nofollow">Hardware optimization</a> with <a data-href="GPU for local ML workflow" href="hardware/gpu-for-local-ml-workflow.html" class="internal-link" target="_self" rel="noopener nofollow">GPU for local ML workflow</a><br>and language-specific tasks such as learning Python and its machine learning libraries like <a data-href="Tensorflow" href=".html" class="internal-link" target="_self" rel="noopener nofollow">Tensorflow</a>, <a data-href="Keras" href="online-vault/software-engineering/keras.html" class="internal-link" target="_self" rel="noopener nofollow">Keras</a>, <a data-href="Parallelism in python" href="online-vault/software-engineering/parallelism-in-python.html" class="internal-link" target="_self" rel="noopener nofollow">Parallelism in python</a>, <a data-href="Running jupyter or IDE on WSL2" href="online-vault/tutoriels/running-jupyter-or-ide-on-wsl2.html" class="internal-link" target="_self" rel="noopener nofollow">Running jupyter or IDE on WSL2</a>, or installing Linux for GPU usage using <a data-href="WSL2 Ubuntu 22.04+Windows 10" href="online-vault/tutoriels/wsl2-ubuntu-22.04+windows-10.html" class="internal-link" target="_self" rel="noopener nofollow">WSL2 Ubuntu 22.04+Windows 10</a>.]]></description><link>online-vault/ml-concepts/neural-networks.html</link><guid isPermaLink="false">Online Vault/ML concepts/neural networks.md</guid><pubDate>Fri, 11 Apr 2025 12:06:57 GMT</pubDate><enclosure url="." length="0" type="false"/><content:encoded>&lt;figure&gt;&lt;img src=&quot;.&quot;&gt;&lt;/figure&gt;</content:encoded></item><item><title><![CDATA[index]]></title><description><![CDATA[<img alt="digital_garden.jpg" src="online-vault/images/digital_garden.jpg" target="_self">
Welcome to this personal digital garden. The concept is a collection of interconnected ideas that can be navigated using the Obsidian-generated graph.<br>Here are some notes you can start reading:<br>
<a data-href="Installer une LLM en local pour un humain local" href="online-vault/tutoriels/installer-une-llm-en-local-pour-un-humain-local.html" class="internal-link" target="_self" rel="noopener nofollow">Installer une LLM en local pour un humain local</a><br>
<a data-href="RAG simple, local et Open-Source avec GPT4All" href="online-vault/tutoriels/rag-simple,-local-et-open-source-avec-gpt4all.html" class="internal-link" target="_self" rel="noopener nofollow">RAG simple, local et Open-Source avec GPT4All</a><br>
<a data-href="Installer DeepSeek R1 Distill en local" href="online-vault/tutoriels/installer-deepseek-r1-distill-en-local.html" class="internal-link" target="_self" rel="noopener nofollow">Installer DeepSeek R1 Distill en local</a><br>
<a data-href="Guide - Utiliser Mistral" href="online-vault/tutoriels/guide-utiliser-mistral.html" class="internal-link" target="_self" rel="noopener nofollow">Guide - Utiliser Mistral</a><br>There are some visual mind maps:<br>
<a data-tooltip-position="top" aria-label="Ecosystème EVS.canvas" data-href="Ecosystème EVS.canvas" href="online-vault/mind-maps/ecosystème-evs.html" class="internal-link" target="_self" rel="noopener nofollow">Ecosystème EVS</a><br>And some fundamentals of Deep Learning or model memos:<br>
<a data-href="neural networks" href="online-vault/ml-concepts/neural-networks.html" class="internal-link" target="_self" rel="noopener nofollow">neural networks</a><br>
<a data-href="Latent spaces" href="online-vault/ml-concepts/latent-spaces.html" class="internal-link" target="_self" rel="noopener nofollow">Latent spaces</a><br>
<a data-href="Principal Component Analysis" href="online-vault/ml-concepts/dimensionality-reduction/principal-component-analysis.html" class="internal-link" target="_self" rel="noopener nofollow">Principal Component Analysis</a><br>
<a data-href="Gradient Boosting" href="online-vault/ml-concepts/models/gradient-boosting.html" class="internal-link" target="_self" rel="noopener nofollow">Gradient Boosting</a><br>
<a data-href="Regularization" href="online-vault/ml-concepts/regularization.html" class="internal-link" target="_self" rel="noopener nofollow">Regularization</a>]]></description><link>online-vault/index.html</link><guid isPermaLink="false">Online Vault/index.md</guid><pubDate>Fri, 11 Apr 2025 10:21:56 GMT</pubDate><enclosure url="." length="0" type="false"/><content:encoded>&lt;figure&gt;&lt;img src=&quot;.&quot;&gt;&lt;/figure&gt;</content:encoded></item><item><title><![CDATA[bias_variance_curve]]></title><description><![CDATA[<img src="online-vault/images/bias_variance_curve.png" target="_self">]]></description><link>online-vault/images/bias_variance_curve.html</link><guid isPermaLink="false">Online Vault/Images/bias_variance_curve.png</guid><pubDate>Tue, 08 Apr 2025 12:54:32 GMT</pubDate><enclosure url="." length="0" type="false"/><content:encoded>&lt;figure&gt;&lt;img src=&quot;.&quot;&gt;&lt;/figure&gt;</content:encoded></item><item><title><![CDATA[bias_variance_tradeoff]]></title><description><![CDATA[<img src="online-vault/images/bias_variance_tradeoff.png" target="_self">]]></description><link>online-vault/images/bias_variance_tradeoff.html</link><guid isPermaLink="false">Online Vault/Images/bias_variance_tradeoff.png</guid><pubDate>Tue, 08 Apr 2025 12:52:38 GMT</pubDate><enclosure url="." 
length="0" type="false"/><content:encoded>&lt;figure&gt;&lt;img src=&quot;.&quot;&gt;&lt;/figure&gt;</content:encoded></item><item><title><![CDATA[Central Limit Theorem]]></title><description><![CDATA[In&nbsp;<a data-tooltip-position="top" aria-label="https://en.wikipedia.org/wiki/Probability_theory" rel="noopener nofollow" class="external-link is-unresolved" title="Probability theory" href="https://en.wikipedia.org/wiki/Probability_theory" target="_self">probability theory</a>, the&nbsp;central limit theorem&nbsp;(CLT) states that, under appropriate conditions, the&nbsp;<a data-tooltip-position="top" aria-label="https://en.wikipedia.org/wiki/Probability_distribution" rel="noopener nofollow" class="external-link is-unresolved" title="Probability distribution" href="https://en.wikipedia.org/wiki/Probability_distribution" target="_self">distribution</a>&nbsp;of a normalized version of the sample mean converges to a&nbsp;<a data-tooltip-position="top" aria-label="https://en.wikipedia.org/wiki/Normal_distribution#Standard_normal_distribution" rel="noopener nofollow" class="external-link is-unresolved" title="Normal distribution" href="https://en.wikipedia.org/wiki/Normal_distribution#Standard_normal_distribution" target="_self">standard normal distribution</a>. This holds even if the original variables themselves are not&nbsp;<a data-tooltip-position="top" aria-label="https://en.wikipedia.org/wiki/Normal_distribution" rel="noopener nofollow" class="external-link is-unresolved" title="Normal distribution" href="https://en.wikipedia.org/wiki/Normal_distribution" target="_self">normally distributed</a>. 
There are several versions of the CLT, each applying in the context of different conditions.The theorem is a key concept in probability theory because it implies that probabilistic and statistical methods that work for normal distributions can be applicable to many problems involving other types of distributions.<br>In other words, suppose that a large sample of&nbsp;<a data-tooltip-position="top" aria-label="https://en.wikipedia.org/wiki/Random_variate" rel="noopener nofollow" class="external-link is-unresolved" title="Random variate" href="https://en.wikipedia.org/wiki/Random_variate" target="_self">observations</a>&nbsp;is obtained, each observation being randomly produced in a way that does not depend on the values of the other observations, and the average (<a data-tooltip-position="top" aria-label="https://en.wikipedia.org/wiki/Arithmetic_mean" rel="noopener nofollow" class="external-link is-unresolved" title="Arithmetic mean" href="https://en.wikipedia.org/wiki/Arithmetic_mean" target="_self">arithmetic mean</a>) of the observed values is computed. If this procedure is performed many times, resulting in a collection of observed averages, the central limit theorem says that if the sample size is large enough, the&nbsp;<a data-tooltip-position="top" aria-label="https://en.wikipedia.org/wiki/Probability_distribution" rel="noopener nofollow" class="external-link is-unresolved" title="Probability distribution" href="https://en.wikipedia.org/wiki/Probability_distribution" target="_self">probability distribution</a>&nbsp;of these averages will closely approximate a normal distribution.<br>The central limit theorem has several variants. 
In its common form, the random variables must be&nbsp;<a data-tooltip-position="top" aria-label="https://en.wikipedia.org/wiki/Independent_and_identically_distributed" rel="noopener nofollow" class="external-link is-unresolved" title="Independent and identically distributed" href="https://en.wikipedia.org/wiki/Independent_and_identically_distributed" target="_self">independent and identically distributed</a>&nbsp;(i.i.d.). This requirement can be weakened; convergence of the mean to the normal distribution also occurs for non-identical distributions or for non-independent observations if they comply with certain conditions.<br><img alt="Central limit theorem.png" src="online-vault/images/central-limit-theorem.png" target="_self"><br>
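The averaging procedure described above is easy to check numerically. A small simulation with a fair die (the sample counts are arbitrary choices, not from the article):

```python
import numpy as np

rng = np.random.default_rng(42)

# Roll a fair six-sided die 1000 times per sample, for 10,000 samples,
# then average each sample. The die itself is uniform, not normal.
rolls = rng.integers(1, 7, size=(10_000, 1_000))
sample_means = rolls.mean(axis=1)

# The CLT predicts the means cluster around the die's expected value 3.5,
# with standard deviation sqrt(35/12) / sqrt(1000), roughly 0.054, and an
# approximately bell-shaped histogram.
print(round(sample_means.mean(), 2), round(sample_means.std(), 3))
```

Plotting a histogram of `sample_means` reproduces the bell curve in the figure, even though each individual roll is far from normally distributed.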
<img alt="galton_board.gif" src="galton_board.gif" target="_self">]]></description><link>online-vault/ml-concepts/central-limit-theorem.html</link><guid isPermaLink="false">Online Vault/ML concepts/Central Limit Theorem.md</guid><pubDate>Tue, 08 Apr 2025 08:31:43 GMT</pubDate><enclosure url="." length="0" type="false"/><content:encoded>&lt;figure&gt;&lt;img src=&quot;.&quot;&gt;&lt;/figure&gt;</content:encoded></item><item><title><![CDATA[architecture Design]]></title><link>online-vault/ml-concepts/architecture-design.html</link><guid isPermaLink="false">Online Vault/ML concepts/architecture Design.md</guid><pubDate>Fri, 04 Apr 2025 15:25:25 GMT</pubDate></item><item><title><![CDATA[backpropagation]]></title><link>online-vault/ml-concepts/backpropagation.html</link><guid isPermaLink="false">Online Vault/ML concepts/backpropagation.md</guid><pubDate>Fri, 04 Apr 2025 15:00:59 GMT</pubDate></item><item><title><![CDATA[how_to_train_nn]]></title><description><![CDATA[<img src="online-vault/images/how_to_train_nn.png" target="_self">]]></description><link>online-vault/images/how_to_train_nn.html</link><guid isPermaLink="false">Online Vault/Images/how_to_train_nn.png</guid><pubDate>Fri, 04 Apr 2025 14:48:42 GMT</pubDate><enclosure url="." length="0" type="false"/><content:encoded>&lt;figure&gt;&lt;img src=&quot;.&quot;&gt;&lt;/figure&gt;</content:encoded></item><item><title><![CDATA[Lego_network]]></title><description><![CDATA[<img src="online-vault/images/lego_network.png" target="_self">]]></description><link>online-vault/images/lego_network.html</link><guid isPermaLink="false">Online Vault/Images/Lego_network.png</guid><pubDate>Fri, 04 Apr 2025 14:21:40 GMT</pubDate><enclosure url="." 
length="0" type="false"/><content:encoded>&lt;figure&gt;&lt;img src=&quot;.&quot;&gt;&lt;/figure&gt;</content:encoded></item><item><title><![CDATA[activation_functions_family]]></title><description><![CDATA[<img src="online-vault/images/activation_functions_family.png" target="_self">]]></description><link>online-vault/images/activation_functions_family.html</link><guid isPermaLink="false">Online Vault/Images/activation_functions_family.png</guid><pubDate>Fri, 04 Apr 2025 14:09:48 GMT</pubDate><enclosure url="." length="0" type="false"/><content:encoded>&lt;figure&gt;&lt;img src=&quot;.&quot;&gt;&lt;/figure&gt;</content:encoded></item><item><title><![CDATA[ReLu]]></title><description><![CDATA[<img src="online-vault/images/relu.png" target="_self">]]></description><link>online-vault/images/relu.html</link><guid isPermaLink="false">Online Vault/Images/ReLu.png</guid><pubDate>Fri, 04 Apr 2025 14:08:43 GMT</pubDate><enclosure url="." length="0" type="false"/><content:encoded>&lt;figure&gt;&lt;img src=&quot;.&quot;&gt;&lt;/figure&gt;</content:encoded></item><item><title><![CDATA[Perceptron_neuron]]></title><description><![CDATA[<img src="online-vault/images/perceptron_neuron.png" target="_self">]]></description><link>online-vault/images/perceptron_neuron.html</link><guid isPermaLink="false">Online Vault/Images/Perceptron_neuron.png</guid><pubDate>Fri, 04 Apr 2025 14:02:38 GMT</pubDate><enclosure url="." length="0" type="false"/><content:encoded>&lt;figure&gt;&lt;img src=&quot;.&quot;&gt;&lt;/figure&gt;</content:encoded></item><item><title><![CDATA[Central limit theorem]]></title><description><![CDATA[<img src="online-vault/images/central-limit-theorem.png" target="_self">]]></description><link>online-vault/images/central-limit-theorem.html</link><guid isPermaLink="false">Online Vault/Images/Central limit theorem.png</guid><pubDate>Fri, 04 Apr 2025 13:53:01 GMT</pubDate><enclosure url="." 
length="0" type="false"/><content:encoded>&lt;figure&gt;&lt;img src=&quot;.&quot;&gt;&lt;/figure&gt;</content:encoded></item><item><title><![CDATA[Probability Distribution]]></title><description><![CDATA[A probability distribution is a mathematical function that describes the likelihood of different outcomes in a random experiment or situation. It tells us how probable each possible outcome is, providing a way to understand and predict uncertain events. Random Variable: This is a variable whose possible values are outcomes of a random phenomenon. For example, the number of heads in a coin toss or the height of a randomly chosen person. Probability Distribution: This is a description of the probabilities of all possible values that a random variable can take. It can be visualized as a graph or a table showing the likelihood of each outcome. Types of Distributions: Discrete Distribution: Used for random variables that can take on specific, separate values (e.g., rolling a die).
Continuous Distribution: Used for random variables that can take any value within a range (e.g., measuring height).
<img alt="Probability distributions.png" src="online-vault/images/probability-distributions.png" target="_self"> Rolling a Die (Discrete Distribution): Random Variable: The number shown on the die.
Possible Outcomes: 1, 2, 3, 4, 5, 6.
Probability Distribution: Each number has an equal probability of 1/6. This is an example of a uniform distribution because all outcomes are equally likely. Heights of Adults (Continuous Distribution): Random Variable: The height of a randomly selected adult.
Possible Outcomes: Any value within a reasonable range (e.g., 4 feet to 7 feet).
Probability Distribution: Often modeled using a normal distribution, where most heights cluster around an average (mean), with fewer people being very short or very tall. Understanding probability distributions helps in making informed decisions under uncertainty. It's used in fields like finance to model stock prices, in insurance to assess risk, and in science to analyze experimental data. By knowing the distribution, we can predict outcomes, assess risks, and make better decisions.In summary, a probability distribution is a powerful tool that helps us quantify and understand the likelihood of different outcomes in a random situation, providing insights and aiding in decision-making.]]></description><link>online-vault/images/probability-distribution.html</link><guid isPermaLink="false">Online Vault/Images/Probability Distribution.md</guid><pubDate>Fri, 04 Apr 2025 13:52:23 GMT</pubDate><enclosure url="." length="0" type="false"/><content:encoded>&lt;figure&gt;&lt;img src=&quot;.&quot;&gt;&lt;/figure&gt;</content:encoded></item><item><title><![CDATA[Probability distributions]]></title><description><![CDATA[<img src="online-vault/images/probability-distributions.png" target="_self">]]></description><link>online-vault/images/probability-distributions.html</link><guid isPermaLink="false">Online Vault/Images/Probability distributions.png</guid><pubDate>Fri, 04 Apr 2025 13:49:25 GMT</pubDate><enclosure url="." length="0" type="false"/><content:encoded>&lt;figure&gt;&lt;img src=&quot;.&quot;&gt;&lt;/figure&gt;</content:encoded></item><item><title><![CDATA[MLP]]></title><description><![CDATA[<img src="online-vault/images/mlp.png" target="_self">]]></description><link>online-vault/images/mlp.html</link><guid isPermaLink="false">Online Vault/Images/MLP.png</guid><pubDate>Fri, 04 Apr 2025 13:38:51 GMT</pubDate><enclosure url="." 
length="0" type="false"/><content:encoded>&lt;figure&gt;&lt;img src=&quot;.&quot;&gt;&lt;/figure&gt;</content:encoded></item><item><title><![CDATA[Relu_network]]></title><description><![CDATA[<img src="online-vault/images/relu_network.png" target="_self">]]></description><link>online-vault/images/relu_network.html</link><guid isPermaLink="false">Online Vault/Images/Relu_network.png</guid><pubDate>Fri, 04 Apr 2025 13:31:41 GMT</pubDate><enclosure url="." length="0" type="false"/><content:encoded>&lt;figure&gt;&lt;img src=&quot;.&quot;&gt;&lt;/figure&gt;</content:encoded></item><item><title><![CDATA[Parallelism in python]]></title><description><![CDATA[When discussing parallel processing in Python, the terms "CPU-bound" and "I/O-bound" are often used to describe the nature of the tasks being performed. Understanding these terms is crucial for choosing the right parallel processing strategy. Definition: CPU-bound tasks are those that spend most of their time performing computations on the CPU. Examples include numerical computations, data processing, and complex algorithmic operations. Characteristics: The performance of these tasks is limited by the speed of the CPU.
They typically involve intensive calculations and require significant processing power. Parallel Processing: For CPU-bound tasks, true parallelism is often necessary to achieve performance gains. This means using multiple CPU cores to perform computations simultaneously.
In Python, due to the Global Interpreter Lock (GIL), threads from the threading module or ThreadPoolExecutor do not achieve true parallelism for CPU-bound tasks. The GIL allows only one thread to execute Python bytecode at a time, which can be a bottleneck for CPU-bound tasks. Suitable Tools: ProcessPoolExecutor: Uses separate processes, bypassing the GIL and allowing true parallelism.
multiprocessing: Similar to ProcessPoolExecutor, it uses separate processes for parallel execution.
Libraries like Dask or Ray: Designed to handle parallel and distributed computing, including CPU-bound tasks. Definition: I/O-bound tasks are those that spend most of their time waiting for input/output operations, such as reading from or writing to disk, network communication, or user input. Characteristics: The performance of these tasks is limited by the speed of I/O operations rather than CPU speed.
They often involve waiting for external resources, which can lead to idle CPU time. Parallel Processing: For I/O-bound tasks, threading can be an effective way to achieve concurrency. Since these tasks spend a lot of time waiting, multiple threads can make progress while others are waiting.
The GIL is less of an issue for I/O-bound tasks because the threads spend much of their time waiting for I/O operations to complete, rather than executing Python bytecode. Suitable Tools: ThreadPoolExecutor: Efficient for I/O-bound tasks due to lower overhead compared to process-based parallelism.
asyncio: Allows for asynchronous I/O operations, which can be more efficient than threading for some I/O-bound tasks. In summary: CPU-bound tasks require true parallelism, which is best achieved using process-based parallelism to bypass the GIL.
I/O-bound tasks can benefit from threading or asynchronous I/O, as they spend much of their time waiting for external resources.
Understanding whether your task is CPU-bound or I/O-bound is essential for choosing the right parallel processing strategy and achieving optimal performance.]]></description><link>online-vault/software-engineering/parallelism-in-python.html</link><guid isPermaLink="false">Online Vault/Software Engineering/Parallelism in python.md</guid><pubDate>Tue, 01 Apr 2025 08:42:17 GMT</pubDate></item><item><title><![CDATA[local_complexity]]></title><description><![CDATA[<img src="online-vault/papers/local_complexity.png" target="_self">]]></description><link>online-vault/papers/local_complexity.html</link><guid isPermaLink="false">Online Vault/Papers/local_complexity.png</guid><pubDate>Fri, 28 Mar 2025 15:49:38 GMT</pubDate><enclosure url="." length="0" type="false"/><content:encoded>&lt;figure&gt;&lt;img src=&quot;.&quot;&gt;&lt;/figure&gt;</content:encoded></item><item><title><![CDATA[Spline theory]]></title><description><![CDATA[<img src="online-vault/papers/spline-theory.png" target="_self">]]></description><link>online-vault/papers/spline-theory.html</link><guid isPermaLink="false">Online Vault/Papers/Spline theory.png</guid><pubDate>Fri, 28 Mar 2025 14:10:06 GMT</pubDate><enclosure url="." length="0" type="false"/><content:encoded>&lt;figure&gt;&lt;img src=&quot;.&quot;&gt;&lt;/figure&gt;</content:encoded></item><item><title><![CDATA[grokking]]></title><description><![CDATA[<img src="online-vault/papers/grokking.png" target="_self">]]></description><link>online-vault/papers/grokking.html</link><guid isPermaLink="false">Online Vault/Papers/grokking.png</guid><pubDate>Fri, 28 Mar 2025 14:06:56 GMT</pubDate><enclosure url="." 
length="0" type="false"/><content:encoded>&lt;figure&gt;&lt;img src=&quot;.&quot;&gt;&lt;/figure&gt;</content:encoded></item><item><title><![CDATA[Data workflow]]></title><description><![CDATA[<img alt="inference_workflow.png" src="online-vault/software-engineering/inference_workflow.png" target="_self">When dealing with a dataset that fits into CPU RAM but not into GPU VRAM, the goal is to efficiently manage data transfer between CPU and GPU to maximize GPU utilization. Here's a high-level overview of the workflow: Data Preparation: Load the dataset into CPU RAM (entire dataset or batch subset).
Preprocess the data (e.g., normalization, augmentation) on the CPU. Batch Processing: Split the dataset into mini-batches that fit into GPU VRAM.
Use a data loader to handle batching and shuffling. Data Transfer: Transfer mini-batches from CPU to GPU memory as needed.
Ensure that data transfer is efficient and does not become a bottleneck. Model Inference: Perform inference on the GPU using the transferred mini-batches.
Collect the results and transfer them back to CPU memory if needed. Post-Processing: Aggregate the results from all mini-batches on the CPU.
Perform any additional post-processing or analysis. Here's a high-level implementation using Keras for inference with a Fully Connected Network (FCN):

import numpy as np
import tensorflow as tf
from tensorflow.keras.models import load_model

# Load the entire dataset into CPU RAM
data = np.load('path_to_data.npy')  # Example: replace with actual data loading

# Preprocess the data on the CPU (example: normalization)
data = data / 255.0

# Define a batch size that fits into GPU VRAM
batch_size = 64  # Adjust based on your GPU VRAM

# Load the pre-trained model
model = load_model('path_to_model.h5')

# Function to perform inference in batches
def infer_in_batches(data, model, batch_size):
    predictions = []
    num_batches = int(np.ceil(len(data) / batch_size))
    for i in range(num_batches):
        # Get the current batch
        batch_data = data[i * batch_size:(i + 1) * batch_size]
        # Transfer the batch to GPU and perform inference
        batch_predictions = model.predict(batch_data)
        # Collect the results
        predictions.append(batch_predictions)
    # Concatenate all predictions
    predictions = np.concatenate(predictions, axis=0)
    return predictions

# Perform inference
results = infer_in_batches(data, model, batch_size)

# Post-process the results on the CPU if needed (example: save the results)
np.save('path_to_results.npy', results)

Batch Size: Choose a batch size that fits into your GPU VRAM to avoid out-of-memory errors.
Data Transfer: Minimize data transfer between CPU and GPU by processing data in batches.
Efficient Loading: Use efficient data loading and preprocessing techniques to avoid bottlenecks.
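The batching idea behind efficient loading can be sketched with a plain generator (a minimal NumPy-only sketch; batch_generator is an illustrative name, not from the note), so that only one mini-batch is materialized at a time:

```python
import numpy as np

def batch_generator(data, batch_size):
    # Yield successive mini-batches; only one batch is materialized at a time,
    # so the staging buffer for GPU transfer stays small.
    for start in range(0, len(data), batch_size):
        yield data[start:start + batch_size]

data = np.arange(10).reshape(10, 1)
sizes = [len(b) for b in batch_generator(data, 4)]  # batch sizes: [4, 4, 2]
```

Framework loaders (e.g. tf.data with prefetching) layer asynchronous staging on top of this same idea.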
Model Optimization: Consider optimizing the model for inference, such as using mixed precision or quantization, to reduce VRAM usage.
This workflow ensures that you can efficiently utilize your GPU for inference even when the dataset is larger than the available VRAM.]]></description><link>online-vault/software-engineering/data-workflow.html</link><guid isPermaLink="false">Online Vault/Software Engineering/Data workflow.md</guid><pubDate>Fri, 28 Mar 2025 13:34:21 GMT</pubDate><enclosure url="." length="0" type="false"/><content:encoded>&lt;figure&gt;&lt;img src=&quot;.&quot;&gt;&lt;/figure&gt;</content:encoded></item><item><title><![CDATA[Mixed Precision]]></title><description><![CDATA[<a data-tooltip-position="top" aria-label="https://keras.io/api/mixed_precision/" rel="noopener nofollow" class="external-link is-unresolved" href="https://keras.io/api/mixed_precision/" target="_self">Mixed precision</a> training and inference leverages both 16-bit and 32-bit floating-point types to optimize performance on modern GPUs. This technique accelerates computations and reduces memory usage without significantly compromising model quality.
Floating-Point Precision: Traditional deep learning models use 32-bit floating-point precision (FP32) for computations. Mixed precision introduces 16-bit floating-point precision (FP16) to speed up calculations and reduce memory footprint.
Automatic Mixed Precision (AMP): Frameworks like TensorFlow and PyTorch offer AMP, which automatically identifies parts of the model that can use FP16 without losing stability.
Loss Scaling: To prevent underflow in gradients, loss scaling is employed. This involves multiplying the loss by a scale factor before backpropagation and adjusting gradients accordingly. Performance Gains: Mixed precision can significantly accelerate training and inference times, especially on GPUs with Tensor Cores that support FP16 operations.
Memory Efficiency: Reduced memory usage allows for training larger models or using larger batch sizes, which can improve model convergence and stability. Minimal Loss in Accuracy: Despite using lower precision, the loss in model accuracy is often negligible. Techniques like loss scaling help maintain numerical stability.
Empirical Validation: Extensive testing and benchmarks have shown that mixed precision training can achieve comparable results to full FP32 training across various models and tasks.
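As a tiny numeric illustration of the loss-scaling point above (illustrative values, not from the note): a gradient of 1e-8 underflows to zero when cast to FP16, but survives if multiplied by a scale factor before the cast and divided back out in FP32:

```python
import numpy as np

grad = 1e-8                                    # a tiny gradient value
naive = np.float16(grad)                       # underflows to 0.0 in FP16
scale = 1024.0                                 # loss-scaling factor
scaled = np.float16(grad * scale)              # now representable in FP16
recovered = float(np.float32(scaled) / scale)  # unscale in FP32, ~1e-8 again
```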
In Keras 3, you can perform inference using mixed precision on a model that was originally trained with FP32 precision. This involves converting the model to use mixed precision during the inference phase. Here's a step-by-step guide on how to achieve this: Install Necessary Libraries: Ensure you have TensorFlow and Keras installed. Mixed precision is supported in TensorFlow 2.4 and later. Enable Mixed Precision Policy: Set the mixed precision policy to use FP16 for inference. This can be done using TensorFlow's mixed precision API. Load the Model: Load your pre-trained FP32 model. Convert the Model: Convert the model to use mixed precision. This involves casting certain layers to FP16 while keeping others in FP32 for numerical stability. Perform Inference: Use the converted model for inference. Here's a sample code snippet to illustrate these steps:

import tensorflow as tf
from tensorflow.keras.models import load_model

# Step 1: Set the mixed precision policy
tf.keras.mixed_precision.set_global_policy('mixed_float16')

# Step 2: Load your pre-trained FP32 model
model = load_model('path_to_your_model.h5')

# Step 3: Convert the model to use mixed precision
# Keras will automatically handle the conversion when the policy is set

# Step 4: Perform inference
# Assuming `input_data` is your input data for inference
input_data = ...  # Load or prepare your input data
predictions = model.predict(input_data)
print(predictions)

Mixed Precision Policy: The mixed_float16 policy automatically handles the conversion of layers to FP16 where possible, while keeping certain operations in FP32 to maintain numerical stability.
Compatibility: Ensure your hardware (e.g., NVIDIA GPUs with Tensor Cores) supports FP16 operations for optimal performance.
Performance: Mixed precision can significantly speed up inference and reduce memory usage, making it suitable for deployment on resource-constrained environments.
By following these steps, you can leverage mixed precision to enhance the performance of your model during inference in Keras 3.]]></description><link>online-vault/ml-concepts/mixed-precision.html</link><guid isPermaLink="false">Online Vault/ML concepts/Mixed Precision.md</guid><pubDate>Mon, 24 Mar 2025 10:03:53 GMT</pubDate></item><item><title><![CDATA[inference_workflow]]></title><description><![CDATA[<img src="online-vault/software-engineering/inference_workflow.png" target="_self">]]></description><link>online-vault/software-engineering/inference_workflow.html</link><guid isPermaLink="false">Online Vault/Software Engineering/inference_workflow.png</guid><pubDate>Fri, 21 Mar 2025 15:05:52 GMT</pubDate><enclosure url="." length="0" type="false"/><content:encoded>&lt;figure&gt;&lt;img src=&quot;.&quot;&gt;&lt;/figure&gt;</content:encoded></item><item><title><![CDATA[The Bitter Lesson - Rich Sutton]]></title><description><![CDATA[March 13, 2019
<a rel="noopener nofollow" class="external-link is-unresolved" href="https://www.cs.utexas.edu/~eunsol/courses/data/bitter_lesson.pdf" target="_self">https://www.cs.utexas.edu/~eunsol/courses/data/bitter_lesson.pdf</a><br>The text discusses the lessons drawn from 70 years of artificial intelligence (AI) research, emphasizing that general methods that leverage computation are the most effective in the long run. This effectiveness is mainly due to <a data-href="loi de Moore" href=".html" class="internal-link" target="_self" rel="noopener nofollow">Moore's law</a>, which predicts an exponentially decreasing cost per unit of computation. AI researchers have often tried to build human knowledge into their systems, which can improve performance in the short term but ends up limiting progress in the long term. In contrast, methods based on massive search and machine learning, which take advantage of growing computational power, have produced far superior results.<br>Notable examples include computer chess, where methods based on deep search outperformed those based on human understanding of the game. A similar pattern has been observed in computer Go, speech recognition, and <a data-tooltip-position="top" aria-label="Computer Vision" data-href="Computer Vision" href=".html" class="internal-link" target="_self" rel="noopener nofollow">computer vision</a>. In each case, statistical and computational approaches ultimately dominated those based on human knowledge. The bitter lesson is that researchers should avoid building systems around their own understanding of human thought, because doing so complicates the methods and makes them less able to exploit ever-increasing computational power. 
General methods, such as search and learning, that can scale with increasing computation are the most promising. Researchers should focus on creating methods capable of discovering and capturing the arbitrary complexity of the world, rather than building in specific human discoveries.]]></description><link>online-vault/exploration/the-bitter-lesson-rich-sutton.html</link><guid isPermaLink="false">Online Vault/Exploration/The Bitter Lesson - Rich Sutton.md</guid><pubDate>Fri, 07 Mar 2025 14:25:30 GMT</pubDate></item><item><title><![CDATA[Conda solver in 22.11+]]></title><description><![CDATA[This 22.11 release of conda brings a suite of new improvements and features that have been highly anticipated by the community. Conda runs faster now—much faster. Here is an abbreviated list of notable changes:
The faster conda-libmamba-solver is no longer marked as experimental. Conda package downloads are now parallelized. Conda is now plugin ready, with a new architecture. If you have not yet installed conda, you may do so by installing the&nbsp;<a data-tooltip-position="top" aria-label="https://www.anaconda.com/products/distribution" rel="noopener nofollow" class="external-link is-unresolved" href="https://www.anaconda.com/products/distribution" target="_self">Anaconda Distribution</a>&nbsp;or&nbsp;<a data-tooltip-position="top" aria-label="https://conda.io/projects/conda/en/latest/user-guide/install/index.html" rel="noopener nofollow" class="external-link is-unresolved" href="https://conda.io/projects/conda/en/latest/user-guide/install/index.html" target="_self">Miniconda</a>. If you have conda installed already (you can check in your terminal by entering “conda info”), all you need to do is update conda:

conda info
conda update -n base conda

<br>Back in March of this year, the conda team introduced a new experimental solver,&nbsp;<a data-tooltip-position="top" aria-label="https://conda.github.io/conda-libmamba-solver/" rel="noopener nofollow" class="external-link is-unresolved" href="https://conda.github.io/conda-libmamba-solver/" target="_self">conda-libmamba-solver</a>, as part of our collaborative effort with Quansight to make conda faster. Speed improvement has been one of conda’s most highly requested updates, and after almost a year of joint development work, we are incredibly excited to share that the conda-libmamba-solver has now joined the default solver club. To install and set the new solver, run the following commands:

conda install -n base conda-libmamba-solver
conda config --set solver libmamba<br>You may also refer to this comprehensive&nbsp;<a data-tooltip-position="top" aria-label="https://conda.github.io/conda-libmamba-solver/getting-started/" rel="noopener nofollow" class="external-link is-unresolved" href="https://conda.github.io/conda-libmamba-solver/getting-started/" target="_self">Getting Started</a>&nbsp;guide. The classic solver is still available.]]></description><link>online-vault/software-engineering/conda-solver-in-22.11+.html</link><guid isPermaLink="false">Online Vault/Software Engineering/Conda solver in 22.11+.md</guid><pubDate>Fri, 07 Mar 2025 13:49:54 GMT</pubDate></item><item><title><![CDATA[Théorie de l'Information de Shannon]]></title><description><![CDATA[Information Theory, developed by Claude Shannon in the 1940s, is a branch of mathematics and engineering that studies the quantification, storage, and communication of information. It revolutionized our understanding of data transmission and laid the foundations of modern communication technologies. Entropy: Entropy measures the degree of uncertainty or disorder in a set of data. The higher the entropy, the richer and more unpredictable the information contained in the data. Channel Capacity: The capacity of a communication channel is the maximum amount of information that can be transmitted without error per unit of time. Shannon showed that for every channel there is a maximum capacity that depends on the channel's characteristics. Source and Channel Coding: Source coding compresses data to reduce redundancy, while channel coding adds redundancy to correct transmission errors. Sampling is the process of converting a continuous signal into a discrete sequence of values. It is a crucial step in the transmission and processing of digital signals. 
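The entropy just described can be computed directly; here is a short NumPy sketch (illustrative, not from the note) showing that a fair coin carries more information per toss than a biased one:

```python
import numpy as np

def entropy(p):
    # Shannon entropy in bits of a discrete probability distribution.
    p = np.asarray(p, dtype=float)
    p = p[p > 0]  # terms with p = 0 contribute nothing
    return float(-(p * np.log2(p)).sum())

h_fair = entropy([0.5, 0.5])    # 1.0 bit per toss
h_biased = entropy([0.9, 0.1])  # less than 1 bit: outcomes are more predictable
```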
<a data-href="Théorème d'Échantillonnage de Nyquist-Shannon" href="online-vault/ml-concepts/théorème-d'échantillonnage-de-nyquist-shannon.html" class="internal-link" target="_self" rel="noopener nofollow">Théorème d'Échantillonnage de Nyquist-Shannon</a>: This theorem states that a signal can be perfectly reconstructed from its samples if the sampling frequency is at least twice the maximum frequency of the signal (the Nyquist frequency). If this condition is not met, a phenomenon called "aliasing" occurs, which distorts the signal. Practical Applications: Sampling is used in many fields, including telecommunications, audio processing, digital video, and control systems. It makes it possible to convert analog signals into digital signals that can be processed by computers. Take the example of an audio signal. To digitize it, the signal is sampled at a frequency higher than twice its maximum frequency (for example, 44.1 kHz for audio CDs). Each sample represents the amplitude of the signal at a given instant. These samples are then quantized and encoded in digital format. Sampling can also be applied to environmental measurements, such as edaphic acidity (soil pH).&nbsp;The link between the sampling frequency of a signal and the spatial resolution of a 2D signal, such as a map, can be explained in terms of information density and the precision of the captured details. The spatial resolution of a 2D signal, such as a map, refers to the density of data points in space. A higher spatial resolution means that more data points are captured per unit of area, which makes it possible to represent finer details and subtler variations in space. 
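The aliasing effect can be demonstrated numerically (a minimal NumPy sketch with made-up frequencies): sampling a 6 Hz sine at only 8 Hz (below the required 12 Hz) yields exactly the same samples as a 2 Hz sine, so the two signals are indistinguishable after sampling:

```python
import numpy as np

fs = 8.0       # sampling rate (Hz); Nyquist frequency is fs/2 = 4 Hz
f_true = 6.0   # signal frequency above Nyquist -> will alias
t = np.arange(0, 1, 1 / fs)

x = np.sin(2 * np.pi * f_true * t)             # samples of the 6 Hz sine
alias = np.sin(2 * np.pi * (f_true - fs) * t)  # samples of the aliased sine
# x and alias coincide sample-for-sample
```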
Spatial Resolution: If we create a 2D map of a region, the spatial resolution determines how many data points we capture per unit of area. A higher spatial resolution makes it possible to represent fine details such as small hills or ravines. Sample Collection: At each sampling interval (every 250m), we take soil samples at different points in the agricultural region. Each sample is analyzed to determine its pH. Quantization and Encoding: The measured pH values are quantized (for example, rounded to one decimal place) and encoded in digital format to be stored and analyzed by a computer system. Shannon's Information Theory has profoundly influenced the development of modern communication technologies. Sampling, in particular, is a fundamental technique that enables the conversion of analog signals into digital signals, facilitating their transmission and processing. Whether for audio signals or environmental measurements such as edaphic acidity, sampling plays a crucial role in data collection and analysis.]]></description><link>online-vault/ml-concepts/théorie-de-l&apos;information-de-shannon.html</link><guid isPermaLink="false">Online Vault/ML concepts/Théorie de l&apos;Information de Shannon.md</guid><pubDate>Fri, 07 Mar 2025 10:09:05 GMT</pubDate></item><item><title><![CDATA[WSL2 Ubuntu 22.04+Windows 10]]></title><description><![CDATA[Developers can access the power of both Windows and Linux at the same time on a Windows machine. 
The Windows Subsystem for Linux (WSL) lets developers install a Linux distribution (such as Ubuntu, OpenSUSE, Kali, Debian, Arch Linux, etc.) and use Linux applications, utilities, and Bash command-line tools directly on Windows, unmodified, without the overhead of a traditional virtual machine or dual-boot setup. [https://learn.microsoft.com/en-us/windows/wsl/install] You must be running Windows 10 version 2004 or higher (Build 19041 and higher) or Windows 11 to use the commands below. If you are on earlier versions, please see&nbsp;<a data-tooltip-position="top" aria-label="https://learn.microsoft.com/en-us/windows/wsl/install-manual" rel="noopener nofollow" class="external-link is-unresolved" href="https://learn.microsoft.com/en-us/windows/wsl/install-manual" target="_self">the manual install page</a>. You can now install everything you need to run WSL with a single command. Open PowerShell or Windows Command Prompt in&nbsp;user&nbsp;mode (for CarHab), enter the wsl --install command, then restart your machine.
If the install is stuck at 0%, run wsl --update as an administrator first.
By default, the installed Linux distribution will be Ubuntu. This can be changed using the&nbsp;-d&nbsp;flag.
To change the distribution installed, enter:&nbsp;wsl --install -d &lt;Distribution Name&gt;. Replace&nbsp;&lt;Distribution Name&gt;&nbsp;with the name of the distribution you would like to install.
To see a list of available Linux distributions available for download through the online store, enter:&nbsp;wsl --list --online&nbsp;or&nbsp;wsl -l -o.
To install additional Linux distributions after the initial install, you may also use the command:&nbsp;wsl --install -d &lt;Distribution Name&gt;.
1. Turn off generation of&nbsp;/etc/resolv.conf

Using your Linux prompt (I'm using Ubuntu), modify (or create) /etc/wsl.conf with the following content:

[network]
generateResolvConf = false
(Apparently there's a bug in the current release where any trailing whitespace on these lines will trip things up.)

2. Restart the WSL2 Virtual Machine

Exit all of your Linux prompts and run the following PowerShell command:

wsl --shutdown
3. Create a custom&nbsp;/etc/resolv.conf

Open a new Linux prompt and cd to&nbsp;/etc. If&nbsp;resolv.conf&nbsp;is soft-linked to another file, remove the link with:

rm resolv.conf

Create a new&nbsp;resolv.conf&nbsp;with the following content:

nameserver 161.X.X.XX
nameserver 161.X.X.XX
search univ-st-etienne.fr

Replacing X with the correct server addresses.

4. Restart the WSL2 Virtual Machine

Same as step #2.

5. Start a new Linux prompt. Profit!<br><a data-tooltip-position="top" aria-label="https://doc.ubuntu-fr.org/proxy_terminal#environnement_global" rel="noopener nofollow" class="external-link is-unresolved" href="https://doc.ubuntu-fr.org/proxy_terminal#environnement_global" target="_self">Tutorial</a>

The file /etc/profile is read at system startup. Edit /etc/profile and add this line at the end of the file:

export http_proxy=http://"proxy_address":"port_number"

where "proxy_address" and "port_number" are adapted to your situation.

The file /etc/bash.bashrc is read when your terminal starts. Edit /etc/bash.bashrc and add at the end of the file:

export http_proxy=http://"proxy_ip":"port_number"

where "proxy_ip" and "port_number" are adapted to your situation.<br><a data-tooltip-position="top" aria-label="https://doc.ubuntu-fr.org/wget" rel="noopener nofollow" class="external-link is-unresolved" title="wget" href="https://doc.ubuntu-fr.org/wget" target="_self">wget</a>&nbsp;lets various scripts download data. Edit the file /etc/wgetrc, then find and uncomment the following lines (remove the # at the start of the lines):

http_proxy = http://proxy.yoyodyne.com:18023/
use_proxy = on
replacing the proxy parameters with your own.<br><a data-tooltip-position="top" aria-label="https://doc.ubuntu-fr.org/apt" rel="noopener nofollow" class="external-link is-unresolved" title="apt" href="https://doc.ubuntu-fr.org/apt" target="_self">apt</a>&nbsp;is the program that downloads and installs updates. Create a file /etc/apt/apt.conf.d/proxyPerso.conf and add the following line:

Acquire::http::proxy "http://adresse:port/";

replacing adresse with the proxy address and port with the port. To disable the http or https proxy during the working session:

$ unset http_proxy
$ unset https_proxy
or
$ export http_proxy=''
$ export https_proxy=''
Check with $ printenv or $ printenv https_proxy.
To remove the proxy configuration permanently, follow the reverse of the procedure in the previous sections on editing the files, deleting the lines in question.

You can check whether you have the correct Ubuntu version installed on your server with the following command:
lsb_release -a

You should get this output:

No LSB modules are available.
Distributor ID: Ubuntu
Description: Ubuntu 22.04.1 LTS
Release: 22.04
Codename: jammy
Let’s update the package index on our Ubuntu 22.04 system:

apt update

The process is very simple; we need to download the installer script. We have to do this because Miniconda is not available in the default Ubuntu repository. So, let’s get it downloaded:

sudo wget https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh -O /opt/miniconda-installer.sh

The Miniconda installer script has been downloaded and saved as /opt/miniconda-installer.sh. The installation is easy and straightforward; simply execute the installer file:

bash /opt/miniconda-installer.sh

Follow the instructions shown on the screen. Press ENTER to review the license agreement, and keep pressing ENTER or SPACE to scroll through it. At the end of the agreement, you will be asked whether to accept the license terms. Type ‘yes’ to accept and continue. Next, you will be shown the option below.

Miniconda3 will now be installed into this location:
/root/miniconda3 - Press ENTER to confirm the location - Press CTRL-C to abort the installation - Or specify a different location below
Just press ENTER and continue. Once the installation is finished, you will be asked whether the installer should initialize Miniconda:

Do you wish the installer to initialize Miniconda3 by running conda init? [yes|no] [no] &gt;&gt;&gt;
Type ‘yes’, then hit ENTER. You should see output like this:

no change /root/miniconda3/condabin/conda
no change /root/miniconda3/bin/conda
no change /root/miniconda3/bin/conda-env
no change /root/miniconda3/bin/activate
no change /root/miniconda3/bin/deactivate
no change /root/miniconda3/etc/profile.d/conda.sh
no change /root/miniconda3/etc/fish/conf.d/conda.fish
no change /root/miniconda3/shell/condabin/Conda.psm1
no change /root/miniconda3/shell/condabin/conda-hook.ps1
no change /root/miniconda3/lib/python3.11/site-packages/xontrib/conda.xsh
no change /root/miniconda3/etc/profile.d/conda.csh
modified /root/.bashrc ==&gt; For changes to take effect, close and re-open your current shell. &lt;== If you'd prefer that conda's base environment not be activated on startup, set the auto_activate_base parameter to false: conda config --set auto_activate_base false Thank you for installing Miniconda3!
That’s it! You have just installed Miniconda. After the installation, we need to apply the changes made to the ~/.bashrc file (the Miniconda installer modified it during the installation). Let’s execute the command:

source ~/.bashrc

To activate: source /home/brazma/miniconda3/bin/activate

Now, at this point, you can run conda info to check your Miniconda information. You will see this output:

active environment : base
active env location : /root/miniconda3
shell level : 1
user config file : /root/.condarc
populated config files :
conda version : 23.5.2
conda-build version : not installed
python version : 3.11.4.final.0
virtual packages : __archspec=1=x86_64 __glibc=2.35=0 __linux=5.15.0=0 __unix=0=0
base environment : /root/miniconda3 (writable)
conda av data dir : /root/miniconda3/etc/conda
conda av metadata url : None
channel URLs : https://repo.anaconda.com/pkgs/main/linux-64 https://repo.anaconda.com/pkgs/main/noarch https://repo.anaconda.com/pkgs/r/linux-64 https://repo.anaconda.com/pkgs/r/noarch
package cache : /root/miniconda3/pkgs /root/.conda/pkgs
envs directories : /root/miniconda3/envs /root/.conda/envs
platform : linux-64
user-agent : conda/23.5.2 requests/2.29.0 CPython/3.11.4 Linux/5.15.0-76-generic ubuntu/22.04 glibc/2.35
UID:GID : 0:0
netrc file : None
offline mode : False

If you want to update Miniconda, you can run this command: conda update --all

If updates are available, it will show you a list of packages to update, and you need to answer yes to proceed with the update.

(base) root@ubuntu22:~# conda update --all
Collecting package metadata (current_repodata.json): done
Solving environment: done

Package Plan

environment location: /root/miniconda3

The following packages will be downloaded:

package                       | build           | size
------------------------------|-----------------|--------
certifi-2023.7.22             | py311h06a4308_0 | 154 KB
conda-23.7.2                  | py311h06a4308_0 | 1.3 MB
conda-libmamba-solver-23.7.0  | py311h06a4308_0 | 90 KB
conda-package-handling-2.2.0  | py311h06a4308_0 | 278 KB
conda-package-streaming-0.9.0 | py311h06a4308_0 | 33 KB
libcurl-8.1.1                 | h251f7ec_2      | 397 KB
openssl-3.0.10                | h7f8727e_0      | 5.2 MB
pip-23.2.1                    | py311h06a4308_0 | 3.3 MB
pyopenssl-23.2.0              | py311h06a4308_0 | 121 KB
requests-2.31.0               | py311h06a4308_0 | 124 KB
setuptools-68.0.0             | py311h06a4308_0 | 1.2 MB
------------------------------------------------------------
Total: 12.2 MB
Press y, then hit ENTER to proceed.<br>Use the&nbsp;command line&nbsp;below from version&nbsp;<a data-tooltip-position="top" aria-label="https://github.com/conda/conda/pull/5310/commits" rel="noopener nofollow" class="external-link is-unresolved" href="https://github.com/conda/conda/pull/5310/commits" target="_self">4.4.x</a>.conda config --set proxy_servers.http http://cache.univ-st-etienne.fr:port
conda config --set proxy_servers.https http://cache.univ-st-etienne.fr:port
<br>You can then finally install the rapidsai package; check the official website: <a data-tooltip-position="top" aria-label="https://docs.rapids.ai/install?_gl=1*1rbyqg4*_ga*MTgyNjg5MDg3Ny4xNjk3NzA3NjYw*_ga_RKXFW6CM42*MTY5NzcxODU3Mi4yLjAuMTY5NzcxODU3Mi42MC4wLjA.#selector" rel="noopener nofollow" class="external-link is-unresolved" href="https://docs.rapids.ai/install?_gl=1*1rbyqg4*_ga*MTgyNjg5MDg3Ny4xNjk3NzA3NjYw*_ga_RKXFW6CM42*MTY5NzcxODU3Mi4yLjAuMTY5NzcxODU3Mi42MC4wLjA.#selector" target="_self">Installation guide</a>

Install the packages by typing y when required, and the following message should display:

Proceed ([y]/n)? y

Downloading and Extracting Packages:

Preparing transaction: done
Verifying transaction: done
Executing transaction: -
done
To activate this environment, use
#
# $ conda activate rapids-23.10
#
# To deactivate an active environment, use
#
# $ conda deactivate
Then use the command conda activate rapids-23.10 to activate the rapidsai conda environment! Then run python3 to test the installation with the following script:

&gt;&gt;&gt; import cudf
&gt;&gt;&gt; print(cudf.Series([1, 2, 3]))
0 1
1 2
2 3
dtype: int64
<br>Next step is <a data-href="Running jupyter or IDE on WSL2" href="online-vault/tutoriels/running-jupyter-or-ide-on-wsl2.html" class="internal-link" target="_self" rel="noopener nofollow">Running jupyter or IDE on WSL2</a>]]></description><link>online-vault/tutoriels/wsl2-ubuntu-22.04+windows-10.html</link><guid isPermaLink="false">Online Vault/Tutoriels/WSL2 Ubuntu 22.04+Windows 10.md</guid><pubDate>Mon, 24 Feb 2025 10:49:01 GMT</pubDate></item><item><title><![CDATA[ColumnTransformer]]></title><description><![CDATA[Let's dive deeper into the ColumnTransformer and how it handles multi-core processing.ColumnTransformer is a utility provided by scikit-learn to apply different preprocessing or feature extraction pipelines to different subsets of features in a dataset. It allows you to specify which columns should be transformed by which transformers. This is particularly useful when you have a mix of numerical and categorical features that require different preprocessing steps. Transformers: A list of tuples where each tuple contains: A name for the transformer.
The transformer object (e.g., OneHotEncoder, StandardScaler).
The columns to which the transformer should be applied. Remainder: Specifies how to handle columns that are not explicitly mentioned in the transformers list. Options include: 'drop': Drop the remaining columns.
'passthrough': Pass the remaining columns through without any transformation.
A transformer object: Apply the specified transformer to the remaining columns. n_jobs: Specifies the number of jobs to run in parallel. -1 means using all processors.

from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

preprocessor = ColumnTransformer(
    transformers=[
        ('num', 'passthrough', df_variables.columns.difference(nominal_cols)),
        ('cat', OneHotEncoder(drop='first', sparse_output=False), nominal_cols)
    ],
    n_jobs=-1  # Use all available cores
)
combined_array = preprocessor.fit_transform(df_variables)

Importing Necessary Libraries:
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

ColumnTransformer is imported from sklearn.compose. OneHotEncoder is imported from sklearn.preprocessing.

Creating the ColumnTransformer Object:
preprocessor = ColumnTransformer(
    transformers=[
        ('num', 'passthrough', df_variables.columns.difference(nominal_cols)),
        ('cat', OneHotEncoder(drop='first', sparse_output=False), nominal_cols)
    ],
    n_jobs=-1  # Use all available cores
)

Transformers List: ('num', 'passthrough', df_variables.columns.difference(nominal_cols)): Name: 'num'
Transformer: 'passthrough' (This means the columns will be passed through without any transformation.)
Columns: All columns in df_variables except those in nominal_cols. ('cat', OneHotEncoder(drop='first', sparse_output=False), nominal_cols): Name: 'cat'
Transformer: OneHotEncoder(drop='first', sparse_output=False) (This will one-hot encode the categorical columns.)
Columns: The columns specified in nominal_cols. n_jobs=-1: This parameter specifies that all available CPU cores should be used for parallel processing. Fitting and Transforming the Data:
combined_array = preprocessor.fit_transform(df_variables) The fit_transform method is called on the preprocessor object.
This method fits the transformers to the data and then transforms the data.
The result is a combined array where the numerical columns are passed through unchanged, and the categorical columns are one-hot encoded. Parallel Processing: ColumnTransformer uses the joblib library to parallelize the fitting and transforming of the data.
When n_jobs=-1 is specified, joblib will use all available CPU cores to process the transformations in parallel.
Each transformer in the transformers list can be processed independently, allowing for parallel execution.
Efficiency: by processing different columns in parallel, ColumnTransformer can significantly speed up the preprocessing step, especially for large datasets.
Flexibility: you can apply different preprocessing steps to different subsets of features, which makes it well suited to complex datasets.
Integration with pipelines: ColumnTransformer integrates easily into scikit-learn pipelines, making it a powerful tool for building machine learning workflows.
In summary, the ColumnTransformer object in this implementation applies different preprocessing steps to different columns of the DataFrame and uses parallel processing across all available CPU cores. This makes preprocessing more efficient and flexible, especially for datasets that mix numerical and categorical features.]]></description><link>online-vault/software-engineering/columntransformer.html</link><guid isPermaLink="false">Online Vault/Software Engineering/ColumnTransformer.md</guid><pubDate>Thu, 20 Feb 2025 13:57:06 GMT</pubDate></item><item><title><![CDATA[Théorème d'Échantillonnage de Nyquist-Shannon]]></title><description><![CDATA[The Nyquist-Shannon theorem states that a time-domain signal can be perfectly reconstructed from its samples if the sampling frequency is at least twice the maximum frequency present in the signal (the Nyquist rate). If this condition is not met, a phenomenon called "aliasing" (folding) occurs, which distorts the signal.
For spatial signals the principle is the same, but it applies to the spatial dimension rather than the temporal one. Here is how it works:
Spatial frequency: instead of temporal frequency (measured in hertz, Hz), we speak of spatial frequency (measured in cycles per unit of distance, for example cycles per metre).
Spatial frequency measures the number of cycles or variations per unit of distance in a spatial signal.
Spatial sampling frequency: the spatial sampling frequency is the number of samples taken per unit of distance. To reconstruct a spatial signal faithfully, the spatial sampling frequency must be at least twice the signal's maximum spatial frequency.
Spatial aliasing: if the spatial sampling frequency is lower than twice the maximum spatial frequency, spatial aliasing occurs: high spatial frequencies fold back onto low frequencies, distorting the signal.
Take a topographic map as an example:
Spatial frequency: suppose the map represents elevation variations with a maximum spatial frequency of 1 cycle per 100 metres (that is, one complete elevation variation every 100 metres).
Spatial sampling frequency: to capture these variations faithfully, you need at least 2 samples per 100 metres, i.e. one sample every 50 metres.
Spatial aliasing: if we sample at a lower rate (for example, one sample every 150 metres), the rapid elevation variations will not be captured correctly and the spatial signal will be distorted.
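The map example above can be checked numerically. A small sketch (using NumPy; the 1200 m profile length is chosen only so that both frequencies fall exactly on FFT bins) samples a 100 m-period elevation profile at 40 m and at 150 m spacing and reports the strongest frequency seen in each case:

```python
import numpy as np

f_true = 1 / 100   # elevation profile: 1 cycle per 100 m (0.01 cycles/m)
L = 1200           # profile length in metres (so peaks land on exact FFT bins)

def dominant_freq(dx):
    """Sample the profile every dx metres and return the strongest
    nonzero frequency present in the samples, in cycles per metre."""
    x = np.arange(0, L, dx)
    samples = np.cos(2 * np.pi * f_true * x)
    spectrum = np.abs(np.fft.rfft(samples))
    spectrum[0] = 0.0  # ignore the constant (DC) component
    freqs = np.fft.rfftfreq(len(x), d=dx)
    return freqs[np.argmax(spectrum)]

# One sample every 40 m (more than 2 samples per 100 m cycle): faithful
print(dominant_freq(40))    # 0.01 -> the true 100 m period is recovered
# One sample every 150 m (fewer than 2 samples per cycle): aliased
print(dominant_freq(150))   # ~0.00333 -> folds back to a ~300 m period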
The Nyquist-Shannon theorem therefore applies analogously to spatial signals. The key is to make sure the spatial sampling frequency is high enough to capture all significant variations of the spatial signal; otherwise the signal suffers spatial aliasing, which distorts the captured data.
In summary, for spatial signals the Nyquist-Shannon theorem states that the spatial sampling frequency must be at least twice the signal's maximum spatial frequency for the signal to be faithfully reconstructed.]]></description><link>online-vault/ml-concepts/théorème-d&apos;échantillonnage-de-nyquist-shannon.html</link><guid isPermaLink="false">Online Vault/ML concepts/Théorème d&apos;Échantillonnage de Nyquist-Shannon.md</guid><pubDate>Wed, 19 Feb 2025 13:57:26 GMT</pubDate></item><item><title><![CDATA[Installer DeepSeek R1 Distill en local]]></title><description><![CDATA[This guide is aimed at a non-specialist audience who want to use a large language model as a ChatGPT-style conversational assistant, free of charge and without depending on web interfaces such as OpenAI, Google Gemini, Claude and the other big tech platforms, out of concern for data privacy and technological independence.
The goal is to get you up and running with open-source software using free open-source models, here a distilled DeepSeek R1, without any technical skills or specialized computer.
Of course, since computing power is an essential factor in the model's execution speed, the model size must be adapted to your machine's capabilities. The smaller a model, the lower the quality of its answers, so do not expect the reasoning abilities of models with several hundred billion parameters.
That said, current models are more than good enough for many everyday tasks that do not require heavy computation, in particular anything related to text processing. We will install "small" models here, under 8B (8 billion parameters), which can run correctly on a personal machine.
Minimum:
A computer with at least 8 GB of RAM
A CPU less than 5 years old
~30 GB of free disk space
Recommended:
An NVIDIA graphics card (GPU) less than 5 years old with 8 GB of VRAM (dedicated GPU memory) and the <a data-tooltip-position="top" aria-label="https://developer.nvidia.com/cuda-gpus" rel="noopener nofollow" class="external-link is-unresolved" href="https://developer.nvidia.com/cuda-gpus" target="_self">CUDA</a> architecture, or, for Apple users, an M1 chip or newer.
Warning
If you do not have an NVIDIA GPU, skip this step: the model can run on the CPU alone without any problem.
Otherwise,<br>
Install the <a data-tooltip-position="top" aria-label="https://developer.nvidia.com/cuda-toolkit" rel="noopener nofollow" class="external-link is-unresolved" href="https://developer.nvidia.com/cuda-toolkit" target="_self">CUDA Toolkit</a> to be able to use the GPU through CUDA, following these steps:<br><img alt="cuda_toolkit.png" src="online-vault/tutoriels/cuda_toolkit.png" target="_self">
Follow the instructions of the .exe installer after entering an administrator password. Click Next until you reach this screen, then tick the box:<br>
<img alt="visualstudio_cuda.png" src="online-vault/tutoriels/visualstudio_cuda.png" target="_self">
Follow the instructions until the installation is complete.
Installing Visual Studio is not necessary for this use case.
That's it: you can now use your graphics card for large language models!<br>Several open-source interfaces exist, such as LM Studio or Jan (covered in the tutorial <a data-href="Installer une LLM en local pour un humain local" href="online-vault/tutoriels/installer-une-llm-en-local-pour-un-humain-local.html" class="internal-link" target="_self" rel="noopener nofollow">Installer une LLM en local pour un humain local</a>)<br>
We will focus here on the GPT4All tool, also covered in detail in <a data-href="RAG simple, local et Open-Source avec GPT4All" href="online-vault/tutoriels/rag-simple,-local-et-open-source-avec-gpt4all.html" class="internal-link" target="_self" rel="noopener nofollow">RAG simple, local et Open-Source avec GPT4All</a><br>Go to <a rel="noopener nofollow" class="external-link is-unresolved" href="https://www.nomic.ai/gpt4all" target="_self">https://www.nomic.ai/gpt4all</a> and download the installer for your operating system.<br><img alt="gpt4all_frontpage.png" src="online-vault/tutoriels/gpt4all_frontpage.png" target="_self">On the university network or over VPN, you may need to change the proxy settings to be able to download models through the graphical interface.
Info
As of today (04/02/25), despite configuring the proxy, downloading on the UJM network is still blocked for me; in that case the models must be downloaded manually from Hugging Face.
When the first window appears, click Settings at the bottom left:<br>
<img alt="config_gpt4all.png" src="online-vault/tutoriels/config_gpt4all.png" target="_self">
to configure the HTTP proxy: cache.univ-st-etienne.fr
Replace port 0 with the actual port used on your network.<br><img alt="proxy_gpt4all.png" src="online-vault/tutoriels/proxy_gpt4all.png" target="_self">Default installation folder, inside your user folder:
C:\Users\username\gpt4all ; change it if you want to install somewhere other than C:\
Follow the steps to the end:
If the proxy is configured correctly, the data is downloaded over the internet.
Otherwise, the installation can also be done entirely offline.
Once the installation is finished, you can start using GPT4All! 🎉<br>Documentation
For more details on using the software, refer to the very clear <a data-tooltip-position="top" aria-label="https://docs.gpt4all.io/" rel="noopener nofollow" class="external-link is-unresolved" href="https://docs.gpt4all.io/" target="_self">official documentation</a>
<br><a rel="noopener nofollow" class="external-link is-unresolved" href="https://huggingface.co/" target="_self">https://huggingface.co/</a> is a platform that hosts and supports the open-source AI community, in particular large language models.<br>In the search bar you can find any available open-source model. Here we will use the <a data-tooltip-position="top" aria-label="https://en.wikipedia.org/wiki/Knowledge_distillation" rel="noopener nofollow" class="external-link is-unresolved" href="https://en.wikipedia.org/wiki/Knowledge_distillation" target="_self">distilled</a> versions of the base DeepSeek R1 model, which are compressed versions of the original 671B-parameter model (far too heavy for an ordinary machine)<br><img alt="hf_search_bar.png" src="online-vault/tutoriels/hf_search_bar.png" target="_self">We will download the following model:<br>
<a data-tooltip-position="top" aria-label="https://huggingface.co/bartowski/DeepSeek-R1-Distill-Qwen-7B-GGUF" rel="noopener nofollow" class="external-link is-unresolved" href="https://huggingface.co/bartowski/DeepSeek-R1-Distill-Qwen-7B-GGUF" target="_self">bartowski/DeepSeek-R1-Distill-Qwen-7B-GGUF</a>
(Feel free to use another distilled model from the list above, suited to your needs)<br>This page shows all the information needed to choose a model suited to your machine, including a summary table of the model file sizes and their compression quality after <a data-tooltip-position="top" aria-label="https://huggingface.co/docs/optimum/en/concept_guides/quantization" rel="noopener nofollow" class="external-link is-unresolved" href="https://huggingface.co/docs/optimum/en/concept_guides/quantization" target="_self">quantization</a> (reducing the model weights from 32 bits to 4 bits, for example):<br><img alt="quant_table.png" src="online-vault/tutoriels/quant_table.png" target="_self"><br>How to choose the size?
A good estimate of a model's memory footprint is 1.2 x its size in billions of parameters (e.g., an 8B model will take about 9.6 GB of memory).
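This rule of thumb can be written as a one-liner (the 1.2 factor is the guide's estimate, not a measured value):

```python
def estimated_memory_gb(params_billions: float, factor: float = 1.2) -> float:
    """Rule of thumb from this guide: memory (GB) ~ 1.2 x parameters in billions."""
    return factor * params_billions

# A few common distilled-model sizes
for size in (1.5, 7, 8):
    print(f"{size}B model -> ~{estimated_memory_gb(size):.1f} GB")  # 8B -> ~9.6 GB
```

Compare the estimate against your free RAM (or VRAM, if offloading to the GPU) before picking a model size.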
The quantized models are offered in the <a data-tooltip-position="top" aria-label="https://huggingface.co/docs/hub/en/gguf" rel="noopener nofollow" class="external-link is-unresolved" href="https://huggingface.co/docs/hub/en/gguf" target="_self">.GGUF format</a>, which lets you use them directly at these smaller sizes.
In this case, the recommended default size is the following, which will require about 6 GB of VRAM. Clicking the link takes you to the download page:<br>
<img alt="download_page_hf.png" src="online-vault/tutoriels/download_page_hf.png" target="_self">
where you just need to click one of the highlighted "download" buttons. Well done! 👏 You have finished downloading an open-source model. Now simply drop the downloaded model into the GPT4All folder, accessible by default at: C:/Users/&lt;username&gt;/AppData/Local/nomic.ai/GPT4All/ and configurable in the GPT4All "Settings" tab:<br><img alt="download_path_config.png" src="online-vault/tutoriels/download_path_config.png" target="_self">Then restart the application to see it appear in the chat tab's "Choose a model" drop-down menu:<br><img alt="choose_model_gpt4all.png" src="online-vault/tutoriels/choose_model_gpt4all.png" target="_self">For the model to display its output correctly with the "thinking" chain-of-thought tab, go to: Settings &gt; Model &gt; Chat Template:
Delete the "system message" prompt so that you have something like this:<br>
<img alt="chat_template.png" src="online-vault/tutoriels/chat_template.png" target="_self">
then copy/paste the following chat template:{% if not add_generation_prompt is defined %}{% set add_generation_prompt = false %}{% endif %}{% set ns = namespace(is_first=false, is_tool=false, is_output_first=true, system_prompt='') %}{%- for message in messages %}{%- if message['role'] == 'system' %}{% set ns.system_prompt = message['content'] %}{%- endif %}{%- endfor %}{{bos_token}}{{ns.system_prompt}}{%- for message in messages %}{%- if message['role'] == 'user' %}{%- set ns.is_tool = false -%}{{'&lt;｜User｜&gt;' + message['content']}}{%- endif %}{%- if message['role'] == 'assistant' and message['content'] is none %}{%- set ns.is_tool = false -%}{%- for tool in message['tool_calls']%}{%- if not ns.is_first %}{{'&lt;｜Assistant｜&gt;&lt;｜tool▁calls▁begin｜&gt;&lt;｜tool▁call▁begin｜&gt;' + tool['type'] + '&lt;｜tool▁sep｜&gt;' + tool['function']['name'] + '\n' + '```json' + '\n' + tool['function']['arguments'] + '\n' + '```' + '&lt;｜tool▁call▁end｜&gt;'}}{%- set ns.is_first = true -%}{%- else %}{{'\n' + '&lt;｜tool▁call▁begin｜&gt;' + tool['type'] + '&lt;｜tool▁sep｜&gt;' + tool['function']['name'] + '\n' + '```json' + '\n' + tool['function']['arguments'] + '\n' + '```' + '&lt;｜tool▁call▁end｜&gt;'}}{{'&lt;｜tool▁calls▁end｜&gt;&lt;｜end▁of▁sentence｜&gt;'}}{%- endif %}{%- endfor %}{%- endif %}{%- if message['role'] == 'assistant' and message['content'] is not none %}{%- if ns.is_tool %}{{'&lt;｜tool▁outputs▁end｜&gt;' + message['content'] + '&lt;｜end▁of▁sentence｜&gt;'}}{%- set ns.is_tool = false -%}{%- else %}{% set content = message['content'] %}{% if '&lt;/think&gt;' in content %}{% set content = content.split('&lt;/think&gt;')[-1] %}{% endif %}{{'&lt;｜Assistant｜&gt;' + content + '&lt;｜end▁of▁sentence｜&gt;'}}{%- endif %}{%- endif %}{%- if message['role'] == 'tool' %}{%- set ns.is_tool = true -%}{%- if ns.is_output_first %}{{'&lt;｜tool▁outputs▁begin｜&gt;&lt;｜tool▁output▁begin｜&gt;' + message['content'] + '&lt;｜tool▁output▁end｜&gt;'}}{%- set ns.is_output_first = false %}{%- else 
%}{{'\n&lt;｜tool▁output▁begin｜&gt;' + message['content'] + '&lt;｜tool▁output▁end｜&gt;'}}{%- endif %}{%- endif %}{%- endfor -%}{% if ns.is_tool %}{{'&lt;｜tool▁outputs▁end｜&gt;'}}{% endif %}{% if add_generation_prompt and not ns.is_tool %}{{'&lt;｜Assistant｜&gt;'}}{% endif %}
Careful: this can vary between models, so I advise you to go to the source instead, detailed here:<br>Original template
PS: the original chat template is available on the <a data-tooltip-position="top" aria-label="https://huggingface.co/bartowski/DeepSeek-R1-Distill-Qwen-7B-GGUF" rel="noopener nofollow" class="external-link is-unresolved" href="https://huggingface.co/bartowski/DeepSeek-R1-Distill-Qwen-7B-GGUF" target="_self">model card linked earlier</a> on Hugging Face, by clicking one of the model-version buttons:
<img alt="quant_prompt_tempate.png" src="online-vault/tutoriels/quant_prompt_tempate.png" target="_self">
In the metadata tab:<br>
<img alt="metadata.png" src="online-vault/tutoriels/metadata.png" target="_self">
At the bottom of the settings page, you can adjust the following:<br>
<img alt="settings_model_gpt4all.png" src="online-vault/tutoriels/settings_model_gpt4all.png" target="_self">
Set the maximum context size according to the memory available on top of what the model itself occupies (here 131k tokens, but depending on your hardware you can limit it to, say, 10k tokens). The Context Length constrains how much text the model sees at any given moment during generation.
In other words, it is a sliding window of text that takes into account:<br>
the instructions you explicitly type + the chat-template instructions + the previous turns of the conversation + the previous "<a data-href="Chain-of-Thought Prompting" href="chain-of-thought-prompting.html" class="internal-link" target="_self" rel="noopener nofollow">Chain-of-Thought Prompting</a>" reasoning steps. Max Length controls the maximum number of tokens generated as output, including the "thinking" tokens for reasoning models such as DeepSeek R1. This parameter therefore constrains both the maximum length of the answer and the model's thinking time. I use 8k tokens, but you can experiment with different values. The Temperature recommended by the DeepSeek developers is 0.6.
This parameter controls the variability of generation. Leave the other settings at their defaults. And there you go, your DeepSeek R1 Distill model is ready to use! 😎<br><img alt="exemple_chat_deepseek.png" src="online-vault/tutoriels/exemple_chat_deepseek.png" target="_self"><br>Bonus
You can pair it with local documents on your machine, in complete security and privacy, by following the guide <a data-href="RAG simple, local et Open-Source avec GPT4All" href="online-vault/tutoriels/rag-simple,-local-et-open-source-avec-gpt4all.html" class="internal-link" target="_self" rel="noopener nofollow">RAG simple, local et Open-Source avec GPT4All</a>.
]]></description><link>online-vault/tutoriels/installer-deepseek-r1-distill-en-local.html</link><guid isPermaLink="false">Online Vault/Tutoriels/Installer DeepSeek R1 Distill en local.md</guid><pubDate>Tue, 04 Feb 2025 16:16:31 GMT</pubDate><enclosure url="." length="0" type="false"/><content:encoded>&lt;figure&gt;&lt;img src=&quot;.&quot;&gt;&lt;/figure&gt;</content:encoded></item><item><title><![CDATA[settings_model_gpt4all]]></title><description><![CDATA[<img src="online-vault/tutoriels/settings_model_gpt4all.png" target="_self">]]></description><link>online-vault/tutoriels/settings_model_gpt4all.html</link><guid isPermaLink="false">Online Vault/Tutoriels/settings_model_gpt4all.png</guid><pubDate>Tue, 04 Feb 2025 16:13:37 GMT</pubDate><enclosure url="." length="0" type="false"/><content:encoded>&lt;figure&gt;&lt;img src=&quot;.&quot;&gt;&lt;/figure&gt;</content:encoded></item><item><title><![CDATA[exemple_chat_deepseek]]></title><description><![CDATA[<img src="online-vault/tutoriels/exemple_chat_deepseek.png" target="_self">]]></description><link>online-vault/tutoriels/exemple_chat_deepseek.html</link><guid isPermaLink="false">Online Vault/Tutoriels/exemple_chat_deepseek.png</guid><pubDate>Tue, 04 Feb 2025 11:19:44 GMT</pubDate><enclosure url="." length="0" type="false"/><content:encoded>&lt;figure&gt;&lt;img src=&quot;.&quot;&gt;&lt;/figure&gt;</content:encoded></item><item><title><![CDATA[chat_template]]></title><description><![CDATA[<img src="online-vault/tutoriels/chat_template.png" target="_self">]]></description><link>online-vault/tutoriels/chat_template.html</link><guid isPermaLink="false">Online Vault/Tutoriels/chat_template.png</guid><pubDate>Tue, 04 Feb 2025 11:16:53 GMT</pubDate><enclosure url="." 
length="0" type="false"/><content:encoded>&lt;figure&gt;&lt;img src=&quot;.&quot;&gt;&lt;/figure&gt;</content:encoded></item><item><title><![CDATA[metadata]]></title><description><![CDATA[<img src="online-vault/tutoriels/metadata.png" target="_self">]]></description><link>online-vault/tutoriels/metadata.html</link><guid isPermaLink="false">Online Vault/Tutoriels/metadata.png</guid><pubDate>Tue, 04 Feb 2025 11:16:03 GMT</pubDate><enclosure url="." length="0" type="false"/><content:encoded>&lt;figure&gt;&lt;img src=&quot;.&quot;&gt;&lt;/figure&gt;</content:encoded></item><item><title><![CDATA[quant_prompt_tempate]]></title><description><![CDATA[<img src="online-vault/tutoriels/quant_prompt_tempate.png" target="_self">]]></description><link>online-vault/tutoriels/quant_prompt_tempate.html</link><guid isPermaLink="false">Online Vault/Tutoriels/quant_prompt_tempate.png</guid><pubDate>Tue, 04 Feb 2025 11:15:10 GMT</pubDate><enclosure url="." length="0" type="false"/><content:encoded>&lt;figure&gt;&lt;img src=&quot;.&quot;&gt;&lt;/figure&gt;</content:encoded></item><item><title><![CDATA[choose_model_gpt4all]]></title><description><![CDATA[<img src="online-vault/tutoriels/choose_model_gpt4all.png" target="_self">]]></description><link>online-vault/tutoriels/choose_model_gpt4all.html</link><guid isPermaLink="false">Online Vault/Tutoriels/choose_model_gpt4all.png</guid><pubDate>Tue, 04 Feb 2025 11:08:27 GMT</pubDate><enclosure url="." length="0" type="false"/><content:encoded>&lt;figure&gt;&lt;img src=&quot;.&quot;&gt;&lt;/figure&gt;</content:encoded></item><item><title><![CDATA[download_path_config]]></title><description><![CDATA[<img src="online-vault/tutoriels/download_path_config.png" target="_self">]]></description><link>online-vault/tutoriels/download_path_config.html</link><guid isPermaLink="false">Online Vault/Tutoriels/download_path_config.png</guid><pubDate>Tue, 04 Feb 2025 11:06:03 GMT</pubDate><enclosure url="." 
length="0" type="false"/><content:encoded>&lt;figure&gt;&lt;img src=&quot;.&quot;&gt;&lt;/figure&gt;</content:encoded></item><item><title><![CDATA[download_page_hf]]></title><description><![CDATA[<img src="online-vault/tutoriels/download_page_hf.png" target="_self">]]></description><link>online-vault/tutoriels/download_page_hf.html</link><guid isPermaLink="false">Online Vault/Tutoriels/download_page_hf.png</guid><pubDate>Tue, 04 Feb 2025 10:54:26 GMT</pubDate><enclosure url="." length="0" type="false"/><content:encoded>&lt;figure&gt;&lt;img src=&quot;.&quot;&gt;&lt;/figure&gt;</content:encoded></item><item><title><![CDATA[quant_table]]></title><description><![CDATA[<img src="online-vault/tutoriels/quant_table.png" target="_self">]]></description><link>online-vault/tutoriels/quant_table.html</link><guid isPermaLink="false">Online Vault/Tutoriels/quant_table.png</guid><pubDate>Tue, 04 Feb 2025 10:49:17 GMT</pubDate><enclosure url="." length="0" type="false"/><content:encoded>&lt;figure&gt;&lt;img src=&quot;.&quot;&gt;&lt;/figure&gt;</content:encoded></item><item><title><![CDATA[hf_search_bar]]></title><description><![CDATA[<img src="online-vault/tutoriels/hf_search_bar.png" target="_self">]]></description><link>online-vault/tutoriels/hf_search_bar.html</link><guid isPermaLink="false">Online Vault/Tutoriels/hf_search_bar.png</guid><pubDate>Tue, 04 Feb 2025 10:37:41 GMT</pubDate><enclosure url="." length="0" type="false"/><content:encoded>&lt;figure&gt;&lt;img src=&quot;.&quot;&gt;&lt;/figure&gt;</content:encoded></item><item><title><![CDATA[gpt4all_frontpage]]></title><description><![CDATA[<img src="online-vault/tutoriels/gpt4all_frontpage.png" target="_self">]]></description><link>online-vault/tutoriels/gpt4all_frontpage.html</link><guid isPermaLink="false">Online Vault/Tutoriels/gpt4all_frontpage.png</guid><pubDate>Tue, 04 Feb 2025 10:20:26 GMT</pubDate><enclosure url="." 
length="0" type="false"/><content:encoded>&lt;figure&gt;&lt;img src=&quot;.&quot;&gt;&lt;/figure&gt;</content:encoded></item><item><title><![CDATA[Ecosystème EVS]]></title><description><![CDATA[[Mind map of the EVS ecosystem: relation labels such as "is organized into", "depends on", "is composed of", "is a", "partner of", "is administered by", "is supported by", "funded by", "in France", "internationally", linking EVS, its platforms, direction, parent institutions (tutelles), ateliers, the Labex and the Investissements d'avenir.]<img src="online-vault/mind-maps/organisation-territoriale.png" draggable="false" target="_self">Organisation territoriale.png<img src="online-vault/mind-maps/cartemonde_fr.png" draggable="false" target="_self">CarteMonde_FR.png
The scientific directorate of the CNRS comprises ten institutes that steer the organization's research strategy within their disciplinary scope and coordinate the activities and projects of the laboratories attached to them.<br><a data-tooltip-position="top" aria-label="https://www.cnrs.fr/disciplines-de-recherche" rel="noopener nofollow" class="external-link is-unresolved" href="https://www.cnrs.fr/disciplines-de-recherche" target="_self">Discover the CNRS research institutes</a>The CNRS is present throughout the country through its 17 regional offices (délégations régionales).
As the privileged contacts of local partners and territorial authorities, these offices provide direct, close-at-hand management of the laboratories and help in particular with setting up industrial projects and European programmes. The CNRS is also present in Europe and internationally through its network of offices located in key ecosystems of world research: Brussels, Melbourne, New Delhi, Ottawa, Beijing, Pretoria, Rio de Janeiro, Singapore, Tokyo and Washington.<br><a data-tooltip-position="top" aria-label="https://www.cnrs.fr/fr/delegations-regionales-du-cnrs" rel="noopener nofollow" class="external-link is-unresolved" href="https://www.cnrs.fr/fr/delegations-regionales-du-cnrs" target="_self">Discover the CNRS's 17 regional offices</a>The CNRS has more than 1,100 laboratories spread across France. Their teams, more than 33,000 people working for research and innovation, produce and transmit knowledge. As the organization's basic building blocks, these laboratories shape the local scientific landscape; the vast majority are joint research units (unités mixtes de recherche, UMR) associated with a university, a higher-education school or another research organization. For a laboratory, obtaining UMR status is a mark of recognition in the research world, in France and abroad.
To these national laboratories are added 80 international research laboratories, a number that keeps growing.<br><a data-tooltip-position="top" aria-label="https://annuaire.cnrs.fr/NavigationServlet?pageName=recherche&amp;type=UNITE" rel="noopener nofollow" class="external-link is-unresolved" href="https://annuaire.cnrs.fr/NavigationServlet?pageName=recherche&amp;type=UNITE" target="_self">Contact a laboratory</a><img src="online-vault/mind-maps/pasted-image-20240626174546.png" draggable="false" target="_self">Pasted image 20240626174546.png<img src="online-vault/mind-maps/organigramme_cnrs.png" draggable="false" target="_self">organigramme_cnrs.png<img src="online-vault/mind-maps/logo_cnrs.png" draggable="false" target="_self">logo_CNRS.png<br><a data-tooltip-position="top" aria-label="https://fr.wikipedia.org/wiki/Unit%C3%A9_mixte_de_recherche" rel="noopener nofollow" class="external-link is-unresolved" href="https://fr.wikipedia.org/wiki/Unit%C3%A9_mixte_de_recherche" target="_self">Unité Mixte de Recherche</a>:<br>
The unité mixte de recherche (UMR) is the "basic building block" of how research is organized in France <a data-tooltip-position="top" aria-label="https://fr.wikipedia.org/wiki/Unit%C3%A9_mixte_de_recherche#cite_note-FU-1" rel="noopener nofollow" class="external-link is-unresolved" href="https://fr.wikipedia.org/wiki/Unit%C3%A9_mixte_de_recherche#cite_note-FU-1" target="_self">1</a>.<br>
It has its own budget lines and staff assigned by its partners (<a data-tooltip-position="top" aria-label="https://fr.wikipedia.org/wiki/CNRS" rel="noopener nofollow" class="external-link is-unresolved" title="CNRS" href="https://fr.wikipedia.org/wiki/CNRS" target="_self">CNRS</a>, university, etc.). Its director, supported by a laboratory council, defines its research strategy.<br>The UMR is led by a <a data-tooltip-position="top" aria-label="https://umr5600.cnrs.fr/fr/le-laboratoire/equipe-directoriale/" rel="noopener nofollow" class="external-link is-unresolved" href="https://umr5600.cnrs.fr/fr/le-laboratoire/equipe-directoriale/" target="_self">management team (Equipe Directoriale)</a>, listed at the link, which steers the UMR and chairs the various councils listed there, among others.<br><a data-tooltip-position="top" aria-label="https://umr5600.cnrs.fr/fr/le-laboratoire/conseil-de-laboratoire/" rel="noopener nofollow" class="external-link is-unresolved" href="https://umr5600.cnrs.fr/fr/le-laboratoire/conseil-de-laboratoire/" target="_self">Conseil de Laboratoire</a> (laboratory council):
The laboratory council's role is to assist the management in building the unit's scientific strategy. It is the unit's consultative body, in which all members are represented and where the different sensibilities present within the unit can express themselves. It is chaired by the laboratory director or the deputy director. It meets at least three times a year (if possible once per quarter) at one of the component sites, convened by the management or at the request of one third of its members. Its agenda is drawn up on the management's proposal.<br><a data-tooltip-position="top" aria-label="https://umr5600.cnrs.fr/fr/le-laboratoire/conseil-orientation-scientifique/" rel="noopener nofollow" class="external-link is-unresolved" href="https://umr5600.cnrs.fr/fr/le-laboratoire/conseil-orientation-scientifique/" target="_self">Conseil d'Orientation Scientifique</a> (scientific steering council): this council meets roughly every two months. It is composed of the management team, the heads of the components, the heads of the thematic ateliers and the heads of the platforms.
Réseau des Gestionnaires (administrators' network):<br>a network bringing together the administrative managers of EVS.<img src="online-vault/mind-maps/conseils.png" draggable="false" target="_self">Conseils.png<img src="online-vault/mind-maps/labex.png" draggable="false" target="_self">labex.png<img src="online-vault/mind-maps/logo-evs.png" draggable="false" target="_self">Logo EVS.png
UMR 5600 has several technical-support platforms.
"Observation et Mesure des Environnements Actuels et Anciens" (OMEAA) is a platform dedicated to field metrology and laboratory analysis. A successor to the "Environnement" platform of the previous four-year contract, it is run jointly by UMR 5600 "Environnement Ville Société" and UMR 5133 "Archéorient, environnements et sociétés de l'Orient ancien"<br><a data-tooltip-position="top" aria-label="http://umr5600.prod.lamp.cnrs.fr/support-technique/observation-et-mesure-des-environnements-actuels-et-anciens/" rel="noopener nofollow" class="external-link is-unresolved" href="http://umr5600.prod.lamp.cnrs.fr/support-technique/observation-et-mesure-des-environnements-actuels-et-anciens/" target="_self">More information…</a>The "Imagerie et Systèmes d'Information Géographique" (ISIG) platform is a place for sharing, exchange and networking among the members of UMR 5600 EVS working in imaging and geomatics.<br><a data-tooltip-position="top" aria-label="http://umr5600.prod.lamp.cnrs.fr/support-technique/imagerie-et-systemes-dinformation-geographiques/" rel="noopener nofollow" class="external-link is-unresolved" href="http://umr5600.prod.lamp.cnrs.fr/support-technique/imagerie-et-systemes-dinformation-geographiques/" target="_self">More information…</a>The Veille &amp; Valorisation scientifique (2VS) platform supports the members of the EVS laboratory (UMR 5600) in their research and promotes their scientific output, through the publication of a multidisciplinary scientific and bibliographic watch and the dissemination of the knowledge produced to a wider audience.<br><a data-tooltip-position="top" aria-label="https://umr5600.cnrs.fr/fr/2vs/" rel="noopener nofollow" class="external-link is-unresolved" href="https://umr5600.cnrs.fr/fr/2vs/" target="_self">More information…</a><br>The EVS laboratory (<a data-tooltip-position="top" aria-label="https://umr5600.cnrs.fr/fr/accueil/" rel="noopener nofollow" class="external-link is-unresolved" href="https://umr5600.cnrs.fr/fr/accueil/" target="_self">Environnement Ville Société</a>) is a UMR (joint research unit) whose purpose is the interdisciplinary analysis of the dynamics of change between environment, city and society. This UMR is built from its components, which belong to different parent institutions (tutelles, research establishments). To this end the Unit, present on several campuses in Lyon and Saint-Etienne, is structured not as autonomous, single-site, single-discipline teams but as seven "ateliers" forming thematic research poles, through which shared knowledge circulates and is built.<br>EVS thus develops a reflexive stance on the place of science and technology in contemporary society, and on the exercise of scientific plurality within the unit, thanks to a broad disciplinary spectrum covering mainly geography, urban planning, anthropology, law, engineering sciences and architecture.<img src="online-vault/mind-maps/plateformes.png" draggable="false" target="_self">plateformes.png<br>The programmes d'investissement d'avenir (PIA), also called investissements d'avenir, are an investment programme of the <a data-tooltip-position="top" aria-label="https://fr.wikipedia.org/wiki/France" rel="noopener nofollow" class="external-link is-unresolved" title="France" href="https://fr.wikipedia.org/wiki/France" target="_self">French state</a> to support research and innovation, launched in 2010.
Il s'agit d'une&nbsp;<a data-tooltip-position="top" aria-label="https://fr.wikipedia.org/wiki/Politique_publique" rel="noopener nofollow" class="external-link is-unresolved" title="Politique publique" href="https://fr.wikipedia.org/wiki/Politique_publique" target="_self">politique publique</a>&nbsp;de l'<a data-tooltip-position="top" aria-label="https://fr.wikipedia.org/wiki/Innovation" rel="noopener nofollow" class="external-link is-unresolved" title="Innovation" href="https://fr.wikipedia.org/wiki/Innovation" target="_self">innovation</a>.L'ensemble des PIA mobilise 77 milliards d’euros. Les gouvernements successifs ont investi à hauteur de 35 milliards (PIA 1), 12 milliards (PIA 2), 10 milliards (PIA 3) et 20 milliards (PIA 4).Une petite partie de cette somme est directement versée sous forme de subventions, le reste étant des prêts ou des placements dont seuls les intérêts sont consommables.<br>Ces fonds ont permis, entre autres, un soutien à la recherche et aux projets innovants, la création des&nbsp;<a data-tooltip-position="top" aria-label="https://fr.wikipedia.org/wiki/Institut_de_recherche_technologique" rel="noopener nofollow" class="external-link is-unresolved" title="Institut de recherche technologique" href="https://fr.wikipedia.org/wiki/Institut_de_recherche_technologique" target="_self">instituts de recherche technologique</a>&nbsp;(IRT), des&nbsp;<a data-tooltip-position="top" aria-label="https://fr.wikipedia.org/wiki/Institut_hospitalo-universitaire" rel="noopener nofollow" class="external-link is-unresolved" title="Institut hospitalo-universitaire" href="https://fr.wikipedia.org/wiki/Institut_hospitalo-universitaire" target="_self">instituts hospitalo-universitaires</a>&nbsp;(IHU), des&nbsp;<a data-tooltip-position="top" aria-label="https://fr.wikipedia.org/wiki/Soci%C3%A9t%C3%A9_d%27acc%C3%A9l%C3%A9ration_du_transfert_de_technologies" rel="noopener nofollow" class="external-link is-unresolved" title="Société d'accélération du transfert de 
technologies" href="https://fr.wikipedia.org/wiki/Soci%C3%A9t%C3%A9_d%27acc%C3%A9l%C3%A9ration_du_transfert_de_technologies" target="_self">sociétés d’accélération du transfert de technologies</a>&nbsp;(SATT), des&nbsp;<a data-tooltip-position="top" aria-label="https://fr.wikipedia.org/wiki/%C3%89cole_universitaire_de_recherche" rel="noopener nofollow" class="external-link is-unresolved" title="École universitaire de recherche" href="https://fr.wikipedia.org/wiki/%C3%89cole_universitaire_de_recherche" target="_self">écoles universitaires de recherche</a>&nbsp;(EUR) et un soutien au projet de&nbsp;<a data-tooltip-position="top" aria-label="https://fr.wikipedia.org/wiki/Grappe_industrielle" rel="noopener nofollow" class="external-link is-unresolved" title="Grappe industrielle" href="https://fr.wikipedia.org/wiki/Grappe_industrielle" target="_self">cluster</a>&nbsp;technologique&nbsp;<a data-tooltip-position="top" aria-label="https://fr.wikipedia.org/wiki/Paris-Saclay" rel="noopener nofollow" class="external-link is-unresolved" title="Paris-Saclay" href="https://fr.wikipedia.org/wiki/Paris-Saclay" target="_self">Paris-Saclay</a>.LabEx&nbsp;:<br>(ou&nbsp;Labex,&nbsp;<a data-tooltip-position="top" aria-label="https://fr.wikipedia.org/wiki/Mot-valise" rel="noopener nofollow" class="external-link is-unresolved" title="Mot-valise" href="https://fr.wikipedia.org/wiki/Mot-valise" target="_self">mot-valise</a>&nbsp;fabriqué à partir de la contraction de&nbsp;Laboratoire d'excellence) est un des instruments du&nbsp;<a data-tooltip-position="top" aria-label="https://fr.wikipedia.org/wiki/Programme_d%27investissements_d%27avenir" rel="noopener nofollow" class="external-link is-unresolved" title="Programme d'investissements d'avenir" href="https://fr.wikipedia.org/wiki/Programme_d%27investissements_d%27avenir" target="_self">programme d'investissements d'avenir</a>, destiné à soutenir la recherche d'ensemble d'équipes sur une thématique scientifique donnée.<br>À la suite des 
travaux de la&nbsp;<a data-tooltip-position="top" aria-label="https://fr.wikipedia.org/wiki/Investissements_d%27avenir" rel="noopener nofollow" class="external-link is-unresolved" title="Investissements d'avenir" href="https://fr.wikipedia.org/wiki/Investissements_d%27avenir" target="_self">Commission Juppé-Rocard</a>&nbsp;en 2009, le programme Investissements d'avenir s’est vu confier une enveloppe globale de&nbsp;35 milliards&nbsp;d’euros<a data-tooltip-position="top" aria-label="https://fr.wikipedia.org/wiki/Labex#cite_note-1" rel="noopener nofollow" class="external-link is-unresolved" href="https://fr.wikipedia.org/wiki/Labex#cite_note-1" target="_self">1</a>&nbsp;(puis&nbsp;12 milliards&nbsp;supplémentaires en 2014) pour que la France se place à la pointe de l’<a data-tooltip-position="top" aria-label="https://fr.wikipedia.org/wiki/Innovation" rel="noopener nofollow" class="external-link is-unresolved" title="Innovation" href="https://fr.wikipedia.org/wiki/Innovation" target="_self">innovation</a>.<br>Parmi les&nbsp;<a data-tooltip-position="top" aria-label="https://fr.wikipedia.org/wiki/Appel_%C3%A0_projets" rel="noopener nofollow" class="external-link is-unresolved" title="Appel à projets" href="https://fr.wikipedia.org/wiki/Appel_%C3%A0_projets" target="_self">appels à projets</a>&nbsp;lancés par le gouvernement français. 
l’appel LabEx (Laboratoire d'excellence) avait pour objectif de doter de moyens significatifs les&nbsp;<a data-tooltip-position="top" aria-label="https://fr.wikipedia.org/wiki/Laboratoire_de_recherche" rel="noopener nofollow" class="external-link is-unresolved" title="Laboratoire de recherche" href="https://fr.wikipedia.org/wiki/Laboratoire_de_recherche" target="_self">unités de recherche</a>&nbsp;ayant une visibilité internationale, pour leur permettre de faire jeu égal avec leurs homologues étrangers.<img src="online-vault/mind-maps/investir_avenir.png" draggable="false" target="_self">Investir_avenir.png<br>Les <a data-tooltip-position="top" aria-label="https://umr5600.cnrs.fr/fr/la-recherche/ateliers/" rel="noopener nofollow" class="external-link is-unresolved" href="https://umr5600.cnrs.fr/fr/la-recherche/ateliers/" target="_self">Ateliers EVS</a> sont des thématiques de recherches pluridisciplinaires qui ont pour but d'organiser la recherche de l'UMR par groupes où le savoir circule et se construit.Les ateliers sont relationnels et thématisés, formant des polarités de recherche sur les dynamiques du changement et les enjeux territoriaux et environnementaux. Cette organisation scientifique vise à favoriser la pluralité scientifique en associant des chercheurs de disciplines différentes autour d'objets de recherche partagés.Les Tutelles EVS sont les établissement partenaires. 
They are public research institutions, such as the engineering schools and universities of Lyon and Saint-Étienne. Each tutelle is represented by one or more Composantes, the research-unit building blocks that make up the UMR. They are summarised in the image opposite, and the link for each tutelle is listed below.<br>|<a data-tooltip-position="top" aria-label="http://www.cnrs.fr/" rel="noopener nofollow" class="external-link is-unresolved" href="http://www.cnrs.fr/" target="_self"><figure class="image-captions-figure"><figcaption class="image-captions-caption"></figcaption></figure></a><img alt="logo cnrs" src="https://umr5600.cnrs.fr/wp-content/uploads/2022/05/Logo_CNRS_blanc_2019.png" referrerpolicy="no-referrer" target="_self" class="is-unresolved">logo cnrs|<a data-tooltip-position="top" aria-label="http://www.univ-lyon3.fr/" rel="noopener nofollow" class="external-link is-unresolved" href="http://www.univ-lyon3.fr/" target="_self"><figure class="image-captions-figure"><figcaption class="image-captions-caption"></figcaption></figure></a><img alt="logo universite Lyon 3" src="http://umr5600.prod.lamp.cnrs.fr/wp-content/uploads/2017/02/Logo_Lyon3_blanc.png" referrerpolicy="no-referrer" target="_self" class="is-unresolved">logo universite Lyon 3|<a data-tooltip-position="top" aria-label="http://www.univ-lyon2.fr/" rel="noopener nofollow" class="external-link is-unresolved" href="http://www.univ-lyon2.fr/" target="_self"><figure class="image-captions-figure"><figcaption class="image-captions-caption"></figcaption></figure></a><img alt="logo université Lyon 2" src="https://umr5600.cnrs.fr/wp-content/uploads/2019/02/univlyon2_logo2018_site_umr-300x300.png" referrerpolicy="no-referrer" target="_self" class="is-unresolved">logo université Lyon 2|<br>
|<a data-tooltip-position="top" aria-label="https://www.univ-st-etienne.fr/" rel="noopener nofollow" class="external-link is-unresolved" href="https://www.univ-st-etienne.fr/" target="_self"><figure class="image-captions-figure"><figcaption class="image-captions-caption"></figcaption></figure></a><img alt="Logo universite Jean Monnet" src="http://umr5600.prod.lamp.cnrs.fr/wp-content/uploads/2017/02/Logo_UJM_blanc.png" referrerpolicy="no-referrer" target="_self" class="is-unresolved">Logo universite Jean Monnet|<a data-tooltip-position="top" aria-label="http://www.mines-stetienne.fr/" rel="noopener nofollow" class="external-link is-unresolved" href="http://www.mines-stetienne.fr/" target="_self"><figure class="image-captions-figure"><figcaption class="image-captions-caption"></figcaption></figure></a><img alt="logo Mines Saint Etienne" src="http://umr5600.prod.lamp.cnrs.fr/wp-content/uploads/2017/02/Logo_ENMSE_blanc.png" referrerpolicy="no-referrer" target="_self" class="is-unresolved">logo Mines Saint Etienne|<a data-tooltip-position="top" aria-label="https://www.insa-lyon.fr/" rel="noopener nofollow" class="external-link is-unresolved" href="https://www.insa-lyon.fr/" target="_self"><figure class="image-captions-figure"><figcaption class="image-captions-caption"></figcaption></figure></a><img alt="logo INSA" src="http://umr5600.prod.lamp.cnrs.fr/wp-content/uploads/2017/02/Logo_INSA_blanc.png" referrerpolicy="no-referrer" target="_self" class="is-unresolved">logo INSA|<br>
|<a data-tooltip-position="top" aria-label="http://www.ens-lyon.fr/" rel="noopener nofollow" class="external-link is-unresolved" href="http://www.ens-lyon.fr/" target="_self"><figure class="image-captions-figure"><figcaption class="image-captions-caption"></figcaption></figure></a><img alt="logo ENS Lyon" src="http://umr5600.prod.lamp.cnrs.fr/wp-content/uploads/2017/02/Logo_ENS_de_Lyon_blanc.png" referrerpolicy="no-referrer" target="_self" class="is-unresolved">logo ENS Lyon|<a data-tooltip-position="top" aria-label="http://www.entpe.fr/" rel="noopener nofollow" class="external-link is-unresolved" href="http://www.entpe.fr/" target="_self"><figure class="image-captions-figure"><figcaption class="image-captions-caption"></figcaption></figure></a><img alt="logo ENTPE" src="https://umr5600.cnrs.fr/wp-content/uploads/2019/02/entpe_logo_site_umr-300x300.png" referrerpolicy="no-referrer" target="_self" class="is-unresolved">logo ENTPE|<a data-tooltip-position="top" aria-label="http://www.lyon.archi.fr/" rel="noopener nofollow" class="external-link is-unresolved" href="http://www.lyon.archi.fr/" target="_self"><figure class="image-captions-figure"><figcaption class="image-captions-caption"></figcaption></figure></a><img alt="logo ENSAL" src="http://umr5600.prod.lamp.cnrs.fr/wp-content/uploads/2017/05/Logo_ENSAL_blanc-1.png" referrerpolicy="no-referrer" target="_self" class="is-unresolved">logo ENSAL|<img src="online-vault/mind-maps/composantes.png" draggable="false" target="_self">Composantes.png<img src="online-vault/mind-maps/tutelles.png" draggable="false" target="_self">tutelles.png<img src="online-vault/mind-maps/ateliers_evs.png" draggable="false" target="_self">Ateliers_EVS.png]]></description><link>online-vault/mind-maps/ecosystème-evs.html</link><guid isPermaLink="false">Online Vault/Mind maps/Ecosystème EVS.canvas</guid><pubDate>Mon, 03 Feb 2025 09:23:38 GMT</pubDate><enclosure url="." 
length="0" type="false"/><content:encoded>&lt;figure&gt;&lt;img src=&quot;.&quot;&gt;&lt;/figure&gt;</content:encoded></item><item><title><![CDATA[Removing files & folders using windows powershell]]></title><description><![CDATA[To remove all .xlsx files from a folder and its sub-folders recursively in Windows Terminal, you can use the Command Prompt (cmd) or PowerShell. Here are the steps for both methods:
Open Windows Terminal.
Navigate to the folder from which you want to delete the .xlsx files using the cd command. For example:
cd path\to\your\folder Use the del command with the /s switch to delete files recursively:
del /s *.xlsx For the PowerShell method: Open Windows Terminal.
Navigate to the folder from which you want to delete the .xlsx files using the cd command. For example:
cd path\to\your\folder Use the Get-ChildItem cmdlet to find all .xlsx files recursively and pipe the results to the Remove-Item cmdlet:
Get-ChildItem -Path . -Recurse -Filter *.xlsx | Remove-Item Command Prompt (cmd): del /s *.xlsx: The del command deletes files, and the /s switch makes it recursive. The *.xlsx pattern matches all .xlsx files. PowerShell: Get-ChildItem -Path . -Recurse -Filter *.xlsx: This cmdlet retrieves all items (files) in the current directory (.) and its subdirectories (-Recurse) that match the *.xlsx pattern.
| Remove-Item: The pipe (|) sends the output of Get-ChildItem to Remove-Item, which deletes the files. Choose the method that you are most comfortable with. Both will achieve the same result of deleting all .xlsx files from the specified folder and its sub-folders. To remove all empty folders in a directory and its subdirectories using PowerShell, you can use the following script:
Open Windows Terminal.
Navigate to the folder from which you want to delete the empty folders using the cd command. For example:
cd path\to\your\folder Use the following PowerShell script to remove all empty folders recursively:
Get-ChildItem -Path . -Recurse -Directory | Where-Object { (Get-ChildItem -Path $_.FullName -Force).Count -eq 0 } | Remove-Item -Recurse -Force Get-ChildItem -Path . -Recurse -Directory: This cmdlet retrieves all directories (folders) in the current directory (.) and its subdirectories (-Recurse). The -Directory switch ensures that only directories are returned.
Where-Object { (Get-ChildItem -Path $_.FullName -Force).Count -eq 0 }: This filters the directories to include only those that are empty. Get-ChildItem -Path $_.FullName -Force retrieves the items within each directory, and .Count -eq 0 checks if the count of items is zero.
Remove-Item -Recurse -Force: This cmdlet removes the empty directories. The -Recurse switch ensures that all empty subdirectories are also removed, and the -Force switch allows the removal of read-only files and directories.
The PowerShell pipeline above removes all empty folders within the specified directory and its subdirectories. Note that a folder containing only empty sub-folders does not count as empty on the first pass (its child folders are still present when it is checked), so you may need to re-run the command until nothing is left to remove. To remove all files except those with the extensions .pdf, .doc, .docx, .ppt, and .pptx in a folder and its subfolders using PowerShell, you can use the following script:
Open Windows Terminal.
Navigate to the folder from which you want to remove the files using the cd command. For example:
cd path\to\your\folder Use the following PowerShell script to remove all files except those with the specified extensions:
$excludedExtensions = @(".doc", ".docx", ".ppt", ".pptx",".pdf")
Get-ChildItem -Path . -Recurse -File | Where-Object { $excludedExtensions -notcontains $_.Extension } | Remove-Item -Force $excludedExtensions = @(".doc", ".docx", ".ppt", ".pptx",".pdf"): This creates an array of the file extensions that you want to exclude from deletion.
Get-ChildItem -Path . -Recurse -File: This cmdlet retrieves all files in the current directory (.) and its subdirectories (-Recurse). The -File switch ensures that only files are returned.
Where-Object { $excludedExtensions -notcontains $_.Extension }: This filters the files to include only those whose extensions are not in the $excludedExtensions array.
Remove-Item -Force: This cmdlet removes the filtered files. The -Force switch allows the removal of read-only files.
The PowerShell script above will delete all files in the specified folder and its subfolders except those with the extensions .doc, .docx, .ppt, .pptx and .pdf. Be very careful when running scripts that delete files, especially recursively. It's a good practice to first run the script without the Remove-Item cmdlet to verify which files will be deleted. You can do this by replacing Remove-Item -Force with Select-Object -ExpandProperty FullName to list the files that would be deleted (keep ".pdf" in the array as well, so the preview matches the deletion exactly):$excludedExtensions = @(".doc", ".docx", ".ppt", ".pptx", ".pdf")
Get-ChildItem -Path . -Recurse -File | Where-Object { $excludedExtensions -notcontains $_.Extension } | Select-Object -ExpandProperty FullName
This will output the paths of the files that would be deleted, allowing you to review them before actually deleting the files.]]></description><link>online-vault/software-engineering/removing-files-&amp;-folders-using-windows-powershell.html</link><guid isPermaLink="false">Online Vault/Software Engineering/Removing files &amp; folders using windows powershell.md</guid><pubDate>Thu, 30 Jan 2025 16:12:15 GMT</pubDate></item><item><title><![CDATA[multivariate distributions]]></title><description><![CDATA[Let's break down the concept of multivariate distributions or <a data-tooltip-position="top" aria-label="https://en.wikipedia.org/wiki/Joint_probability_distribution" rel="noopener nofollow" class="external-link is-unresolved" href="https://en.wikipedia.org/wiki/Joint_probability_distribution" target="_self">Joint probability distribution</a> in increasing levels of difficulty, from an intuitive explanation to a more mathematically rigorous definition.<br>
<img alt="multivariate_normal.png" src="online-vault/images/multivariate_normal.png" target="_self">Multivariate Distribution:
Imagine you have a dataset with multiple variables, like the height, weight, and age of individuals. A multivariate distribution describes how all these variables are distributed together. It tells you the likelihood of different combinations of these variables occurring. For example, it might tell you how likely it is to find someone who is 1.80 m tall, weighs 70 kg, and is 30 years old.
Multivariate Distribution:
A multivariate distribution is a probability distribution that describes the behavior of multiple random variables simultaneously. It tells you the probability of different combinations of values for these variables occurring together. For example, if you have two variables ( X ) and ( Y ), the multivariate distribution describes the joint probability of ( X ) and ( Y ).
Multivariate Distribution:
In a multivariate distribution, each combination of values for the variables has an associated probability. This is often visualized as a multi-dimensional space where the height of the surface at any point represents the probability of that combination of values. For two variables, this can be thought of as a 3D surface plot. For more variables, it becomes harder to visualize but the concept remains the same.<br>
<img alt="multivariate_normal_3D.png" src="online-vault/images/multivariate_normal_3d.png" target="_self">
Multivariate Distribution:
Suppose ( X_1, \ldots, X_n ) are discrete random variables. The multivariate distribution is described by the joint probability mass function ( P(X_1 = x_1, \ldots, X_n = x_n) ). This function gives the probability of each possible combination of values for the variables.
Multivariate Distribution:
For continuous random variables ( X_1, \ldots, X_n ), the multivariate distribution is described by the joint probability density function ( f(x_1, \ldots, x_n) ). This function gives the density of probability for each possible combination of values for the variables. The total probability is obtained by integrating this density function over all possible values:
[
\int \cdots \int f(x_1, \ldots, x_n) \, dx_1 \cdots dx_n = 1
]
Multivariate Distribution:
In a multivariate distribution, the variables are often not independent. The dependence between variables can be captured by the covariance matrix, which describes how changes in one variable are related to changes in another. For a multivariate normal distribution of dimension ( k ), the joint density function is given by:
[
f(\mathbf{x}) = \frac{1}{(2\pi)^{k/2} |\Sigma|^{1/2}} \exp\!\left( -\frac{1}{2} (\mathbf{x} - \boldsymbol{\mu})^\top \Sigma^{-1} (\mathbf{x} - \boldsymbol{\mu}) \right)
]
where ( \mathbf{x} ) is the vector of variables, ( \boldsymbol{\mu} ) is the mean vector, and ( \Sigma ) is the covariance matrix.
Multivariate Distribution:
<br>Advanced topics in multivariate distributions include the <a data-href="Copula Model" href="online-vault/ml-concepts/copula-model.html" class="internal-link" target="_self" rel="noopener nofollow">Copula Model</a>, which models the dependence structure separately from the <a data-href="marginal distributions" href="online-vault/ml-concepts/marginal-distributions.html" class="internal-link" target="_self" rel="noopener nofollow">marginal distributions</a>, and multivariate time series, which model the joint behavior of multiple variables over time. Applications include portfolio optimization in finance, where the joint distribution of asset returns is modeled, and <a data-href="spatial statistics" href=".html" class="internal-link" target="_self" rel="noopener nofollow">spatial statistics</a>, where the joint distribution of variables over a geographic area is analyzed.
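To make the normal case concrete, here is a minimal pure-Python sketch of the bivariate ( k = 2 ) normal density; the function name and its parameterization (standard deviations plus a correlation ( \rho )) are illustrative choices:

```python
import math

def bivariate_normal_pdf(x, y, mu_x=0.0, mu_y=0.0,
                         sigma_x=1.0, sigma_y=1.0, rho=0.0):
    """Joint density of a bivariate normal with correlation rho.

    This is the k = 2 case of the multivariate normal density: the
    covariance matrix is [[sigma_x**2, rho*sigma_x*sigma_y],
                          [rho*sigma_x*sigma_y, sigma_y**2]].
    """
    zx = (x - mu_x) / sigma_x          # standardized coordinates
    zy = (y - mu_y) / sigma_y
    one_m_r2 = 1.0 - rho * rho
    norm = 2.0 * math.pi * sigma_x * sigma_y * math.sqrt(one_m_r2)
    expo = -(zx * zx - 2.0 * rho * zx * zy + zy * zy) / (2.0 * one_m_r2)
    return math.exp(expo) / norm
```

With ( \rho = 0 ) the joint density factors into the product of the two univariate normal densities, which is a handy sanity check of the implementation.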
The concept of a multivariate distribution starts with an intuitive idea of describing the joint behavior of multiple variables and progresses to a mathematically precise definition involving joint probability functions and dependence structures. Understanding multivariate distributions is crucial for analyzing complex systems where multiple variables interact, and it has wide-ranging applications in fields such as finance, ecology, and engineering.]]></description><link>online-vault/ml-concepts/multivariate-distributions.html</link><guid isPermaLink="false">Online Vault/ML concepts/multivariate distributions.md</guid><pubDate>Wed, 29 Jan 2025 15:01:30 GMT</pubDate><enclosure url="." length="0" type="false"/><content:encoded>&lt;figure&gt;&lt;img src=&quot;.&quot;&gt;&lt;/figure&gt;</content:encoded></item><item><title><![CDATA[multivariate_normal_3D]]></title><description><![CDATA[<img src="online-vault/images/multivariate_normal_3d.png" target="_self">]]></description><link>online-vault/images/multivariate_normal_3d.html</link><guid isPermaLink="false">Online Vault/Images/multivariate_normal_3D.png</guid><pubDate>Wed, 29 Jan 2025 14:57:55 GMT</pubDate><enclosure url="." length="0" type="false"/><content:encoded>&lt;figure&gt;&lt;img src=&quot;.&quot;&gt;&lt;/figure&gt;</content:encoded></item><item><title><![CDATA[multivariate_normal]]></title><description><![CDATA[<img src="online-vault/images/multivariate_normal.png" target="_self">]]></description><link>online-vault/images/multivariate_normal.html</link><guid isPermaLink="false">Online Vault/Images/multivariate_normal.png</guid><pubDate>Wed, 29 Jan 2025 14:56:17 GMT</pubDate><enclosure url="." 
length="0" type="false"/><content:encoded>&lt;figure&gt;&lt;img src=&quot;.&quot;&gt;&lt;/figure&gt;</content:encoded></item><item><title><![CDATA[marginal distributions]]></title><description><![CDATA[Let's break down the concept of a marginal distribution in increasing levels of difficulty, from an intuitive explanation to a more mathematically rigorous definition.Marginal Distribution:
Imagine you have a dataset with multiple variables (e.g., height and weight of individuals). The marginal distribution of one variable (e.g., height) tells you how that variable is distributed on its own, ignoring the other variables. It's like looking at the distribution of heights without considering the weights.
Marginal Distribution:
In a dataset with two variables, say ( X ) and ( Y ), the marginal distribution of ( X ) is the distribution of ( X ) when you ignore ( Y ). It shows the probability of different values of ( X ) occurring, regardless of the values of ( Y ).
Marginal Distribution:
For a bivariate distribution (two variables), the marginal distribution of ( X ) can be thought of as the "shadow" or "projection" of the joint distribution onto the ( X )-axis. It summarizes the distribution of ( X ) by integrating out the influence of ( Y ).
Marginal Distribution:
Suppose ( X ) and ( Y ) are discrete random variables with a joint probability mass function ( P(X = x, Y = y) ). The marginal distribution of ( X ) is given by:
[
P(X = x) = \sum_{y} P(X = x, Y = y)
]
This means you sum the joint probabilities over all possible values of ( Y ) to get the marginal probability of ( X ).
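The summation above can be sketched on a toy joint pmf (the dictionary layout and the probability values are invented for illustration):

```python
# Toy joint pmf P(X = x, Y = y) stored as a dict; values sum to 1.
joint = {
    (0, 0): 0.10, (0, 1): 0.30,
    (1, 0): 0.25, (1, 1): 0.35,
}

def marginal_x(joint_pmf):
    """P(X = x) = sum over y of P(X = x, Y = y)."""
    m = {}
    for (x, y), p in joint_pmf.items():
        m[x] = m.get(x, 0.0) + p   # accumulate over all y for this x
    return m
```

Summing over ( y ) for each ( x ) gives P(X = 0) = 0.40 and P(X = 1) = 0.60, and the marginal probabilities still sum to 1.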
Marginal Distribution:
For continuous random variables ( X ) and ( Y ) with a joint probability density function ( f(x, y) ), the marginal distribution of ( X ) is given by:
[
f_X(x) = \int_{-\infty}^{\infty} f(x, y) \, dy
]
This means you integrate the joint density function over all possible values of ( Y ) to get the marginal density of ( X ).
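The same integration can be sketched numerically. Here the joint density is assumed to be that of two independent standard normals (an assumption made so the true marginal is known); integrating out ( y ) with a midpoint Riemann sum should recover the standard normal density, and the bounds and step count are arbitrary choices:

```python
import math

def joint_density(x, y):
    """Example joint density: two independent standard normals."""
    return math.exp(-(x * x + y * y) / 2.0) / (2.0 * math.pi)

def marginal_density_x(x, y_min=-8.0, y_max=8.0, steps=20000):
    """Approximate f_X(x) = integral of f(x, y) dy by a midpoint sum."""
    dy = (y_max - y_min) / steps
    return sum(joint_density(x, y_min + (i + 0.5) * dy)
               for i in range(steps)) * dy
```

Because the variables are independent, the numeric marginal at any ( x ) should agree closely with the univariate standard normal density at that point.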
Marginal Distribution:
For a multivariate distribution with random variables ( X_1, \ldots, X_n ), the marginal distribution of a subset of these variables, say ( X_1 ) and ( X_2 ), is obtained by integrating (or summing, in the discrete case) the joint distribution over the remaining variables:
[
f_{X_1, X_2}(x_1, x_2) = \int \cdots \int f(x_1, \ldots, x_n) \, dx_3 \cdots dx_n
]
This generalizes the concept to any number of variables, showing how to "marginalize" over the unwanted variables to focus on the distribution of the variables of interest.
The concept of a marginal distribution starts with an intuitive idea of looking at one variable's distribution while ignoring others and progresses to a mathematically precise definition involving integration or summation over the joint distribution. Understanding marginal distributions is crucial for analyzing multivariate data and focusing on the behavior of individual variables within a complex system.]]></description><link>online-vault/ml-concepts/marginal-distributions.html</link><guid isPermaLink="false">Online Vault/ML concepts/marginal distributions.md</guid><pubDate>Wed, 29 Jan 2025 14:53:35 GMT</pubDate></item><item><title><![CDATA[Copula Model]]></title><description><![CDATA[A <a data-tooltip-position="top" aria-label="https://en.wikipedia.org/wiki/Copula_(statistics)" rel="noopener nofollow" class="external-link is-unresolved" href="https://en.wikipedia.org/wiki/Copula_(statistics)" target="_self">copula</a> model is a statistical tool used to describe and model the dependence structure between multiple random variables. Copulas allow for the construction of <a data-tooltip-position="top" aria-label="https://en.wikipedia.org/wiki/Joint_probability_distribution" rel="noopener nofollow" class="external-link is-unresolved" href="https://en.wikipedia.org/wiki/Joint_probability_distribution" target="_self">multivariate distributions</a> with specified <a data-href="marginal distributions" href="online-vault/ml-concepts/marginal-distributions.html" class="internal-link" target="_self" rel="noopener nofollow">marginal distributions</a> and dependence structures. This makes them particularly useful in fields where understanding the joint behavior of multiple variables is crucial, such as finance, hydrology, and environmental science. Copula Function: A copula is a multivariate cumulative distribution function (CDF) that links the marginal distributions of individual variables to their joint distribution. 
Formally, a copula ( C ) is a function that maps the unit hypercube ( [0, 1]^d ) to ( [0, 1] ) and satisfies certain properties, including groundedness, uniform marginals, and the 2-increasing property. <br>
Sklar's Theorem:
Sklar's theorem states that any multivariate&nbsp;<a data-tooltip-position="top" aria-label="https://en.wikipedia.org/wiki/Cumulative_distribution_function#Multivariate_case" rel="noopener nofollow" class="external-link is-unresolved" title="Cumulative distribution function" href="https://en.wikipedia.org/wiki/Cumulative_distribution_function#Multivariate_case" target="_self">joint distribution</a>&nbsp;can be written in terms of univariate&nbsp;<a data-tooltip-position="top" aria-label="https://en.wikipedia.org/wiki/Marginal_distribution" rel="noopener nofollow" class="external-link is-unresolved" title="Marginal distribution" href="https://en.wikipedia.org/wiki/Marginal_distribution" target="_self">marginal distribution</a>&nbsp;functions and a copula which describes the dependence structure between the variables. Sklar's theorem is fundamental to copula theory: any multivariate CDF ( H ) can be decomposed into its marginal CDFs ( F_1, \ldots, F_d ) and a copula ( C ):
[
H(x_1, \ldots, x_d) = C(F_1(x_1), \ldots, F_d(x_d))
]
Conversely, given a copula ( C ) and marginal CDFs ( F_1, \ldots, F_d ), the joint CDF ( H ) can be constructed using the same equation. Types of Copulas: Gaussian Copula: Based on the multivariate normal distribution, it captures linear dependence and is parameterized by a correlation matrix.
t-Copula: Similar to the Gaussian copula but allows for tail dependence, making it useful for modeling extreme events.
Archimedean Copulas: Includes families like Clayton, Gumbel, and Frank copulas, which are often used for their flexibility in capturing different types of dependence structures.
Empirical Copula: Constructed directly from data, it provides a non-parametric way to model dependence. Hydrology: In hydrology, copulas are used to model the joint behavior of hydrological variables, such as rainfall and river flow, to better understand and predict flood events. Environmental Science: Copulas can model the dependence between environmental variables, such as temperature and precipitation, to study climate change and its impacts. <br>Copula models provide a flexible and powerful framework for modeling the dependence structure between multiple random variables. By separating the <a data-href="marginal distributions" href="online-vault/ml-concepts/marginal-distributions.html" class="internal-link" target="_self" rel="noopener nofollow">marginal distributions</a> from the dependence structure, copulas allow for the construction of complex <a data-href="multivariate distributions" href="online-vault/ml-concepts/multivariate-distributions.html" class="internal-link" target="_self" rel="noopener nofollow">multivariate distributions</a> tailored to specific applications. They are widely used in various fields to capture and analyze the joint behavior of interrelated variables, aiding in decision-making and risk management.]]></description><link>online-vault/ml-concepts/copula-model.html</link><guid isPermaLink="false">Online Vault/ML concepts/Copula Model.md</guid><pubDate>Wed, 29 Jan 2025 14:52:36 GMT</pubDate></item><item><title><![CDATA[Reward Distribution in the Multi-Armed Bandit Problem]]></title><description><![CDATA[In the multi-armed bandit problem, each arm is associated with a reward <a data-href="Probability Distribution" href="online-vault/images/probability-distribution.html" class="internal-link" target="_self" rel="noopener nofollow">Probability Distribution</a>, which describes the possible outcomes and their probabilities when that arm is pulled. 
The reward distribution is a fundamental concept that captures the uncertainty and variability in the rewards obtained from each arm. Understanding and modeling this distribution is crucial for designing effective algorithms that balance exploration and exploitation.<br>Probability Distribution
In&nbsp;<a data-tooltip-position="top" aria-label="https://en.wikipedia.org/wiki/Probability_theory" rel="noopener nofollow" class="external-link is-unresolved" title="Probability theory" href="https://en.wikipedia.org/wiki/Probability_theory" target="_self">probability theory</a>&nbsp;and&nbsp;<a data-tooltip-position="top" aria-label="https://en.wikipedia.org/wiki/Statistics" rel="noopener nofollow" class="external-link is-unresolved" title="Statistics" href="https://en.wikipedia.org/wiki/Statistics" target="_self">statistics</a>, a&nbsp;probability distribution&nbsp;is the mathematical&nbsp;<a data-tooltip-position="top" aria-label="https://en.wikipedia.org/wiki/Function_(mathematics)" rel="noopener nofollow" class="external-link is-unresolved" title="Function (mathematics)" href="https://en.wikipedia.org/wiki/Function_(mathematics)" target="_self">function</a>&nbsp;that gives the probabilities of occurrence of possible&nbsp;outcomes&nbsp;for an&nbsp;<a data-tooltip-position="top" aria-label="https://en.wikipedia.org/wiki/Experiment_(probability_theory)" rel="noopener nofollow" class="external-link is-unresolved" title="Experiment (probability theory)" href="https://en.wikipedia.org/wiki/Experiment_(probability_theory)" target="_self">experiment</a>.It is a mathematical description of a&nbsp;<a data-tooltip-position="top" aria-label="https://en.wikipedia.org/wiki/Randomness" rel="noopener nofollow" class="external-link is-unresolved" title="Randomness" href="https://en.wikipedia.org/wiki/Randomness" target="_self">random</a>&nbsp;phenomenon in terms of its&nbsp;<a data-tooltip-position="top" aria-label="https://en.wikipedia.org/wiki/Sample_space" rel="noopener nofollow" class="external-link is-unresolved" title="Sample space" href="https://en.wikipedia.org/wiki/Sample_space" target="_self">sample space</a>&nbsp;and the&nbsp;<a data-tooltip-position="top" aria-label="https://en.wikipedia.org/wiki/Probability" rel="noopener nofollow" class="external-link 
is-unresolved" title="Probability" href="https://en.wikipedia.org/wiki/Probability" target="_self">probabilities</a>&nbsp;of&nbsp;<a data-tooltip-position="top" aria-label="https://en.wikipedia.org/wiki/Event_(probability_theory)" rel="noopener nofollow" class="external-link is-unresolved" title="Event (probability theory)" href="https://en.wikipedia.org/wiki/Event_(probability_theory)" target="_self">events</a>&nbsp;(<a data-tooltip-position="top" aria-label="https://en.wikipedia.org/wiki/Subset" rel="noopener nofollow" class="external-link is-unresolved" title="Subset" href="https://en.wikipedia.org/wiki/Subset" target="_self">subsets</a>&nbsp;of the sample space).<br>Bernoulli Distribution: In many bandit problems, especially with binary outcomes (e.g., success or failure), the reward distribution is modeled as a Bernoulli distribution. Each arm has a probability p of yielding a reward of 1 (success) and a probability 1 − p of yielding a reward of 0 (failure). Gaussian (Normal) Distribution: When the rewards are continuous, they are often modeled using a Gaussian distribution. Each arm has a mean reward μ and a variance σ². The reward r obtained from pulling an arm is a random variable drawn from N(μ, σ²). Poisson Distribution: In scenarios where the rewards represent counts of events (e.g., number of clicks on an ad), a Poisson distribution might be used. Each arm has a rate parameter λ, and the reward is a random variable drawn from Poisson(λ). Exponential Distribution: For problems where the rewards represent waiting times or intervals (e.g., time until the next event), an exponential distribution might be appropriate. Each arm has a rate parameter λ, and the reward is a random variable drawn from Exponential(λ). To make informed decisions, the agent needs to estimate the reward distribution of each arm.
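As an illustrative sketch, rewards from the four families above can be simulated with only the Python standard library (all parameter values here are made up for demonstration; Poisson sampling uses Knuth's classic algorithm since the stdlib has no built-in):

```python
import math
import random

random.seed(0)

def bernoulli(p):
    # Reward 1 with probability p, else 0.
    return 1 if random.random() < p else 0

def gaussian(mu, sigma):
    # Continuous reward drawn from N(mu, sigma^2).
    return random.gauss(mu, sigma)

def poisson(lam):
    # Knuth's algorithm: count uniform draws until their running
    # product drops below e^(-lambda).
    threshold, k, prod = math.exp(-lam), 0, random.random()
    while prod > threshold:
        k += 1
        prod *= random.random()
    return k

def exponential(lam):
    # Waiting time with rate lambda (mean 1/lambda).
    return random.expovariate(lam)

# One pull of each hypothetical arm:
print(bernoulli(0.3), gaussian(1.0, 0.5), poisson(4.0), exponential(2.0))
```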
This is typically done through sampling: the agent pulls an arm and observes the reward, updating its estimate of the reward distribution based on the observed data. Common techniques for estimating the reward distribution include: Sample Mean: For Gaussian distributions, the sample mean is a straightforward and unbiased estimator of the true mean reward. The agent can compute the average reward obtained from pulling an arm multiple times. Maximum Likelihood Estimation (MLE): MLE is a general method for estimating the parameters of a distribution. For Bernoulli distributions, the MLE of the success probability p is the proportion of successful outcomes observed. Bayesian Estimation: Bayesian methods incorporate prior beliefs about the reward distribution and update these beliefs using observed data. For Bernoulli distributions, a Beta prior can be used, and the posterior distribution is updated using Bayes' theorem. The reward distribution plays a critical role in the exploration-exploitation trade-off. Exploration involves pulling arms to gather more information about their reward distributions, while exploitation involves pulling the arm with the highest estimated reward based on current information. Algorithms like Upper Confidence Bound (UCB) and Thompson Sampling use the estimated reward distributions to balance this trade-off effectively. In summary, the reward distribution in the multi-armed bandit problem describes the possible rewards and their probabilities for each arm.
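The Beta-Bernoulli update and Thompson Sampling described above can be sketched with the standard library alone (the three arm probabilities are invented for illustration):

```python
import random

random.seed(42)

true_p = [0.2, 0.5, 0.7]           # hidden Bernoulli success rates (made up)
alpha = [1.0] * len(true_p)         # Beta(1, 1) prior = uniform
beta = [1.0] * len(true_p)

for _ in range(2000):
    # Thompson Sampling: draw one sample per arm from its posterior,
    # then pull the arm whose sample is largest.
    draws = [random.betavariate(alpha[i], beta[i]) for i in range(len(true_p))]
    arm = max(range(len(true_p)), key=lambda i: draws[i])
    reward = 1 if random.random() < true_p[arm] else 0
    # Conjugate update: successes add to alpha, failures to beta.
    alpha[arm] += reward
    beta[arm] += 1 - reward

# Posterior mean estimate of each arm's success probability.
estimates = [alpha[i] / (alpha[i] + beta[i]) for i in range(len(true_p))]
best = max(range(len(true_p)), key=lambda i: estimates[i])
print(best, [round(e, 2) for e in estimates])
```

After 2000 rounds the posterior concentrates on the best arm, which then receives the bulk of the pulls.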
Understanding and estimating these distributions is essential for designing algorithms that can make optimal decisions in the face of uncertainty.]]></description><link>online-vault/ml-concepts/reward-distribution-in-the-multi-armed-bandit-problem.html</link><guid isPermaLink="false">Online Vault/ML concepts/Reward Distribution in the Multi-Armed Bandit Problem.md</guid><pubDate>Tue, 31 Mar 2026 13:40:31 GMT</pubDate></item><item><title><![CDATA[ML Tools Nebulae]]></title><description><![CDATA[Python Libraries<br>Software Engineering Tools<br>Closed-Source Giants<br>ML &amp; DL Frameworks<br>HuggingFace<br>Hardware &amp;&nbsp;Model Deployment<br>Package management<br>Code completion AI<br>Experiment Tracking and Versioning:
MLflow: An open-source platform to manage the ML lifecycle, including experimentation, reproducibility, and deployment. Weights &amp; Biases (W&amp;B): A tool for tracking experiments and visualizing results.
DVC (Data Version Control): An open-source tool for versioning datasets and machine learning models.<br>CUDA<br>GPU<br>Nvidia NIM<br>Docker: A platform to package, ship, and run applications as lightweight containers.<br>Kubernetes: An open-source platform designed to automate deploying, scaling, and operating application containers.<br>Anaconda
Scikit-Learn: A simple and efficient tool for predictive data analysis, built on NumPy, SciPy, and Matplotlib.
XGBoost<br>CatBoost<br>PyTorch
TensorFlow: An open-source machine learning framework developed by Google.
PIP &amp; Python Package Index (PyPI)<br>Conda<br>NumPy: A fundamental package for scientific computing with Python.
Plotly: An open-source graphing library that makes interactive, publication-quality graphs online.
Pandas: A powerful data manipulation library in Python.
Matplotlib: A comprehensive library for creating static, animated, and interactive visualizations in Python. Seaborn: A Python visualization library based on Matplotlib that provides a high-level interface for drawing attractive and informative statistical graphics.
(different from GitHub Copilot)<br>Dall-E (image generation)<br>SORA (video generation)<br>A technology company known for its graphics processing units (GPUs), which are widely used in machine learning and AI.<br><a data-tooltip-position="top" aria-label="https://www.langchain.com/" rel="noopener nofollow" class="external-link is-unresolved" href="https://www.langchain.com/" target="_self">LangChain</a> is an <a data-tooltip-position="top" aria-label="https://github.com/langchain-ai/langchain" rel="noopener nofollow" class="external-link is-unresolved" href="https://github.com/langchain-ai/langchain" target="_self">open-source</a> (MIT License) framework designed to simplify the development of applications that leverage language models. It provides a suite of tools and abstractions that make it easier to build, test, and deploy language model-based applications.<br><img alt="langchain_framekwork.png" src="online-vault/software-engineering/langchain_framekwork.png" target="_self">
Here's a summary of what LangChain does: Modular Components: LangChain offers modular components that can be combined to create complex language model applications. These components include prompts, models, chains, agents, memory, and more. Prompt Management: LangChain helps manage prompts, which are the inputs given to language models. It provides tools to create, format, and optimize prompts for better results. Chains: LangChain allows chaining together multiple calls to language models, enabling complex, multi-step workflows. These chains can be simple sequential calls or more complex branching logic. Agents: LangChain provides agents that can use tools and make decisions based on the outputs of language models. Agents can perform tasks like web browsing, API interaction, or even using other language models. Memory: LangChain includes memory systems that allow language models to maintain context across multiple interactions. This is crucial for building conversational agents and other stateful applications. Evaluation: LangChain offers tools for evaluating and testing language model applications, helping developers to iterate and improve their systems. Integration: LangChain is designed to be model-agnostic, allowing it to integrate with various language models and APIs, including those from Hugging Face, OpenAI, and others. In essence, LangChain is a framework that helps developers harness the power of language models more effectively, enabling the creation of sophisticated AI applications with less effort.<br>LangChain 🦜<br>LLamaIndex<br>NOMIC AI :
Focuses on creating models that understand and generate natural language.<br>Nomic Atlas<br>GPT4All<br>Nomic Embed text &amp; vision<br>ChatGPT<br>LLamaParse<br>Green : Open-Source
Orange : Closed-Source
Blue : Development tools<br>LLama Models<br>HF Spaces<br>Courses &amp; Guides<br>Hugging Face leaderboards<br>The Hugging Face Transformers library is a popular open-source library that provides pre-trained models for natural language processing (NLP) tasks. It is built on top of PyTorch and TensorFlow, offering a wide range of architectures and pre-trained weights for various NLP tasks such as text classification, question answering, language translation, and more. To get started with the Transformers library, you can install it using pip: <code>pip install transformers</code>
The Transformers library offers several key features that make it a powerful tool for NLP tasks: Pre-trained Models: The library provides a vast collection of pre-trained models, including BERT, RoBERTa, DistilBERT, T5, and many others. These models have been trained on large datasets and can be fine-tuned for specific tasks. Model Architectures: The library supports a wide range of model architectures, including transformers, convolutional neural networks (CNNs), and recurrent neural networks (RNNs). This allows you to choose the architecture that best suits your task. Tokenizers: The library includes tokenizers for various languages and models, making it easy to preprocess text data for input into the models. Datasets: The library integrates with the Hugging Face Datasets library, providing access to a wide range of datasets for training and evaluation. Trainer API: The library provides a high-level Trainer API that simplifies the training and evaluation of models. The Trainer API handles many of the details of training, such as data loading, optimization, and evaluation. <br>For more detailed information, you can refer to the official <a data-tooltip-position="top" aria-label="https://huggingface.co/transformers/" rel="noopener nofollow" class="external-link is-unresolved" href="https://huggingface.co/transformers/" target="_self">Hugging Face Transformers documentation</a>. The documentation provides comprehensive guides, tutorials, and API references to help you get the most out of the library.Additionally, the Hugging Face community is very active, and you can find many resources, including forums, blog posts, and code examples, to help you with your NLP projects.The Hugging Face Transformers library is a powerful and flexible tool for natural language processing. With its wide range of pre-trained models, model architectures, and easy-to-use APIs, it enables developers to build and deploy state-of-the-art NLP applications with ease. 
Whether you are a beginner or an experienced NLP practitioner, the Transformers library has something to offer you.<br>Hugging Face Transformers library<br>Models<br>NOUS Research:
A research group dedicated to advancing open and accessible AI.
Aims to develop large-scale AI models and share findings publicly.
Web IDEs<br>JupyterLab: An interactive development environment for working with notebooks, code, and data.<br>IntelliJ AI plugin
(API endpoint of OpenAI) :
GitHub Copilot: An AI-powered code completion tool that assists developers by suggesting code snippets.
Codeium :
A tool that helps in code completion and generation.<br>Mistral AI :
A cutting-edge AI company based in Paris, France.
Develops large language models and AI tools for various applications.
Gemma models<br>Systems that track changes to source code over time, allowing multiple people to collaborate on code.<br>Microsoft Visual Studio: A comprehensive IDE developed by Microsoft, supporting a wide range of programming languages and platforms.<br>Git :
A distributed version control system that tracks changes in source code during software development.<br>GitLab : A web-based Git repository manager providing version control, issue tracking, and CI/CD.<br>Google Colab: A cloud-based Jupyter notebook environment that requires no setup and runs entirely in the cloud.<br>Jupyter Notebooks: An open-source web application that allows you to create and share documents that contain live code, equations, visualizations, and narrative text.<br>Software that combines multiple development tools into one graphical user interface.<br>PyCharm :
An IDE specifically designed for Python development, offering tools for coding, debugging, and testing.<br>DataSpell: An IDE for data science and machine learning.<br>GitHub (Microsoft Owned): A web-based hosting service for version control using Git.<br>Intro : <a rel="noopener nofollow" class="external-link is-unresolved" href="https://keras.io/getting_started/intro_to_keras_for_engineers/" target="_self">https://keras.io/getting_started/intro_to_keras_for_engineers/</a><br>Keras 3 is a deep learning framework that works with TensorFlow, JAX, and PyTorch interchangeably. This notebook will walk you through key Keras 3 workflows. All Keras models can be trained and evaluated on a wide variety of data sources, independently of the backend you're using. This includes:
NumPy arrays
Pandas dataframes
<br>TensorFlow&nbsp;<a data-tooltip-position="top" aria-label="https://www.tensorflow.org/api_docs/python/tf/data/Dataset" rel="noopener nofollow" class="external-link is-unresolved" href="https://www.tensorflow.org/api_docs/python/tf/data/Dataset" target="_self"><code></code></a>tf.data.Dataset&nbsp;objects
PyTorch&nbsp;DataLoader&nbsp;objects
Keras&nbsp;PyDataset&nbsp;objects
The Keras project isn't limited to the core Keras API for building and training neural networks. It spans a wide range of related initiatives that cover every step of the machine learning workflow.<br><a data-tooltip-position="top" aria-label="https://keras.io/keras_tuner/" rel="noopener nofollow" class="external-link is-unresolved" href="https://keras.io/keras_tuner/" target="_self">KerasTuner Documentation</a>&nbsp;-&nbsp;<a data-tooltip-position="top" aria-label="https://github.com/keras-team/keras-tuner" rel="noopener nofollow" class="external-link is-unresolved" href="https://github.com/keras-team/keras-tuner" target="_self">KerasTuner GitHub repository</a>KerasTuner is an easy-to-use, scalable hyperparameter optimization framework that solves the pain points of hyperparameter search. Easily configure your search space with a define-by-run syntax, then leverage one of the available search algorithms to find the best hyperparameter values for your models. KerasTuner comes with Bayesian Optimization, Hyperband, and Random Search algorithms built-in, and is also designed to be easy for researchers to extend in order to experiment with new search algorithms.<br><a data-tooltip-position="top" aria-label="https://keras.io/keras_hub/" rel="noopener nofollow" class="external-link is-unresolved" href="https://keras.io/keras_hub/" target="_self">KerasHub Documentation</a>&nbsp;-&nbsp;<a data-tooltip-position="top" aria-label="https://github.com/keras-team/keras-hub" rel="noopener nofollow" class="external-link is-unresolved" href="https://github.com/keras-team/keras-hub" target="_self">KerasHub GitHub repository</a>KerasHub is a natural language processing library that supports users through their entire development cycle. 
Our workflows are built from modular components that have state-of-the-art preset weights and architectures when used out-of-the-box and are easily customizable when more control is needed.<br><a data-tooltip-position="top" aria-label="https://keras.io/keras_cv/" rel="noopener nofollow" class="external-link is-unresolved" href="https://keras.io/keras_cv/" target="_self">KerasCV Documentation</a>&nbsp;-&nbsp;<a data-tooltip-position="top" aria-label="https://github.com/keras-team/keras-cv" rel="noopener nofollow" class="external-link is-unresolved" href="https://github.com/keras-team/keras-cv" target="_self">KerasCV GitHub repository</a>KerasCV is a repository of modular building blocks (layers, metrics, losses, data-augmentation) that applied computer vision engineers can leverage to quickly assemble production-grade, state-of-the-art training and inference pipelines for common use cases such as image classification, object detection, image segmentation, image data augmentation, etc.KerasCV can be understood as a horizontal extension of the Keras API: the components are new first-party Keras objects (layers, metrics, etc) that are too specialized to be added to core Keras, but that receive the same level of polish and backwards compatibility guarantees as the rest of the Keras API.<br><a data-tooltip-position="top" aria-label="https://autokeras.com/" rel="noopener nofollow" class="external-link is-unresolved" href="https://autokeras.com/" target="_self">AutoKeras Documentation</a>&nbsp;-&nbsp;<a data-tooltip-position="top" aria-label="https://github.com/keras-team/autokeras" rel="noopener nofollow" class="external-link is-unresolved" href="https://github.com/keras-team/autokeras" target="_self">AutoKeras GitHub repository</a><br>AutoKeras is an AutoML system based on Keras. 
It is developed by&nbsp;<a data-tooltip-position="top" aria-label="http://faculty.cs.tamu.edu/xiahu/index.html" rel="noopener nofollow" class="external-link is-unresolved" href="http://faculty.cs.tamu.edu/xiahu/index.html" target="_self">DATA Lab</a>&nbsp;at Texas A&amp;M University. The goal of AutoKeras is to make machine learning accessible for everyone. It provides high-level end-to-end APIs such as&nbsp;<a data-tooltip-position="top" aria-label="https://autokeras.com/tutorial/image_classification/" rel="noopener nofollow" class="external-link is-unresolved" href="https://autokeras.com/tutorial/image_classification/" target="_self"><code></code></a>ImageClassifier&nbsp;or&nbsp;<a data-tooltip-position="top" aria-label="https://autokeras.com/tutorial/text_classification/" rel="noopener nofollow" class="external-link is-unresolved" href="https://autokeras.com/tutorial/text_classification/" target="_self"><code></code></a>TextClassifier&nbsp;to solve machine learning problems in a few lines, as well as&nbsp;<a data-tooltip-position="top" aria-label="https://autokeras.com/tutorial/customized/" rel="noopener nofollow" class="external-link is-unresolved" href="https://autokeras.com/tutorial/customized/" target="_self">flexible building blocks</a>&nbsp;to perform architecture search.<br><code>import autokeras as ak<br>clf = ak.ImageClassifier()<br>clf.fit(x_train, y_train)<br>results = clf.predict(x_test)</code><br>Keras<br>Jax]]></description><link>online-vault/mind-maps/ml-tools-nebulae.html</link><guid isPermaLink="false">Online Vault/Mind maps/ML Tools Nebulae.canvas</guid><pubDate>Wed, 29 Jan 2025 13:14:33 GMT</pubDate><enclosure url="." 
length="0" type="false"/><content:encoded>&lt;figure&gt;&lt;img src=&quot;.&quot;&gt;&lt;/figure&gt;</content:encoded></item><item><title><![CDATA[Spatial autoregressive models]]></title><description><![CDATA[Spatial autoregressive models are statistical techniques used to analyze and model spatial data, also known as <a data-href="Spatial modelling" href="online-vault/spatial-data-science/spatial-modelling.html" class="internal-link" target="_self" rel="noopener nofollow">Spatial modelling</a>, where observations are correlated due to their spatial proximity.<br>
These models are particularly useful in fields like ecology, environmental science, <a data-href="Spatial Data Science" href="online-vault/spatial-data-science/spatial-data-science.html" class="internal-link" target="_self" rel="noopener nofollow">Spatial Data Science</a>, <a data-href="Spatial Data Analysis" href="online-vault/spatial-data-science/spatial-data-analysis.html" class="internal-link" target="_self" rel="noopener nofollow">Spatial Data Analysis</a> and geography, where understanding spatial dependencies is crucial for accurate predictions and mapping. Spatial Dependency: Spatial dependency refers to the phenomenon where the value of a variable at one location is influenced by the values of the same variable at neighboring locations. This is often captured using a spatial weights matrix, which defines the spatial relationships between different locations. Spatial Weights Matrix: The spatial weights matrix W is a key component of spatial autoregressive models. It is a square matrix where each element w_ij represents the strength of the spatial relationship between locations i and j. Common choices for W include binary contiguity (neighboring locations have a weight of 1, others have 0) and distance-based weights. Spatial Lag Model (SLM): The spatial lag model incorporates spatial dependency directly into the dependent variable. The model can be written as y = ρWy + Xβ + ε, where y is the vector of observations, ρ is the spatial autoregressive parameter, W is the spatial weights matrix, X is the matrix of explanatory variables, β is the vector of coefficients, and ε is the error term. Spatial Error Model (SEM): The spatial error model incorporates spatial dependency into the error term. The model can be written as y = Xβ + u with u = λWu + ε, where u is the spatially correlated error term, λ is the spatial autoregressive parameter for the error term, and ε is the independent error term. Spatial Durbin Model (SDM): The spatial Durbin model combines elements of both the spatial lag and spatial error models.
It includes spatial lags of both the dependent variable and the explanatory variables. The model can be written as y = ρWy + Xβ + WXθ + ε, where WX represents the spatial lags of the explanatory variables, and θ is the vector of coefficients for these spatial lags. Spatial autoregressive models are widely used in spatial ecological mapping to understand and predict the distribution of ecological variables, such as species abundance, habitat suitability, and environmental indicators. Here are some specific applications: Species Distribution Modeling: Spatial autoregressive models can be used to predict the distribution of species across a landscape. By incorporating spatial dependency, these models can capture the influence of neighboring habitats on species presence or abundance, leading to more accurate predictions. Habitat Suitability Mapping: Habitat suitability models assess the quality of habitats for specific species. Spatial autoregressive models can improve these assessments by accounting for the spatial correlation in habitat quality, which is often influenced by neighboring habitats. Environmental Indicator Mapping: Environmental indicators, such as air or water quality, often exhibit spatial dependency. Spatial autoregressive models can be used to map these indicators, providing insights into the spatial patterns and hotspots of environmental degradation or improvement. Spatial autoregressive models are powerful tools for analyzing and modeling spatial data, particularly in ecological mapping. By incorporating spatial dependency, these models provide more accurate predictions and insights into the spatial patterns of ecological variables.
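To make the weights-matrix idea concrete, here is a minimal sketch (the 2×2 grid, its rook-contiguity neighbour list, and the observation values are all invented for illustration) that builds a row-standardised binary contiguity matrix W and computes the spatial lag Wy:

```python
# Four locations on a 2x2 grid; rook contiguity (shared edges only).
neighbors = {0: [1, 2], 1: [0, 3], 2: [0, 3], 3: [1, 2]}
n = len(neighbors)

# Binary contiguity weights, row-standardised so each row sums to 1.
W = [[0.0] * n for _ in range(n)]
for i, nbrs in neighbors.items():
    for j in nbrs:
        W[i][j] = 1.0 / len(nbrs)

# Spatial lag Wy: each entry is the average of the neighbouring observations.
y = [1.0, 2.0, 4.0, 8.0]
Wy = [sum(W[i][j] * y[j] for j in range(n)) for i in range(n)]
print(Wy)  # [3.0, 4.5, 4.5, 3.0]
```

Row-standardisation is what turns the lag Wy into a neighbourhood average, which is the usual convention before estimating ρ in an SLM.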
They are essential for understanding the complex interactions and dependencies in spatial data, aiding in conservation efforts, environmental management, and ecological research.]]></description><link>online-vault/spatial-data-science/spatial-autoregressive-models.html</link><guid isPermaLink="false">Online Vault/Spatial Data Science/Spatial autoregressive models.md</guid><pubDate>Wed, 29 Jan 2025 13:06:49 GMT</pubDate></item><item><title><![CDATA[Multi-arm Bandit Problem]]></title><description><![CDATA[<img alt="multi-arm_bandit.png" src="online-vault/images/multi-arm_bandit.png" target="_self"><br>
The multi-armed bandit problem is a classic decision-making framework where an agent must choose actions (or "arms") to maximize cumulative rewards over time. Each arm is associated with a <a data-tooltip-position="top" aria-label="Reward Distribution in the Multi-Armed Bandit Problem" data-href="Reward Distribution in the Multi-Armed Bandit Problem" href="online-vault/ml-concepts/reward-distribution-in-the-multi-armed-bandit-problem.html" class="internal-link" target="_self" rel="noopener nofollow">reward distribution</a>, and the agent's goal is to balance exploration (trying new arms to gather information) and exploitation (choosing the arm with the highest known reward). This problem is fundamental in various fields, including online advertising, clinical trials, and financial portfolio management.Correlated bandits are a variant of the multi-armed bandit problem where the rewards of different arms are not independent but are correlated. This means that the reward obtained from one arm can provide information about the rewards of other arms. For example, in a recommendation system, the user's preference for one item might be correlated with their preference for another item.In correlated bandits, the agent can leverage these correlations to make more informed decisions. By observing the reward from one arm, the agent can update its beliefs about the rewards of other arms, potentially reducing the need for extensive exploration. This can lead to more efficient learning and better overall performance.<br>Algorithms for correlated bandits often incorporate models that capture the dependencies between arms, such as <a data-href="Gaussian processes" href=".html" class="internal-link" target="_self" rel="noopener nofollow">Gaussian processes</a> or <a data-href="Bayesian networks" href=".html" class="internal-link" target="_self" rel="noopener nofollow">Bayesian networks</a>. 
These models allow the agent to make predictions about the rewards of unobserved arms based on the observed rewards and the known correlations. By doing so, the agent can focus its exploration on the most informative arms, thereby improving the trade-off between exploration and exploitation.In summary, correlated bandits extend the traditional multi-armed bandit problem by introducing dependencies among the arms. This variant is particularly useful in scenarios where the rewards of different actions are interrelated, enabling more efficient decision-making.]]></description><link>online-vault/ml-concepts/multi-arm-bandit-problem.html</link><guid isPermaLink="false">Online Vault/ML concepts/Multi-arm Bandit Problem.md</guid><pubDate>Wed, 29 Jan 2025 10:42:30 GMT</pubDate><enclosure url="." length="0" type="false"/><content:encoded>&lt;figure&gt;&lt;img src=&quot;.&quot;&gt;&lt;/figure&gt;</content:encoded></item><item><title><![CDATA[multi-arm_bandit]]></title><description><![CDATA[<img src="online-vault/images/multi-arm_bandit.png" target="_self">]]></description><link>online-vault/images/multi-arm_bandit.html</link><guid isPermaLink="false">Online Vault/Images/multi-arm_bandit.png</guid><pubDate>Wed, 29 Jan 2025 10:42:18 GMT</pubDate><enclosure url="." length="0" type="false"/><content:encoded>&lt;figure&gt;&lt;img src=&quot;.&quot;&gt;&lt;/figure&gt;</content:encoded></item><item><title><![CDATA[Adapters]]></title><description><![CDATA[Links : <a rel="noopener nofollow" class="external-link is-unresolved" href="https://magazine.sebastianraschka.com/p/finetuning-llms-with-adapters" target="_self">https://magazine.sebastianraschka.com/p/finetuning-llms-with-adapters</a><br>Large language models (LLMs) like BERT, GPT-3, GPT-4, LLaMA, and others are trained on a large corpus of data and have general knowledge. However, they may not perform as well on specific tasks without finetuning. 
For example, if you want to use a pretrained LLM for analyzing legal or medical documents, finetuning it on a corpus of legal documents can significantly improve the model's performance. (Interested readers can find an overview of different LLM finetuning methods in my previous article,&nbsp;<a data-tooltip-position="top" aria-label="https://magazine.sebastianraschka.com/p/finetuning-large-language-models" rel="noopener nofollow" class="external-link is-unresolved" href="https://magazine.sebastianraschka.com/p/finetuning-large-language-models" target="_self">Finetuning Large Language Models: An Introduction To The Core Ideas And Approaches</a>.)However, finetuning LLMs can be very expensive in terms of computational resources and time, which is why researchers started developing parameter-efficient finetuning methods.<br>As discussed in a previous article, many different types of parameter-efficient methods are out there.&nbsp;<a data-tooltip-position="top" aria-label="https://magazine.sebastianraschka.com/p/understanding-parameter-efficient" rel="noopener nofollow" class="external-link is-unresolved" href="https://magazine.sebastianraschka.com/p/understanding-parameter-efficient" target="_self">In an earlier post, I wrote about prompt and prefix tuning</a>. (Although the techniques are somewhat related, you don't need to know or read about prefix tuning before reading this article about adapters.)In a nutshell, prompt tuning (different from prompting) appends a tensor to the embedded inputs of a pretrained LLM. The tensor is then tuned to optimize a loss function for the finetuning task and data while all other parameters in the LLM remain frozen. For example, imagine an LLM pretrained on a general dataset to generate texts. 
Prompt (fine)tuning would entail taking this pretrained LLM, adding prompt tokens to the embedded inputs, and then finetuning the LLM to perform, for example, sentiment classification on a finetuning dataset.The main idea behind prompt tuning, and parameter-efficient finetuning methods in general, is to add a small number of new parameters to a pretrained LLM and only finetune the newly added parameters to make the LLM perform better on (a) a target dataset (for example, a domain-specific dataset like medical or legal documents) and (b) a target task (for example, sentiment classification).<br><img alt="PEFT_adapters.png" src="online-vault/images/peft_adapters.png" target="_self"><br>The original&nbsp;adapter&nbsp;method (<a data-tooltip-position="top" aria-label="https://arxiv.org/abs/1902.00751" rel="noopener nofollow" class="external-link is-unresolved" href="https://arxiv.org/abs/1902.00751" target="_self">Houlsby et al.</a>&nbsp;2019) is somewhat related to the aforementioned&nbsp;<a data-tooltip-position="top" aria-label="https://magazine.sebastianraschka.com/p/understanding-parameter-efficient" rel="noopener nofollow" class="external-link is-unresolved" href="https://magazine.sebastianraschka.com/p/understanding-parameter-efficient" target="_self">prefix tuning</a>&nbsp;<a data-tooltip-position="top" aria-label="https://magazine.sebastianraschka.com/p/understanding-parameter-efficient" rel="noopener nofollow" class="external-link is-unresolved" href="https://magazine.sebastianraschka.com/p/understanding-parameter-efficient" target="_self">method</a>&nbsp;as they also add additional parameters to each transformer block. However, while prefix tuning prepends tunable tensors to the embeddings, the adapter method adds adapter layers in two places, as illustrated in the figure below.<br>
<img alt="adapter_layer.png" src="online-vault/images/adapter_layer.png" target="_self">
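A minimal sketch of such an adapter block in PyTorch (hypothetical module, using the 1024-to-24-dimension bottleneck discussed in this note; assumes PyTorch is installed):

```python
import torch
import torch.nn as nn

class Adapter(nn.Module):
    """Bottleneck adapter: down-project, nonlinearity, up-project, plus a residual connection."""
    def __init__(self, dim, bottleneck):
        super().__init__()
        self.down = nn.Linear(dim, bottleneck)  # project to low-dimensional representation
        self.up = nn.Linear(bottleneck, dim)    # project back to the input dimension
        self.act = nn.GELU()

    def forward(self, x):
        # Residual connection keeps the pretrained representation intact at initialization
        return x + self.up(self.act(self.down(x)))

adapter = Adapter(1024, 24)
# Weight parameters: 1024 x 24 + 24 x 1024 = 49,152 (biases add only 24 + 1024 more)
n_weights = sum(p.numel() for n, p in adapter.named_parameters() if "weight" in n)
print(n_weights)  # 49152
```

During finetuning, only these adapter parameters would be trained while the rest of the transformer stays frozen.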
Note that the fully connected layers of the adapters are usually relatively small and have a bottleneck structure similar to autoencoders. Each adapter block's first fully connected layer projects the input down onto a low-dimensional representation. The second fully connected layer projects the input back into the input dimension. How is this parameter efficient? For example, assume the first fully connected layer projects a 1024-dimensional input down to 24 dimensions, and the second fully connected layer projects it back into 1024 dimensions. This means we introduced 1,024 x 24 + 24 x 1,024 = 49,152 weight parameters. In contrast, a single fully connected layer that reprojects a 1024-dimensional input into a 1,024-dimensional space would have 1,024 x 1024 = 1,048,576 parameters.]]></description><link>online-vault/ml-concepts/models/large-language-models/adapters.html</link><guid isPermaLink="false">Online Vault/ML concepts/Models/Large language models/Adapters.md</guid><pubDate>Mon, 27 Jan 2025 09:05:04 GMT</pubDate><enclosure url="." length="0" type="false"/><content:encoded>&lt;figure&gt;&lt;img src=&quot;.&quot;&gt;&lt;/figure&gt;</content:encoded></item><item><title><![CDATA[adapter_layer]]></title><description><![CDATA[<img src="online-vault/images/adapter_layer.png" target="_self">]]></description><link>online-vault/images/adapter_layer.html</link><guid isPermaLink="false">Online Vault/Images/adapter_layer.png</guid><pubDate>Mon, 27 Jan 2025 09:04:50 GMT</pubDate><enclosure url="." length="0" type="false"/><content:encoded>&lt;figure&gt;&lt;img src=&quot;.&quot;&gt;&lt;/figure&gt;</content:encoded></item><item><title><![CDATA[PEFT_adapters]]></title><description><![CDATA[<img src="online-vault/images/peft_adapters.png" target="_self">]]></description><link>online-vault/images/peft_adapters.html</link><guid isPermaLink="false">Online Vault/Images/PEFT_adapters.png</guid><pubDate>Mon, 27 Jan 2025 09:03:12 GMT</pubDate><enclosure url="." 
length="0" type="false"/><content:encoded>&lt;figure&gt;&lt;img src=&quot;.&quot;&gt;&lt;/figure&gt;</content:encoded></item><item><title><![CDATA[Large Language Model]]></title><description><![CDATA[Large Language Models (LLMs) are advanced <a data-href="artificial intelligence" href=".html" class="internal-link" target="_self" rel="noopener nofollow">artificial intelligence</a> systems designed to understand, generate, and interact with human language. These models are trained on vast amounts of text data to capture the nuances of language, enabling them to perform a wide range of <a data-href="natural language processing" href=".html" class="internal-link" target="_self" rel="noopener nofollow">natural language processing</a> (NLP) tasks such as text generation, translation, summarization, and question answering.<br>LLMs are typically built using <a data-href="deep learning" href="online-vault/ml-concepts/deep-learning.html" class="internal-link" target="_self" rel="noopener nofollow">deep learning</a> techniques, particularly <a data-href="transformer" href=".html" class="internal-link" target="_self" rel="noopener nofollow">transformer</a> architectures. They are trained on massive datasets containing billions of words to learn the statistical patterns and structures of language. The training process involves feeding the model large amounts of text and adjusting its parameters to minimize the difference between its predictions and the actual text.Key components of LLMs include:
<br><a data-href="embedding" href=".html" class="internal-link" target="_self" rel="noopener nofollow">embedding</a> Layer: Converts input text into numerical vectors.
<br><a data-href="Transformer" href=".html" class="internal-link" target="_self" rel="noopener nofollow">Transformer</a> Layers: Process these vectors through multiple layers of self-attention and feed-forward neural networks.
Output Layer: Generates the final text output based on the processed vectors. <br>
<a data-href="Transformer" href=".html" class="internal-link" target="_self" rel="noopener nofollow">Transformer</a> Architecture Description: Introduced by Vaswani et al. in 2017, the transformer architecture uses self-attention mechanisms to weigh the importance of different words in a sentence. It consists of an encoder and a decoder, both made up of stacked layers of self-attention and feed-forward networks.
<br>Popular Models: <a data-href="BERT" href=".html" class="internal-link" target="_self" rel="noopener nofollow">BERT</a> (Bidirectional Encoder Representations from Transformers), RoBERTa, T5 (Text-to-Text Transfer Transformer). <br>
<a data-href="BERT" href=".html" class="internal-link" target="_self" rel="noopener nofollow">BERT</a> (Bidirectional Encoder Representations from Transformers) Description: Developed by Google, BERT is designed to understand the context of a word by looking at both the left and right sides of the word. It is pre-trained on a large corpus of text and can be fine-tuned for specific NLP tasks.
Applications: Text classification, named entity recognition, question answering.
<br>T5 (Text-to-Text Transfer Transformer) Description: Also developed by Google, T5 frames all NLP tasks as text-to-text problems. It uses a unified architecture where both the input and output are text, making it highly versatile.
Applications: Machine translation, summarization, text generation. <br>
<a data-href="Hugging Face Transformers library" href="online-vault/software-engineering/hugging-face-transformers-library.html" class="internal-link" target="_self" rel="noopener nofollow">Hugging Face Transformers library</a> Description: An open-source library by Hugging Face that provides pre-trained models for various NLP tasks. It supports a wide range of transformer architectures and allows for easy fine-tuning and deployment.
<br>Features: Pre-trained models, tokenizers, fine-tuning capabilities, integration with popular ML frameworks like <a data-href="PyTorch" href=".html" class="internal-link" target="_self" rel="noopener nofollow">PyTorch</a> and <a data-href="Tensorflow" href=".html" class="internal-link" target="_self" rel="noopener nofollow">Tensorflow</a>. Google's BERT and T5 Description: Google offers pre-trained BERT and T5 models that can be fine-tuned for specific tasks. These models are available through TensorFlow and PyTorch libraries.
Features: Pre-trained models, fine-tuning scripts, integration with Google Cloud services.
<br>Microsoft's Turing-NLG Description: Microsoft's Turing Natural Language Generation (Turing-NLG) is a large language model designed for generating human-like text. It is one of the largest models available, with billions of parameters.
Features: High-quality text generation, integration with Azure services, support for various NLP tasks. <br>Optimizing Large Language Models (LLMs) is crucial for enhancing their performance, efficiency, and applicability to specific tasks. Several techniques are employed to achieve this, including prompt engineering, <a data-href="Retrieval-Augmented Generation" href="online-vault/ml-concepts/retrieval-augmented-generation.html" class="internal-link" target="_self" rel="noopener nofollow">Retrieval-Augmented Generation</a> (RAG), and <a data-href="fine-tuning" href=".html" class="internal-link" target="_self" rel="noopener nofollow">fine-tuning</a>.<br>Definition: <a data-href="Prompt Engineering" href="online-vault/ml-concepts/models/large-language-models/prompt-engineering.html" class="internal-link" target="_self" rel="noopener nofollow">Prompt Engineering</a> involves crafting specific input prompts to guide the model's output more effectively. By carefully designing the input, users can influence the model to generate more relevant and contextually appropriate responses. Techniques:
Zero-Shot Learning: Providing the model with a clear and concise prompt that describes the task without any examples.<br>
<img alt="zero_shot.png" src="online-vault/ml-concepts/models/large-language-models/zero_shot.png" target="_self">
Few-Shot Learning: Including a few examples in the prompt to help the model understand the task better.<br>
<img alt="few_shot.png" src="online-vault/ml-concepts/models/large-language-models/few_shot.png" target="_self">
<br><a data-href="Chain-of-Thought Prompting" href="chain-of-thought-prompting.html" class="internal-link" target="_self" rel="noopener nofollow">Chain-of-Thought Prompting</a>: Breaking down complex tasks into smaller, manageable steps within the prompt to guide the model through the reasoning process. Applications:
Improving the relevance and coherence of generated text.
Enhancing the model's performance on specific tasks without additional training.
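For illustration, zero-shot and few-shot prompts for a sentiment-classification task might look like the following (hypothetical wording, not from any specific model's documentation):

```python
# Zero-shot: describe the task with no examples
zero_shot = (
    "Classify the sentiment of the following review as positive or negative.\n"
    "Review: The battery died after two days.\n"
    "Sentiment:"
)

# Few-shot: include a couple of solved examples before the real query
few_shot = (
    "Classify the sentiment of each review as positive or negative.\n"
    "Review: Absolutely loved it. Sentiment: positive\n"
    "Review: Waste of money. Sentiment: negative\n"
    "Review: The battery died after two days.\n"
    "Sentiment:"
)
print(zero_shot)
print(few_shot)
```

Both strings would be sent to the model as-is; the few-shot variant simply spends extra context tokens to show the expected input/output format.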
Definition: RAG combines the strengths of retrieval-based methods and generative models. It involves retrieving relevant documents or information from a large corpus and using this information to augment the generation process. Components:
Retriever: A model that retrieves relevant documents or passages from a large corpus based on the input query.
Generator: A language model that generates the final output using the retrieved information as additional context.
Applications:
Enhancing the factual accuracy of generated text.
Improving the model's performance on tasks that require external knowledge, such as question answering and summarization.
Definition: Fine-tuning involves taking a pre-trained language model and further training it on a specific dataset to adapt it to a particular task or domain. This process adjusts the model's parameters to better capture the nuances of the target task. Techniques:
Task-Specific Fine-Tuning: Training the model on a labeled dataset specific to the task, such as sentiment analysis or named entity recognition.
Domain-Specific Fine-Tuning: Training the model on a dataset from a specific domain, such as medical or legal texts, to improve its performance in that domain.
Applications:
Improving the model's performance on specialized tasks and domains.
Adapting the model to new or emerging areas where pre-trained models may not perform well.
Optimizing Large Language Models through prompt engineering, Retrieval-Augmented Generation (RAG), and fine-tuning can significantly enhance their performance and applicability.
Prompt engineering allows for more controlled and relevant outputs, RAG improves factual accuracy and contextual understanding, and fine-tuning adapts the model to specific tasks and domains.
By leveraging these techniques, LLMs can be tailored to meet the diverse needs of various applications, from general-purpose text generation to specialized domain-specific tasks.]]></description><link>online-vault/ml-concepts/large-language-model.html</link><guid isPermaLink="false">Online Vault/ML concepts/Large Language Model.md</guid><pubDate>Fri, 17 Jan 2025 15:51:08 GMT</pubDate><enclosure url="." length="0" type="false"/><content:encoded>&lt;figure&gt;&lt;img src=&quot;.&quot;&gt;&lt;/figure&gt;</content:encoded></item><item><title><![CDATA[Retrieval-Augmented Generation]]></title><description><![CDATA[<a href=".?query=tag:models" class="tag is-unresolved" target="_self" rel="noopener nofollow" data-href="#models">#models</a> <a href=".?query=tag:LLM" class="tag is-unresolved" target="_self" rel="noopener nofollow" data-href="#LLM">#LLM</a> <a href=".?query=tag:bibliography" class="tag is-unresolved" target="_self" rel="noopener nofollow" data-href="#bibliography">#bibliography</a><br><img alt="rag_schema.png" src="online-vault/tutoriels/rag_schema.png" target="_self">Retrieval-Augmented Generation (RAG) is a machine learning technique that combines the strengths of two approaches: retrieval and generation. Retrieval models excel at finding relevant information from a large corpus, while generative models are skilled at producing creative text formats.RAG leverages these strengths to create summaries that are both comprehensive and accurate. It starts by retrieving relevant documents from a bibliography database using a retrieval model. The retrieved documents are then processed by a generative model, which generates a summary that captures the key points of the retrieved documents. This summary is then refined and polished to ensure accuracy, clarity, and conciseness.RAG offers several advantages for summarizing bibliographic databases:
Accuracy: RAG ensures that the summary is based on the most relevant and up-to-date information from the bibliography database. The retrieval model first identifies the most relevant documents, and the generative model then focuses on extracting the key points from these documents, so the summary reflects the current state of knowledge on the topic.
Comprehensiveness: RAG captures the key points of multiple documents, providing a more holistic overview than a summary based on a single document. Because the generative model synthesizes information from multiple sources, it helps avoid bias and provides a more balanced perspective.
Creativity: RAG generates summaries in natural language, making them easy to read and understand.
Scalability: RAG can be applied to very large bibliographies, making it ideal for summarizing large research datasets, since the retrieval and generation models can handle large amounts of data efficiently. This matters because research libraries and other organizations are increasingly collecting and curating large bibliographic datasets.
Here's a simplified breakdown of the RAG process for summarizing a bibliography database:
Retrieval: The retrieval model searches the bibliography database for documents related to the specified topic or query.
Extraction: Relevant information from the retrieved documents is extracted and organized into a structured format.
Generation: A generative model is trained on a dataset of human-written summaries. The extracted information from the bibliography database is fed into the generative model, which generates a summary in natural language.
Refinement: The generated summary is reviewed and refined by a human editor to ensure accuracy, clarity, and conciseness.
Retrieval-Augmented Generation has the potential to revolutionize the way we summarize bibliographic databases, making it easier to access and understand the wealth of information contained within these collections.Here are some examples of how RAG can be used to summarize bibliographic databases for reliable research information:
Researchers can use RAG to quickly summarize a large number of articles on a specific topic. This helps them identify the most relevant and important information and avoid wasting time on irrelevant or outdated sources.
Professors can use RAG to create summaries of assigned readings for their students. This helps students quickly grasp the main ideas of the readings and prepare for class discussions or exams.
Libraries can use RAG to create summaries of their bibliographic collections. This helps library users quickly find the information they need and make informed decisions about which resources to explore.
In addition to these specific applications, RAG can play a more general role in improving access to and understanding of research information. By making it easier to summarize large bibliographic datasets, RAG can help researchers, educators, and librarians disseminate knowledge more effectively and efficiently.
Overall, by combining the strengths of retrieval and generation, RAG can produce summaries that are comprehensive, accurate, and easy to read, which makes it a valuable tool for researchers, educators, and librarians working with bibliographic databases.
How to use Retrieval-Augmented Generation (RAG) to query specific information accurately in a textual database:
Formulate a clear and well-structured query.
Choose an appropriate retrieval model for your textual database.
Train/download a generative model on a dataset of human-written summaries.
Utilize the retrieval model to retrieve relevant documents.
Feed the retrieved documents into the trained generative model to generate an accurate summary.
Review the generated summary to ensure it is free of errors and concise.
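The retrieval step in the list above can be sketched with a toy bag-of-words retriever (illustrative only, with made-up documents; real systems use inverted indexes or dense embeddings):

```python
from collections import Counter
import math

# Tiny hypothetical corpus standing in for a bibliography database
docs = {
    "doc1": "retrieval models find relevant documents in a corpus",
    "doc2": "generative models produce fluent natural language text",
    "doc3": "summaries condense the key points of retrieved documents",
}

def bow(text):
    """Bag-of-words term counts."""
    return Counter(text.lower().split())

def cosine(a, b):
    """Cosine similarity between two term-count vectors."""
    common = set(a).intersection(b)
    num = sum(a[t] * b[t] for t in common)
    den = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return num / den if den else 0.0

def retrieve(query, k=2):
    """Rank documents by similarity to the query and return the top k ids."""
    q = bow(query)
    ranked = sorted(docs, key=lambda d: cosine(q, bow(docs[d])), reverse=True)
    return ranked[:k]

print(retrieve("which documents are relevant for retrieval"))  # ['doc1', 'doc3']
```

The retrieved documents would then be passed as extra context to the generative model in step 5.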
There are several factors to consider when choosing a retrieval model for your textual database. These factors include:
The size and type of your corpus: If your corpus is small or relatively unstructured, a simpler retrieval model, such as Boolean retrieval or vector space retrieval, may be sufficient. However, if your corpus is large or contains many complex relationships between documents, you may need a more sophisticated retrieval model, such as latent semantic indexing (LSI) or topic modeling.
The types of queries you want to support: If you need to support complex queries with multiple keywords, you may need a retrieval model that can handle semantic relationships between words. For example, if you are searching for documents about "artificial intelligence," you may want to retrieve documents that contain words like "machine learning," "neural networks," and "expert systems."
The performance requirements you have: If you need to retrieve documents quickly, you may want a retrieval model that is optimized for speed. For example, if you are building a search engine for a web application, you may need a retrieval model that can handle a large number of queries per second.
Your computational resources: Some retrieval models are more computationally expensive than others. If you have limited computational resources, you may want to choose a simpler model that is easier to implement and run.
There are several open source retrieval models available. Some of the most popular include:
Apache Solr: Solr is a popular open source search engine built on the Lucene library. Solr supports a variety of retrieval models, including Boolean retrieval, vector space retrieval, and LSI.
Elasticsearch: Elasticsearch is another popular open source search engine built on the Apache Lucene library. It supports a variety of retrieval models, including Boolean retrieval, vector space retrieval, and LSI.
Whoosh: Whoosh is a smaller, more lightweight open source search library written in pure Python and inspired by Lucene. It supports a variety of retrieval models, including Boolean retrieval and vector space retrieval.
Teiid: Teiid is a Java-based open source data integration platform that supports a variety of data sources, including relational databases, XML documents, and NoSQL databases. Teiid also includes a retrieval layer that can be used to search for data in these sources.
Apache Doris: Apache Doris is a distributed relational database management system optimized for analytical workloads.
In addition to these open source retrieval models, there are also a number of commercial retrieval models available. These typically offer more features and capabilities than open source models, but they may also be more expensive.
Choosing the Right Retrieval Model: The best way to choose the right retrieval model for your textual database is to experiment with a few different models and see which one works best for your specific needs. There are no hard and fast rules for choosing a retrieval model, so it is important to evaluate different models based on your own criteria.]]></description><link>online-vault/ml-concepts/retrieval-augmented-generation.html</link><guid isPermaLink="false">Online Vault/ML concepts/Retrieval-Augmented Generation.md</guid><pubDate>Fri, 17 Jan 2025 15:50:40 GMT</pubDate><enclosure url="." 
length="0" type="false"/><content:encoded>&lt;figure&gt;&lt;img src=&quot;.&quot;&gt;&lt;/figure&gt;</content:encoded></item><item><title><![CDATA[Software as a Service]]></title><description><![CDATA[SaaS, or "Software as a Service", is a software distribution model in which applications are hosted by a service provider and made available to customers over the Internet. Users access the software through a web browser or a client application, without needing to install or maintain the software on their own machines. Common examples of SaaS include email services, customer relationship management (CRM) tools, and online office suites. The link between SaaS and APIs (Application Programming Interfaces) is crucial: APIs allow SaaS applications to communicate with other software and services. Some key points about this link:
Integration: APIs allow SaaS applications to integrate with other systems and services. For example, a SaaS CRM can use an API to connect to an email service in order to send automated emails.
Extensibility: APIs allow developers to extend the functionality of a SaaS application. For example, a company can use an API to add features specific to its needs to an existing SaaS application.
Automation: APIs enable the automation of processes across different applications. For example, an API can be used to automatically synchronize data between an inventory management system and an accounting system.
Personalization: APIs let users personalize their experience with a SaaS application. For example, a company can use an API to create custom reports or dashboards tailored to its needs.
In short, APIs play an essential role in the operation and effectiveness of SaaS applications by enabling integration, extensibility, automation, and personalization.]]></description><link>online-vault/software-engineering/software-as-a-service.html</link><guid isPermaLink="false">Online Vault/Software Engineering/Software as a Service.md</guid><pubDate>Thu, 16 Jan 2025 10:30:11 GMT</pubDate></item><item><title><![CDATA[Tool Calling]]></title><description><![CDATA[<a data-tooltip-position="top" aria-label="https://www.youtube.com/watch?v=h8gMhXYAv1k" rel="noopener nofollow" class="external-link is-unresolved" href="https://www.youtube.com/watch?v=h8gMhXYAv1k" target="_self">Tool Calling IBM</a><br><img alt="tool_calling.png" src="online-vault/images/tool_calling.png" target="_self">Tool calling, also known as&nbsp;function calling, is a structured way to give LLMs the ability to make requests back to the application that called them. You define the tools you want to make available to the model, and the model will make tool requests to your app as necessary to fulfill the prompts you give it. The use cases of tool calling generally fall into a few themes:
Giving an LLM access to information it wasn't trained with:
Frequently changing information, such as a stock price or the current weather.
Information specific to your app domain, such as product information or user profiles.
<br>Note the overlap with&nbsp;<a data-tooltip-position="top" aria-label="https://firebase.google.com/docs/genkit/rag" rel="noopener nofollow" class="external-link is-unresolved" href="https://firebase.google.com/docs/genkit/rag" target="_self">retrieval augmented generation</a>&nbsp;(<a data-href="Retrieval-Augmented Generation" href="online-vault/ml-concepts/retrieval-augmented-generation.html" class="internal-link" target="_self" rel="noopener nofollow">Retrieval-Augmented Generation</a>), which is also a way to let an LLM integrate factual information into its generations. RAG is a heavier solution that is most suited when you have a large amount of information or the information that's most relevant to a prompt is ambiguous. On the other hand, if retrieving the information the LLM needs is a simple function call or database lookup, tool calling is more appropriate.<br>Tool calling enhances <a data-href="large language models" href=".html" class="internal-link" target="_self" rel="noopener nofollow">large language models</a> by attaching external tools to them, such as web search, a code interpreter, maps, weather, etc. A tool definition consists of:
name
description
input
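As a sketch, a tool definition with these three fields, and the structured tool call the model emits in response, might look like the following (hypothetical get_weather tool, written as plain dicts rather than any specific framework's schema format):

```python
# Hypothetical tool definition: name, description, and input parameters
weather_tool = {
    "name": "get_weather",
    "description": "Get the current weather for a city.",
    "parameters": {
        "type": "object",
        "properties": {
            "city": {"type": "string", "description": "City name, e.g. 'Paris'"}
        },
        "required": ["city"],
    },
}

# The model does not run the tool; it emits a structured call like this,
# which your application parses and executes
tool_call = {"name": "get_weather", "arguments": {"city": "Paris"}}

print(tool_call["name"], tool_call["arguments"])
```

Your application matches the emitted name against its tool list, runs the real function with the given arguments, and feeds the result back to the model.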
Downsides of traditional tool calling:
The LLM can make up an incorrect tool call
The LLM can still hallucinate
Contrary to the term, in tool calling the LLMs do not call the tool/function in the literal sense; instead, they generate a structured schema of the tool.
The tool-calling feature enables the LLMs to accept the tool schema definition. A tool schema contains the names, parameters, and descriptions of tools.
When you ask the LLM a question that requires tool assistance, the model looks at the tools it has, and if a relevant one is found based on the tool name and description, it halts the text generation and outputs a structured response.
This response, usually a JSON object, contains the tool's name and the parameter values deemed fit by the LLM. Now, you can use this information to execute the original function and pass the output back to the LLM for a complete answer. Here's the workflow in simple words:
Define a weather tool and ask a question, for example: what's the weather like in NY?
The model halts text generation and emits a structured tool call with parameter values.
Extract Tool Input, Run Code, and Return Outputs.
The model generates a complete answer using the tool outputs.
<br>This is what tool calling is. For an in-depth guide on using tool calling with agents in open-source Llama 3, check out this blog post:&nbsp;<a data-tooltip-position="top" aria-label="https://composio.dev/blog/tool-calling-in-llama-3-a-guide-to-build-agents/" rel="noopener nofollow" class="external-link is-unresolved" href="https://composio.dev/blog/tool-calling-in-llama-3-a-guide-to-build-agents/" target="_self">Tool calling in Llama 3: A step-by-step guide to build agents.</a><br>The provided code demonstrates tool calling with a weather bot. The bot uses the functions&nbsp;get_current_temperature&nbsp;and&nbsp;get_current_wind_speed&nbsp;to retrieve (simulated) weather data. The LLM infers the user wants the temperature in Celsius and suggests calling&nbsp;get_current_temperature. The code then demonstrates capturing the LLM's tool call, calling the function with its arguments, and adding the result (a dummy value of 22.0) to the conversation history. Finally, it shows how to continue the conversation by feeding the updated history back to the LLM.
# Define helper functions for weather (replace with actual data retrieval)
def get_current_temperature(location: str, unit: str) -&gt; float:
    """
    Simulates getting the current temperature at a location.

    Args:
        location: The location to get the temperature for, in the format "City, Country".
        unit: The unit to return the temperature in (e.g., "celsius", "fahrenheit").

    Returns:
        A dummy value (22.0) for demonstration purposes.
        Replace this with actual temperature retrieval logic.
    """
    return 22.0  # Replace with actual temperature retrieval

def get_current_wind_speed(location: str) -&gt; float:
    """
    Simulates getting the current wind speed at a location.

    Args:
        location: The location to get the wind speed for, in the format "City, Country".

    Returns:
        A dummy value (6.0) for demonstration purposes.
        Replace this with actual wind speed retrieval logic.
    """
    return 6.0  # Replace with actual wind speed retrieval

# Define your tool list (functions the LLM can call)
tools = [get_current_temperature, get_current_wind_speed]

# Assuming you have a pre-loaded tokenizer and model (replace with your actual loading logic)
# ... tokenizer and model are loaded ...

# Set up the conversation history
messages = [
    {"role": "system", "content": "You are a bot that responds to weather queries. You should reply with the unit used in the queried location."},
    {"role": "user", "content": "Hey, what's the temperature in Paris right now?"}
]

# Prepare the model input with conversation history, tools, and additional arguments
inputs = tokenizer.apply_chat_template(
    messages,
    chat_template="tool_use",  # Specify the tool calling template
    tools=tools,
    add_generation_prompt=True,  # Add a prompt to guide the LLM
    return_dict=True,
    return_tensors="pt"
)

# Move the inputs to the device the model is on (if using GPU)
inputs = {k: v.to(model.device) for k, v in inputs.items()}

# Generate the LLM response with the prepared inputs
out = model.generate(**inputs, max_new_tokens=128)

# Decode the LLM output and print it
print(tokenizer.decode(out[0][len(inputs["input_ids"][0]):]))

# Simulate the LLM suggesting a tool call (based on the conversation)
tool_call_id = "vAHdf3"  # Random and unique ID for each tool call
tool_call = {
    "name": "get_current_temperature",
    "arguments": {"location": "Paris, France", "unit": "celsius"}
}

# Add the suggested tool call to the conversation history
messages.append({
    "role": "assistant",
    "tool_calls": [{"id": tool_call_id, "type": "function", "function": tool_call}]
})

# Simulate executing the tool call (replace with actual function call)
# This retrieves the "dummy" temperature using the defined function
temperature = get_current_temperature(tool_call["arguments"]["location"], tool_call["arguments"]["unit"])

# Add the tool call result to the conversation history
messages.append({
    "role": "tool",
    "tool_call_id": tool_call_id,
    "name": "get_current_temperature",
    "content": str(temperature)  # Convert temperature to string
})

# Prepare the model input again with the updated conversation history
inputs = tokenizer.apply_chat_template(
    messages,
    chat_template="tool_use",
    tools=tools,
    add_generation_prompt=True,
    return_dict=True,
    return_tensors="pt"
)

# Move the inputs to the device again
inputs = {k: v.to(model.device) for k, v in inputs.items()}

# Generate the final LLM response with the updated conversation
out = model.generate(**inputs, max_new_tokens=128)

# Decode and print the final LLM response (including temperature)
print(tokenizer.decode(out[0][len(inputs["input_ids"][0]):]))
Tool calling allows LLMs to:
Access real-time information (e.g., weather data)
Perform calculations (e.g., calculators)
Interact with external databases and services
Expand their capabilities beyond stored knowledge
This makes LLMs more versatile and powerful for various applications.
Not all LLMs support tool calling. Check the documentation of your specific model.
Libraries like LangChain provide tools and functionalities to simplify LLM communication and tool calling.
This tutorial focused on function-based tools. Depending on the LLM, other types of tools (like web APIs) might be supported as well.
]]></description><link>online-vault/ml-concepts/tool-calling.html</link><guid isPermaLink="false">Online Vault/ML concepts/Tool Calling.md</guid><pubDate>Thu, 16 Jan 2025 09:44:52 GMT</pubDate><enclosure url="." length="0" type="false"/><content:encoded>&lt;figure&gt;&lt;img src=&quot;.&quot;&gt;&lt;/figure&gt;</content:encoded></item><item><title><![CDATA[tool_calling]]></title><description><![CDATA[<img src="online-vault/images/tool_calling.png" target="_self">]]></description><link>online-vault/images/tool_calling.html</link><guid isPermaLink="false">Online Vault/Images/tool_calling.png</guid><pubDate>Thu, 16 Jan 2025 09:39:33 GMT</pubDate><enclosure url="." length="0" type="false"/><content:encoded>&lt;figure&gt;&lt;img src=&quot;.&quot;&gt;&lt;/figure&gt;</content:encoded></item><item><title><![CDATA[unsupervised learning]]></title><link>online-vault/ml-concepts/unsupervised-learning.html</link><guid isPermaLink="false">Online Vault/ML concepts/unsupervised learning.md</guid><pubDate>Mon, 06 Jan 2025 14:35:42 GMT</pubDate></item><item><title><![CDATA[Keras]]></title><description><![CDATA[Intro: <a rel="noopener nofollow" class="external-link is-unresolved" href="https://keras.io/getting_started/intro_to_keras_for_engineers/" target="_self">https://keras.io/getting_started/intro_to_keras_for_engineers/</a><br>Keras 3 is a deep learning framework that works with TensorFlow, JAX, and PyTorch interchangeably. It walks through key Keras 3 workflows. All Keras models can be trained and evaluated on a wide variety of data sources, independently of the backend you're using. This includes:
NumPy arrays
Pandas dataframes
<br>TensorFlow&nbsp;<a data-tooltip-position="top" aria-label="https://www.tensorflow.org/api_docs/python/tf/data/Dataset" rel="noopener nofollow" class="external-link is-unresolved" href="https://www.tensorflow.org/api_docs/python/tf/data/Dataset" target="_self"><code>tf.data.Dataset</code></a>&nbsp;objects
PyTorch&nbsp;DataLoader&nbsp;objects
Keras&nbsp;PyDataset&nbsp;objects
The Keras project isn't limited to the core Keras API for building and training neural networks. It spans a wide range of related initiatives that cover every step of the machine learning workflow.<br><a data-tooltip-position="top" aria-label="https://keras.io/keras_tuner/" rel="noopener nofollow" class="external-link is-unresolved" href="https://keras.io/keras_tuner/" target="_self">KerasTuner Documentation</a>&nbsp;-&nbsp;<a data-tooltip-position="top" aria-label="https://github.com/keras-team/keras-tuner" rel="noopener nofollow" class="external-link is-unresolved" href="https://github.com/keras-team/keras-tuner" target="_self">KerasTuner GitHub repository</a>KerasTuner is an easy-to-use, scalable hyperparameter optimization framework that solves the pain points of hyperparameter search. Easily configure your search space with a define-by-run syntax, then leverage one of the available search algorithms to find the best hyperparameter values for your models. KerasTuner comes with Bayesian Optimization, Hyperband, and Random Search algorithms built-in, and is also designed to be easy for researchers to extend in order to experiment with new search algorithms.<br><a data-tooltip-position="top" aria-label="https://keras.io/keras_hub/" rel="noopener nofollow" class="external-link is-unresolved" href="https://keras.io/keras_hub/" target="_self">KerasHub Documentation</a>&nbsp;-&nbsp;<a data-tooltip-position="top" aria-label="https://github.com/keras-team/keras-hub" rel="noopener nofollow" class="external-link is-unresolved" href="https://github.com/keras-team/keras-hub" target="_self">KerasHub GitHub repository</a>KerasHub is a natural language processing library that supports users through their entire development cycle. 
Our workflows are built from modular components that have state-of-the-art preset weights and architectures when used out-of-the-box and are easily customizable when more control is needed.<br><a data-tooltip-position="top" aria-label="https://keras.io/keras_cv/" rel="noopener nofollow" class="external-link is-unresolved" href="https://keras.io/keras_cv/" target="_self">KerasCV Documentation</a>&nbsp;-&nbsp;<a data-tooltip-position="top" aria-label="https://github.com/keras-team/keras-cv" rel="noopener nofollow" class="external-link is-unresolved" href="https://github.com/keras-team/keras-cv" target="_self">KerasCV GitHub repository</a>KerasCV is a repository of modular building blocks (layers, metrics, losses, data-augmentation) that applied computer vision engineers can leverage to quickly assemble production-grade, state-of-the-art training and inference pipelines for common use cases such as image classification, object detection, image segmentation, image data augmentation, etc.KerasCV can be understood as a horizontal extension of the Keras API: the components are new first-party Keras objects (layers, metrics, etc) that are too specialized to be added to core Keras, but that receive the same level of polish and backwards compatibility guarantees as the rest of the Keras API.<br><a data-tooltip-position="top" aria-label="https://autokeras.com/" rel="noopener nofollow" class="external-link is-unresolved" href="https://autokeras.com/" target="_self">AutoKeras Documentation</a>&nbsp;-&nbsp;<a data-tooltip-position="top" aria-label="https://github.com/keras-team/autokeras" rel="noopener nofollow" class="external-link is-unresolved" href="https://github.com/keras-team/autokeras" target="_self">AutoKeras GitHub repository</a><br>AutoKeras is an AutoML system based on Keras. 
It is developed by&nbsp;<a data-tooltip-position="top" aria-label="http://faculty.cs.tamu.edu/xiahu/index.html" rel="noopener nofollow" class="external-link is-unresolved" href="http://faculty.cs.tamu.edu/xiahu/index.html" target="_self">DATA Lab</a>&nbsp;at Texas A&amp;M University. The goal of AutoKeras is to make machine learning accessible for everyone. It provides high-level end-to-end APIs such as&nbsp;<a data-tooltip-position="top" aria-label="https://autokeras.com/tutorial/image_classification/" rel="noopener nofollow" class="external-link is-unresolved" href="https://autokeras.com/tutorial/image_classification/" target="_self"><code>ImageClassifier</code></a>&nbsp;or&nbsp;<a data-tooltip-position="top" aria-label="https://autokeras.com/tutorial/text_classification/" rel="noopener nofollow" class="external-link is-unresolved" href="https://autokeras.com/tutorial/text_classification/" target="_self"><code>TextClassifier</code></a>&nbsp;to solve machine learning problems in a few lines, as well as&nbsp;<a data-tooltip-position="top" aria-label="https://autokeras.com/tutorial/customized/" rel="noopener nofollow" class="external-link is-unresolved" href="https://autokeras.com/tutorial/customized/" target="_self">flexible building blocks</a>&nbsp;to perform architecture search.<br>import autokeras as ak<br>clf = ak.ImageClassifier()<br>clf.fit(x_train, y_train)<br>results = clf.predict(x_test)]]></description><link>online-vault/software-engineering/keras.html</link><guid isPermaLink="false">Online Vault/Software Engineering/Keras.md</guid><pubDate>Mon, 09 Dec 2024 09:35:01 GMT</pubDate></item><item><title><![CDATA[Guide - Utiliser Mistral]]></title><description><![CDATA[Step 0: <a rel="noopener nofollow" class="external-link is-unresolved" href="https://mistral.ai/fr/" target="_self">https://mistral.ai/fr/</a>
and create an account.<br><img alt="Mistral_accueil.png" src="online-vault/tutoriels/mistral_accueil.png" target="_self">By default, the model used is a textual language model.<br><img alt="Le_chat.png" src="online-vault/tutoriels/le_chat.png" target="_self">
There are several new features, described in detail in the following post:<br>
<a rel="noopener nofollow" class="external-link is-unresolved" href="https://mistral.ai/fr/news/mistral-chat/" target="_self">https://mistral.ai/fr/news/mistral-chat/</a>
Canvas: an ideation tool where you can edit text documents in collaboration with a model, with an ergonomic interface that lets you edit the generated content directly in the window.
Web Search: uses information found on the web to improve answers, with the source websites cited.
<br>Image generation: uses the latest open-source Flux model to generate images, in collaboration with <a rel="noopener nofollow" class="external-link is-unresolved" href="https://blackforestlabs.ai/" target="_self">https://blackforestlabs.ai/</a>
Agents: use an Agent to automate repetitive tasks.
Basic Retrieval-Augmented Generation:
You can attach images or PDF documents to the prompt to query text documents or analyze images.<br>
<img alt="mistral_pj.png" src="online-vault/tutoriels/mistral_pj.png" target="_self"><br><img alt="La_plateforme_mistral.png" src="online-vault/tutoriels/la_plateforme_mistral.png" target="_self"><br>
La Plateforme (<a rel="noopener nofollow" class="external-link is-unresolved" href="https://console.mistral.ai/" target="_self">https://console.mistral.ai/</a>) is a web interface that, among other things, lets you create conversational agents for specific tasks, which you can then mention in the chat. It also lets you fine-tune models on your own data on Mistral's compute servers, free of charge (for now). Start by opening the "Billing" tab on the left, under Workspace:<br>
<img alt="billing_mistral.png" src="online-vault/tutoriels/billing_mistral.png" target="_self">
Then enable the "Experiment" feature to be able to deploy conversational agents on the platform for free.<br><a data-tooltip-position="top" aria-label="https://docs.mistral.ai/capabilities/agents/" rel="noopener nofollow" class="external-link is-unresolved" href="https://docs.mistral.ai/capabilities/agents/" target="_self">Agent documentation</a>: Info AI agents are autonomous systems powered by large language models (LLMs) that, given high-level instructions, can plan, use tools, carry out steps of processing, and take actions to achieve specific goals. These agents leverage advanced natural language processing capabilities to understand and execute complex tasks efficiently and can even collaborate with each other to achieve more sophisticated outcomes.
Click "Create an Agent" on the home page to reach this interface:<br><img alt="Pasted image 20241204152811.png" src="online-vault/tutoriels/pasted-image-20241204152811.png" target="_self">Here you can test different models. The models deployed in Le Chat can be used for free.<br>Choose a model suited to your problem; the list of models and their capabilities is described here: <a rel="noopener nofollow" class="external-link is-unresolved" href="https://docs.mistral.ai/getting-started/models/models_overview/" target="_self">https://docs.mistral.ai/getting-started/models/models_overview/</a> For example, for text and image processing, the Pixtral model is a good fit.
Hover over the (i) icons next to the parameters for an explanation of each. Once the specific instructions are entered, deploy the model by checking the "Le Chat" box so it can be used in Le Chat for free. An example from the documentation is the following:
A model that only answers in French.<br><img alt="french_agent.png" src="online-vault/tutoriels/french_agent.png" target="_self"><br>Other use cases are shown here: <a rel="noopener nofollow" class="external-link is-unresolved" href="https://docs.mistral.ai/capabilities/agents/#use-cases" target="_self">https://docs.mistral.ai/capabilities/agents/#use-cases</a> For example, I created a "Table reader" agent whose purpose is to extract tabular data from a photo of handwriting and convert it to CSV using the Pixtral model. The temperature is set to zero because we do not want any randomness in the generation here:<br><img alt="Table_reader.png" src="online-vault/tutoriels/table_reader.png" target="_self">You can then use it in Le Chat with @Table reader:<br>
<img alt="table_reader_ex.png" src="online-vault/tutoriels/table_reader_ex.png" target="_self">If the results are incorrect, you can refine the request by providing examples, or refine the question until the result is satisfactory.]]></description><link>online-vault/tutoriels/guide-utiliser-mistral.html</link><guid isPermaLink="false">Online Vault/Tutoriels/Guide - Utiliser Mistral.md</guid><pubDate>Wed, 04 Dec 2024 14:46:27 GMT</pubDate><enclosure url="." length="0" type="false"/><content:encoded>&lt;figure&gt;&lt;img src=&quot;.&quot;&gt;&lt;/figure&gt;</content:encoded></item><item><title><![CDATA[table_reader_ex]]></title><description><![CDATA[<img src="online-vault/tutoriels/table_reader_ex.png" target="_self">]]></description><link>online-vault/tutoriels/table_reader_ex.html</link><guid isPermaLink="false">Online Vault/Tutoriels/table_reader_ex.png</guid><pubDate>Wed, 04 Dec 2024 14:42:56 GMT</pubDate><enclosure url="." length="0" type="false"/><content:encoded>&lt;figure&gt;&lt;img src=&quot;.&quot;&gt;&lt;/figure&gt;</content:encoded></item><item><title><![CDATA[Table_reader]]></title><description><![CDATA[<img src="online-vault/tutoriels/table_reader.png" target="_self">]]></description><link>online-vault/tutoriels/table_reader.html</link><guid isPermaLink="false">Online Vault/Tutoriels/Table_reader.png</guid><pubDate>Wed, 04 Dec 2024 14:41:30 GMT</pubDate><enclosure url="." length="0" type="false"/><content:encoded>&lt;figure&gt;&lt;img src=&quot;.&quot;&gt;&lt;/figure&gt;</content:encoded></item><item><title><![CDATA[french_agent]]></title><description><![CDATA[<img src="online-vault/tutoriels/french_agent.png" target="_self">]]></description><link>online-vault/tutoriels/french_agent.html</link><guid isPermaLink="false">Online Vault/Tutoriels/french_agent.png</guid><pubDate>Wed, 04 Dec 2024 14:36:36 GMT</pubDate><enclosure url="."
length="0" type="false"/><content:encoded>&lt;figure&gt;&lt;img src=&quot;.&quot;&gt;&lt;/figure&gt;</content:encoded></item><item><title><![CDATA[Pasted image 20241204152811]]></title><description><![CDATA[<img src="online-vault/tutoriels/pasted-image-20241204152811.png" target="_self">]]></description><link>online-vault/tutoriels/pasted-image-20241204152811.html</link><guid isPermaLink="false">Online Vault/Tutoriels/Pasted image 20241204152811.png</guid><pubDate>Wed, 04 Dec 2024 14:28:11 GMT</pubDate><enclosure url="." length="0" type="false"/><content:encoded>&lt;figure&gt;&lt;img src=&quot;.&quot;&gt;&lt;/figure&gt;</content:encoded></item><item><title><![CDATA[billing_mistral]]></title><description><![CDATA[<img src="online-vault/tutoriels/billing_mistral.png" target="_self">]]></description><link>online-vault/tutoriels/billing_mistral.html</link><guid isPermaLink="false">Online Vault/Tutoriels/billing_mistral.png</guid><pubDate>Wed, 04 Dec 2024 14:26:53 GMT</pubDate><enclosure url="." length="0" type="false"/><content:encoded>&lt;figure&gt;&lt;img src=&quot;.&quot;&gt;&lt;/figure&gt;</content:encoded></item><item><title><![CDATA[La_plateforme_mistral]]></title><description><![CDATA[<img src="online-vault/tutoriels/la_plateforme_mistral.png" target="_self">]]></description><link>online-vault/tutoriels/la_plateforme_mistral.html</link><guid isPermaLink="false">Online Vault/Tutoriels/La_plateforme_mistral.png</guid><pubDate>Wed, 04 Dec 2024 14:23:49 GMT</pubDate><enclosure url="." length="0" type="false"/><content:encoded>&lt;figure&gt;&lt;img src=&quot;.&quot;&gt;&lt;/figure&gt;</content:encoded></item><item><title><![CDATA[mistral_pj]]></title><description><![CDATA[<img src="online-vault/tutoriels/mistral_pj.png" target="_self">]]></description><link>online-vault/tutoriels/mistral_pj.html</link><guid isPermaLink="false">Online Vault/Tutoriels/mistral_pj.png</guid><pubDate>Wed, 04 Dec 2024 14:19:40 GMT</pubDate><enclosure url="." 
length="0" type="false"/><content:encoded>&lt;figure&gt;&lt;img src=&quot;.&quot;&gt;&lt;/figure&gt;</content:encoded></item><item><title><![CDATA[Le_chat]]></title><description><![CDATA[<img src="online-vault/tutoriels/le_chat.png" target="_self">]]></description><link>online-vault/tutoriels/le_chat.html</link><guid isPermaLink="false">Online Vault/Tutoriels/Le_chat.png</guid><pubDate>Wed, 04 Dec 2024 14:06:22 GMT</pubDate><enclosure url="." length="0" type="false"/><content:encoded>&lt;figure&gt;&lt;img src=&quot;.&quot;&gt;&lt;/figure&gt;</content:encoded></item><item><title><![CDATA[Mistral_accueil]]></title><description><![CDATA[<img src="online-vault/tutoriels/mistral_accueil.png" target="_self">]]></description><link>online-vault/tutoriels/mistral_accueil.html</link><guid isPermaLink="false">Online Vault/Tutoriels/Mistral_accueil.png</guid><pubDate>Wed, 04 Dec 2024 14:05:16 GMT</pubDate><enclosure url="." length="0" type="false"/><content:encoded>&lt;figure&gt;&lt;img src=&quot;.&quot;&gt;&lt;/figure&gt;</content:encoded></item><item><title><![CDATA[A deep-learning framework for enhancing habitat identification based on species composition]]></title><description><![CDATA[<a data-tooltip-position="top" aria-label="https://www.semanticscholar.org/paper/A-deep%E2%80%90learning-framework-for-enhancing-habitat-on-Leblanc-Bonnet/0838ec47599c7ddf62407b2601f2b12d3588cbe0" rel="noopener nofollow" class="external-link is-unresolved" href="https://www.semanticscholar.org/paper/A-deep%E2%80%90learning-framework-for-enhancing-habitat-on-Leblanc-Bonnet/0838ec47599c7ddf62407b2601f2b12d3588cbe0" target="_self">paper link</a>The European Nature Information System (EUNIS) habitat classification is a comprehensive pan-European system designed to facilitate the harmonized description and collection of data across Europe.
It covers all types of habitats, from natural to artificial, and spans terrestrial, freshwater, and marine environments. The classification uses criteria for habitat identification to ensure consistency and comparability of data across different regions and habitat types.<br>The <a data-tooltip-position="top" aria-label="https://zenodo.org/records/3841729" rel="noopener nofollow" class="external-link is-unresolved" href="https://zenodo.org/records/3841729" target="_self">EUNIS-ESy</a> expert model is an automated classification system designed to categorize European vegetation plots into habitat types defined by the <a data-tooltip-position="top" aria-label="https://hal.science/hal-03048264/" rel="noopener nofollow" class="external-link is-unresolved" href="https://hal.science/hal-03048264/" target="_self">EUNIS Habitat Classification</a>.
This system was developed within a contract from the European Environment Agency to Wageningen Environmental Research and Masaryk University. The EUNIS-ESy model uses definitions of individual EUNIS habitats based on their species composition and geographic location, allowing for the classification of vegetation plots into these habitats.<br>The <a data-tooltip-position="top" aria-label="https://euroveg.org/eva-database/" rel="noopener nofollow" class="external-link is-unresolved" href="https://euroveg.org/eva-database/" target="_self">European Vegetation Archive</a> (EVA) is a centralized database of European vegetation plots developed by the IAVS Working Group European Vegetation Survey. It was established in 2012 and became available for research projects in 2014.
EVA integrates national and regional vegetation-plot databases into a single software platform, providing a unified repository of vegetation data for Europe. This archive is crucial for studying plant diversity and habitat changes over time, supporting nature conservation and ecological research across the continent.The raw data in the European Vegetation Archive (EVA) dataset typically consists of detailed records of vegetation plots. Each plot record includes a list of plant species present in a specific small area, usually supplemented by an estimate of the cover of each species. The data may also include geographic coordinates, environmental variables, and other relevant metadata. This structured information allows for comprehensive analysis and classification of vegetation types across Europe.<br><img alt="hdm_framework.png" src="online-vault/papers/hdm_framework.png" target="_self">The study aims to enhance habitat identification in Europe using deep learning techniques. The goal is to develop models capable of assigning vegetation-plot records to the habitats of the European Nature Information System (EUNIS).
Training Data: European Vegetation Archive (EVA), containing 886,260 georeferenced plots (~20 species/plot) covering 10,481 different species and 228 different habitats. EVA is a data repository of vegetation-plot observations (i.e., records of plant taxon co-occurrence and cover-abundance at particular sites, in plots ranging from 1 m2 to a few hundred m2, collected by vegetation scientists) from Europe and adjacent areas.
Test Data: National Plant Monitoring Scheme (NPMS), an independent data set used to evaluate model performance.
<br>Compute : This work was granted access to the High-Performance Computing (HPC) resources of <a data-tooltip-position="top" aria-label="http://www.idris.fr/eng/info/gestion/demandes-heures-eng.html" rel="noopener nofollow" class="external-link is-unresolved" href="http://www.idris.fr/eng/info/gestion/demandes-heures-eng.html" target="_self">IDRIS</a> (Institut du Développement et des Ressources en Informatique Scientifique) on the Jean Zay Supercomputer.
EUNIS habitats follow a hierarchical classification with multiple levels. The study focused on eight habitat groups (level-one EUNIS habitats), usually referred to by their 2020 codes; the data were spread across them as follows:
Littoral biogenic habitats (MA2) — 31,533 vegetation plots;
Coastal habitats (N) — 37,574 vegetation plots;
Wetlands (Q) — 94,100 vegetation plots;
Grasslands and lands dominated by forbs, mosses or lichens (R) — 298,816 vegetation plots;
Heathlands, scrub and tundra (S) — 67,494 vegetation plots;
Forests and other wooded land (T) — 251,474 vegetation plots;
Inland habitats with no or little soil and mostly with sparse vegetation (U) — 8,018 vegetation plots;
Vegetated man-made habitats (V) — 97,251 vegetation plots.
<br><img alt="Pasted image 20241128111651.png" src="online-vault/papers/pasted-image-20241128111651.png" target="_self">
Observations:
Grasslands and lands dominated by forbs, mosses or lichens (R) have the most plots at 298,816
Forests and other wooded land (T) follow with 251,474 plots
Inland habitats with no or little soil (U) have the fewest plots, at 8,018.
Data Preprocessing:
EVA Data: Includes species co-occurrence and cover-abundance estimates.
NPMS Data: Used for model validation, incorporating citizen science data.
Model Selection and Training:
Validation: Spatial block holdout procedure for ten-fold cross-validation to mitigate spatial autocorrelation.
Models Evaluated: Multi-Layer Perceptron (MLP)
Random Forest Classifier (RFC)
<br>eXtreme Gradient Boosting (<a data-href="XGBoost" href="online-vault/ml-concepts/models/xgboost.html" class="internal-link" target="_self" rel="noopener nofollow">XGBoost</a>)
<br><a data-tooltip-position="top" aria-label="https://arxiv.org/pdf/1908.07442" rel="noopener nofollow" class="external-link is-unresolved" href="https://arxiv.org/pdf/1908.07442" target="_self">TabNet Classifier</a> (TNC)
<br><a data-tooltip-position="top" aria-label="https://www.semanticscholar.org/reader/5fa06d856ba6ae9cd1366888f8134d7fd0db75b9" rel="noopener nofollow" class="external-link is-unresolved" href="https://www.semanticscholar.org/reader/5fa06d856ba6ae9cd1366888f8134d7fd0db75b9" target="_self">Feature Tokenizer+Transformer</a> (FTT)
Feature Encoding:
Cover-abundance
Presence/absence
Reciprocal rank
Noise Addition: Introduced 30% dropout to enhance robustness.
Evaluation Metrics:
Top-one micro-average multiclass accuracy
Top-three accuracy
Deep Learning Application: First application of deep-learning techniques for EUNIS habitat classification.
Model Comparison: Rigorous comparison of various machine and deep-learning models.
Feature Encoding: Exploration of different encoding techniques to optimize model performance.
Noise Robustness: Incorporation of controlled noise to improve model generalization.
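As a rough illustration of the controlled-noise idea (toy data, not the paper's code), each present species in a presence/absence matrix is dropped with probability 0.3:

```python
import numpy as np

# Sketch of the 30% species-dropout noise described above (toy data):
# each present species has a 30% chance of being treated as absent.
rng = np.random.default_rng(42)
plots = (rng.random((1000, 50)) < 0.2).astype(float)  # presence/absence matrix

def dropout_species(X, p=0.3, rng=rng):
    """Randomly set present species to absent with probability p."""
    keep = rng.random(X.shape) >= p
    return X * keep

noisy = dropout_species(plots)
# Roughly 30% of the presences should have been removed.
frac_removed = 1 - noisy.sum() / plots.sum()
```

This simulates incomplete field surveys at evaluation time, pushing models to rely on patterns that transfer rather than on any single species.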
The study evaluated several machine learning and deep learning models to identify the most effective approach for habitat classification. The models were chosen based on their performance in handling tabular data and their ability to capture complex patterns. The models evaluated include:
Multi-Layer Perceptron (MLP): Description: A fully connected feedforward artificial neural network.
Working: Passes input data through multiple layers of interconnected nodes with weighted connections and activation functions.
Performance: Achieved high accuracy, particularly with reciprocal rank encoding.
Random Forest Classifier (RFC): Description: An ensemble of decision trees that uses bagging to improve predictive accuracy and control overfitting.
Working: Recursively partitions input data based on feature values to create a tree-like structure.
Performance: Showed good performance but was generally outperformed by deep learning models.
eXtreme Gradient Boosting (XGB): Description: An optimized gradient-boosting algorithm that iteratively trains decision trees to reduce residual errors.
Working: Uses gradient descent optimization, regularization techniques, and hardware-aware optimization.
Performance: Demonstrated high accuracy but required significantly longer training times. <br>
<a data-tooltip-position="top" aria-label="https://github.com/dreamquark-ai/tabnet" rel="noopener nofollow" class="external-link is-unresolved" href="https://github.com/dreamquark-ai/tabnet" target="_self">TabNet Classifier</a> (TNC): Description: A high-performance and interpretable deep tabular data learning architecture.
Working: Selectively attends to informative features using a sparse masking technique and employs a multistep decision-making process.
Performance: Showed promising results but did not consistently outperform other models.
Feature Tokenizer+Transformer (FTT): Description: Transforms all features to embeddings and applies a stack of transformer layers.
Working: Converts features to tokens and processes them through transformer layers.
Performance: Effective but did not outperform the MLP in most scenarios.
The training procedure involved several key steps to ensure robust and generalizable models:
Spatial Cross-Validation: Method: Ten-fold cross-validation with spatial block holdout to account for spatial autocorrelation.
Implementation: Vegetation plots were assigned to a grid of 10 km × 10 km cells, randomly sampled for each fold.
Hyperparameter Tuning: Approach: Meticulous tuning of each model's main hyperparameters while keeping default configurations for others.
Objective: Optimize performance and ensure fairness in model comparison.
Feature Encoding: Categorical Variables: Transformed using one-hot encoding.
Numerical Features: Left untouched.
Species Encoding Techniques: Cover-Abundance: Natural logarithm of raw data, transformed to arithmetic mid-point percent cover.
Presence/Absence: Binarization of raw data.
Reciprocal Rank: Inverse of the ordinal ranking of species based on cover-abundance.
Noise Addition: Method: Introduced 30% dropout to the input data to enhance robustness: when evaluating the performance of the models, each present species was given a 30% chance of being randomly considered absent.
Purpose: Mitigate overfitting and improve generalization by encouraging models to identify transferable patterns.
Standardization: Process: Standardized features to a mean of zero and a standard deviation of one for improved numerical stability and model performance.
Feature encoding played a crucial role in the model's performance. The study explored three distinct techniques for encoding plant species data:
Cover-Abundance: Description: Uses the natural logarithm of raw cover-abundance data.
Transformation: Converted scale values to arithmetic mid-point percent cover.
Performance: Provided detailed information but was not always the best encoding method.
Presence/Absence: Description: Binarizes raw data, converting non-zero entries to one and preserving explicit zeros.
Transformation: Simplifies species data to presence or absence.
Performance: Useful for models that do not require detailed abundance information.
Reciprocal Rank: Description: Uses the inverse of the ordinal ranking of species based on cover-abundance.
Transformation: Ranks species in descending order and associates them with the inverse of their position.
Performance: Often led to better top-one performance, highlighting the importance of dominant species.
Macro-average multiclass accuracy:
This metric calculates the overall model performance by averaging the individual class-wise accuracies across all classes (for k=1 and k=3). By doing so, habitats with a small number of vegetation plots are given equal weight to those with a large number of plots, ensuring that each habitat contributes equally to the assessment. The study also tested imbalanced top-one and top-three losses against cross-entropy loss; these appear to perform better than regular cross-entropy. The best-performing model was a Multi-Layer Perceptron (MLP) with features encoded using the reciprocal rank method. This configuration outperformed other models in terms of top-one micro-average multiclass accuracy and demonstrated a good balance between predictive performance and computational complexity. The paper also discusses the interpretability of the models used for habitat classification, highlighting the use of advanced algorithms like integrated gradients and feature ablation to enhance understanding. These methods help identify which plant species contribute most to the model's output, making it easier for researchers and practitioners to understand the reasoning behind habitat assignments. Key findings include:
Vascular plant species contribute significantly (around 85%) to habitat classification.
The most dominant species in a vegetation plot are crucial, with the top two species often contributing over 50% of the total importance.
Herbaceous species are generally more important than arborescent species, except in forests and wooded areas.
Plant species composition alone can maintain or slightly improve model accuracy, even without environmental or location features.
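For illustration, the three species-encoding schemes compared in the study can be sketched on toy data (this is not the paper's code; `log1p` stands in here for the paper's log-of-mid-point-cover transformation):

```python
import numpy as np

# Toy cover-abundance vector for one plot: % cover per species.
cover = np.array([40.0, 5.0, 0.0, 1.0, 0.0])

# Presence/absence: binarize non-zero entries.
presence = (cover > 0).astype(float)

# Cover-abundance: natural log of cover (log1p keeps absent species at zero).
log_cover = np.log1p(cover)

# Reciprocal rank: 1/rank of each present species, ranked by descending cover.
reciprocal = np.zeros_like(cover)
order = np.argsort(-cover)          # indices sorted by descending cover
present = cover[order] > 0
ranks = np.arange(1, len(cover) + 1)
reciprocal[order[present]] = 1.0 / ranks[present]

print(presence)    # [1. 1. 0. 1. 0.]
print(reciprocal)  # dominant species -> 1.0, second -> 0.5, third -> 1/3
```

The reciprocal-rank scheme concentrates weight on the dominant species, which matches the finding above that the top two species often carry most of the importance.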
Integrated Gradients
Integrated Gradients is a technique used to attribute the prediction of a deep learning model to its input features. It helps in understanding which features are most influential in the model's decision-making process. This method is particularly useful for interpreting complex models, such as neural networks, where the relationship between inputs and outputs is not straightforward.
How It's Done: Baseline Selection: Choose a baseline input, which represents the absence of features. For image data, this could be a black image; for tabular data, it could be zeros or mean values. The attribution is computed along the straight-line path from this baseline to the actual input.
Gradient Calculation:&nbsp;Compute the gradients of the model's output with respect to the inputs along this path.
Integration:&nbsp;Integrate these gradients along the path to get the integrated gradients. This integration step accumulates the gradients, providing a measure of the contribution of each feature to the model's output.
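The steps above can be sketched numerically for a toy differentiable model (an assumed quadratic for illustration; in practice this is applied to a neural network, e.g. via a library such as Captum):

```python
import numpy as np

# Numeric integrated-gradients sketch: approximate the path integral of
# gradients from the baseline to the input with a midpoint Riemann sum.
w = np.array([2.0, -1.0, 0.5])

def f(x):
    return float(w @ (x ** 2))  # toy model: f(x) = sum_i w_i * x_i^2

def grad_f(x):
    return 2.0 * w * x          # analytic gradient of the toy model

def integrated_gradients(x, baseline, steps=1000):
    # Average gradients at points interpolated between baseline and input,
    # then scale by (input - baseline).
    alphas = (np.arange(steps) + 0.5) / steps
    grads = np.mean([grad_f(baseline + a * (x - baseline)) for a in alphas], axis=0)
    return (x - baseline) * grads

x = np.array([1.0, 2.0, 3.0])
baseline = np.zeros(3)
attr = integrated_gradients(x, baseline)
# Completeness axiom: attributions sum to f(x) - f(baseline).
assert abs(attr.sum() - (f(x) - f(baseline))) < 1e-6
```

The completeness check at the end is the key property of the method: the per-feature attributions account exactly for the change in output relative to the baseline.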
The integrated gradients for a feature represent the feature's contribution to the prediction, relative to the baseline.
Feature Ablation
What It Means: Feature Ablation is a simpler and more intuitive method for understanding the importance of features in a model. It involves systematically removing or altering specific features and observing the change in the model's performance. This helps in identifying which features are critical for the model's predictions.
How It's Done: Feature Removal:&nbsp;Select a feature to be "ablated" or removed. This can be done by setting the feature's value to zero, replacing it with a mean or median value, or removing it entirely from the input.
Model Evaluation:&nbsp;Run the model with the ablated feature and evaluate its performance using metrics such as accuracy, precision, recall, or loss.
Comparison:&nbsp;Compare the model's performance with the ablated feature to its performance with the original feature. A significant drop in performance indicates that the ablated feature is important for the model's predictions.
Iteration:&nbsp;Repeat the process for each feature to determine the importance of all features in the model. Feature ablation can be computationally intensive, especially for models with a large number of features, as it requires multiple evaluations of the model.
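The ablation loop above can be sketched on toy data (hypothetical model and features, not the paper's setup):

```python
import numpy as np

# Illustrative feature-ablation sketch: replace each feature with its mean
# and measure the drop in accuracy of a fixed toy classifier.
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 3))
y = (X[:, 0] > 0).astype(int)          # only feature 0 matters here

def predict(X):
    return (X[:, 0] > 0).astype(int)   # stand-in for a trained model

baseline_acc = (predict(X) == y).mean()
importances = {}
for j in range(X.shape[1]):
    X_ablated = X.copy()
    X_ablated[:, j] = X[:, j].mean()   # ablate feature j
    acc = (predict(X_ablated) == y).mean()
    importances[j] = baseline_acc - acc

# Feature 0 shows a large accuracy drop; features 1 and 2 show none.
```

Because each feature requires a full re-evaluation of the model, this grows linearly in the number of features, which is the computational cost noted above.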
<br><a data-href="feature_importance.png" href="online-vault/papers/feature_importance.html" class="internal-link" target="_self" rel="noopener nofollow">feature_importance.png</a>The paper also discusses potential improvements for the practical application of the habitat classification framework, highlighting several key points:
Species Name Standardization: The use of the GBIF Backbone Taxonomy for standardizing species names ensures consistency across datasets but comes with trade-offs, such as the loss of local taxonomic nuances and potential misclassification of species. This standardization is crucial but can impact the accuracy of habitat classification.
Taxonomic Diversity: The model's effectiveness is linked to the diversity of vascular plant species in the training dataset (EVA). The model, trained on a subset of 10,481 species, may struggle with species not included in the training data, leading to classification errors. Expanding the training dataset to include more species could enhance the model's robustness and applicability.
Dynamic Habitat Definitions: The framework relies on predefined EUNIS habitats, which are subject to revision and evolution. The dynamic nature of environmental classifications and the impact of climate change and human activities on biodiversity necessitate periodic retraining of the models to adapt to new or changing habitat definitions.
Future Enhancements: Suggestions for improvement include broadening the training dataset, incorporating additional data sources, collaborating with experts, and being less cautious during data curation to include rare or ambiguously named species. Additionally, leveraging AI techniques to define new habitat classes and handle non-standardized nomenclature could offer new opportunities for enhancing the framework's versatility and applicability. 
In conclusion, the deep-learning framework demonstrates high accuracy in assigning vegetation plots to EUNIS habitats, outperforming existing European expert systems. It emphasizes the importance of dominant species and overall species composition, offering flexibility for various applications. The framework's development represents a significant advancement in efficient and accurate habitat classification.The study highlights the effectiveness of deep learning models, particularly the MLP with reciprocal rank encoding, in enhancing habitat classification accuracy. <br>The framework is shared via a <a data-tooltip-position="top" aria-label="https://github.com/cesar-leblanc/hdm-framework/tree/main/Experiments" rel="noopener nofollow" class="external-link is-unresolved" href="https://github.com/cesar-leblanc/hdm-framework/tree/main/Experiments" target="_self">GitHub repository</a>, making it accessible for researchers and practitioners to accurately classify habitats. The results highlight the importance of incorporating advanced technologies into habitat monitoring and showcase the effectiveness of species dominance as a marker of ecosystems.]]></description><link>online-vault/papers/a-deep-learning-framework-for-enhancing-habitat-identification-based-on-species-composition.html</link><guid isPermaLink="false">Online Vault/Papers/A deep-learning framework for enhancing habitat identification based on species composition.md</guid><pubDate>Fri, 29 Nov 2024 10:51:11 GMT</pubDate><enclosure url="." 
length="0" type="false"/><content:encoded>&lt;figure&gt;&lt;img src=&quot;.&quot;&gt;&lt;/figure&gt;</content:encoded></item><item><title><![CDATA[feature_importance]]></title><description><![CDATA[<img src="online-vault/papers/feature_importance.png" target="_self">]]></description><link>online-vault/papers/feature_importance.html</link><guid isPermaLink="false">Online Vault/Papers/feature_importance.png</guid><pubDate>Fri, 29 Nov 2024 10:26:49 GMT</pubDate><enclosure url="." length="0" type="false"/><content:encoded>&lt;figure&gt;&lt;img src=&quot;.&quot;&gt;&lt;/figure&gt;</content:encoded></item><item><title><![CDATA[hdm_framework]]></title><description><![CDATA[<img src="online-vault/papers/hdm_framework.png" target="_self">]]></description><link>online-vault/papers/hdm_framework.html</link><guid isPermaLink="false">Online Vault/Papers/hdm_framework.png</guid><pubDate>Fri, 29 Nov 2024 10:23:27 GMT</pubDate><enclosure url="." length="0" type="false"/><content:encoded>&lt;figure&gt;&lt;img src=&quot;.&quot;&gt;&lt;/figure&gt;</content:encoded></item><item><title><![CDATA[Pasted image 20241128111651]]></title><description><![CDATA[<img src="online-vault/papers/pasted-image-20241128111651.png" target="_self">]]></description><link>online-vault/papers/pasted-image-20241128111651.html</link><guid isPermaLink="false">Online Vault/Papers/Pasted image 20241128111651.png</guid><pubDate>Thu, 28 Nov 2024 10:16:52 GMT</pubDate><enclosure url="." length="0" type="false"/><content:encoded>&lt;figure&gt;&lt;img src=&quot;.&quot;&gt;&lt;/figure&gt;</content:encoded></item><item><title><![CDATA[Handling API Keys]]></title><description><![CDATA[Handling sensitive API keys securely is crucial to prevent unauthorized access and potential security breaches. Here are some best practices for managing API keys in software: Environment Variables: Store API keys in environment variables rather than hardcoding them into your source code. This keeps them out of version control systems.
Example:
export API_KEY=your_api_key_here Configuration Files: Use configuration files that are not included in version control. Ensure these files are secured and have appropriate permissions.
Example:
{ "api_key": "your_api_key_here"
} Secrets Management Services: Utilize secrets management services like AWS Secrets Manager, Azure Key Vault, or HashiCorp Vault to store and manage API keys securely.
These services provide additional features like access control, auditing, and automatic key rotation. Encryption: Encrypt API keys when storing them in configuration files or databases. Use strong encryption algorithms and securely manage encryption keys. Example:
from cryptography.fernet import Fernet

key = Fernet.generate_key()
cipher_suite = Fernet(key)
encrypted_api_key = cipher_suite.encrypt(b"your_api_key_here")
Access Control: Restrict access to API keys to only those who need them. Implement role-based access control (RBAC) to manage permissions.
Ensure that API keys are not shared or exposed unnecessarily. Key Rotation: Regularly rotate API keys to minimize the risk of exposure. Automate the rotation process using secrets management services.
Example:
# Rotate API key using AWS Secrets Manager
aws secretsmanager rotate-secret --secret-id your_secret_id Audit and Monitoring: Monitor the usage of API keys to detect any unusual activity. Implement logging and alerting mechanisms to track access and changes.
Regularly audit access logs to ensure compliance with security policies. Least Privilege Principle: Grant the minimum necessary permissions to API keys. Avoid using overly permissive keys that can access more data or perform more actions than required.
Example:
{ "api_key": "limited_access_api_key"
} Secure Communication: Always use secure communication protocols (e.g., HTTPS) when transmitting API keys over the network.
Ensure that API endpoints are secured with proper authentication and authorization mechanisms. Code Reviews and Security Audits: Conduct regular code reviews and security audits to identify and mitigate potential vulnerabilities related to API key management.
Use static analysis tools to detect hardcoded secrets in the codebase. By following these best practices, you can significantly reduce the risk of API key exposure and enhance the overall security of your software.Storing environment variables can be done in several ways, depending on your needs and the context in which you're working. Here are some common methods:For personal or development environments, you can store environment variables in shell configuration files like .zshenv, .bashrc, or .bash_profile. This is useful for variables that need to be available across different projects and sessions.For project-specific environment variables, it's often better to store them within the project directory. This keeps the variables isolated to the project and makes it easier to manage different environments for different projects.A common practice is to use a .env file within your project directory. This file can be read by various tools and frameworks to set environment variables.Example .env file:MISTRAL_API_KEY=your_key_here
Loading the .env file: Using a Tool like dotenv:
Many programming languages and frameworks have libraries to load .env files. For example, in Python, you can use the python-dotenv library:
from dotenv import load_dotenv
import os

load_dotenv()  # Load environment variables from a .env file
api_key = os.getenv('MISTRAL_API_KEY')
print(api_key) Manually in a Shell Script:
You can source the .env file in a shell script:
set -a
source .env
set +a There are tools specifically designed for managing environment variables, such as:
Direnv: Automatically loads and unloads environment variables based on your current directory.
EnvKey: A service for managing environment variables securely.
For continuous integration and continuous deployment (CI/CD) pipelines, environment variables are often stored in the pipeline configuration. Most CI/CD tools (e.g., GitHub Actions, GitLab CI, Jenkins) allow you to set environment variables securely.If you're deploying to a cloud service, many platforms provide ways to set environment variables:
AWS: Use AWS Systems Manager Parameter Store or AWS Secrets Manager.
Azure: Use Azure Key Vault.
Google Cloud: Use Google Cloud Secret Manager.
If you decide to store the environment variable in a project file, here’s how you might do it: Create a .env file in your project directory:
touch .env Add your environment variable to the .env file:
echo 'MISTRAL_API_KEY=your_key_here' &gt;&gt; .env Load the .env file in your application: Python Example:
from dotenv import load_dotenv
import os

load_dotenv()  # Load environment variables from a .env file
api_key = os.getenv('MISTRAL_API_KEY')
print(api_key) Node.js Example:
require('dotenv').config();

const apiKey = process.env.MISTRAL_API_KEY;
console.log(apiKey); Do not commit sensitive information: Ensure that your .env file is not committed to version control. Add it to your .gitignore file:
echo '.env' &gt;&gt; .gitignore Use secure storage: For production environments, consider using secure storage solutions like AWS Secrets Manager, Azure Key Vault, or Google Cloud Secret Manager. By following these guidelines, you can effectively manage and store environment variables for your projects.]]></description><link>online-vault/software-engineering/handling-api-keys.html</link><guid isPermaLink="false">Online Vault/Software Engineering/Handling API Keys.md</guid><pubDate>Wed, 27 Nov 2024 15:10:34 GMT</pubDate></item><item><title><![CDATA[Jupyter Kernel bugs]]></title><description><![CDATA[Here are several ways to ensure Jupyter is using the correct Conda environment:# Activate your desired environment
conda activate biotope-3

# Install ipykernel
conda install ipykernel

# Add the environment as a Jupyter kernel
python -m ipykernel install --user --name biotope-3 --display-name "Python (biotope-3)"
-&gt; In DataSpell, select the kernel "Python (biotope-3)" so your notebooks run in that environment.In a Jupyter notebook, you can check the current environment:import sys
import os

# Show the current Python executable path
print(sys.executable)

# Show the current Conda environment
print(os.environ.get('CONDA_PREFIX')) If the kernels don't match, try removing and reinstalling:
# Remove a specific kernel
jupyter kernelspec uninstall biotope-3
# Then reinstall as shown above When you open Jupyter, check the kernel dropdown
Select "Python (biotope-3)" if available
Restart kernel to ensure it's using the correct environment
]]></description><link>online-vault/software-engineering/jupyter-kernel-bugs.html</link><guid isPermaLink="false">Online Vault/Software Engineering/Jupyter Kernel bugs.md</guid><pubDate>Tue, 26 Nov 2024 13:53:44 GMT</pubDate></item><item><title><![CDATA[Hugging Face Transformers library]]></title><description><![CDATA[The Hugging Face Transformers library is a popular open-source library that provides pre-trained models for natural language processing (NLP) tasks. It is built on top of PyTorch and TensorFlow, offering a wide range of architectures and pre-trained weights for various NLP tasks such as text classification, question answering, language translation, and more.To get started with the Transformers library, you can install it using pip:pip install transformers
The Transformers library offers several key features that make it a powerful tool for NLP tasks: Pre-trained Models: The library provides a vast collection of pre-trained models, including BERT, RoBERTa, DistilBERT, T5, and many others. These models have been trained on large datasets and can be fine-tuned for specific tasks. Model Architectures: The library supports a wide range of model architectures, including transformers, convolutional neural networks (CNNs), and recurrent neural networks (RNNs). This allows you to choose the architecture that best suits your task. Tokenizers: The library includes tokenizers for various languages and models, making it easy to preprocess text data for input into the models. Datasets: The library integrates with the Hugging Face Datasets library, providing access to a wide range of datasets for training and evaluation. Trainer API: The library provides a high-level Trainer API that simplifies the training and evaluation of models. The Trainer API handles many of the details of training, such as data loading, optimization, and evaluation. For more detailed information, you can refer to the official <a data-tooltip-position="top" aria-label="https://huggingface.co/transformers/" rel="noopener nofollow" class="external-link is-unresolved" href="https://huggingface.co/transformers/" target="_self">Hugging Face Transformers documentation</a>. The documentation provides comprehensive guides, tutorials, and API references to help you get the most out of the library.Additionally, the Hugging Face community is very active, and you can find many resources, including forums, blog posts, and code examples, to help you with your NLP projects.The Hugging Face Transformers library is a powerful and flexible tool for natural language processing. With its wide range of pre-trained models, model architectures, and easy-to-use APIs, it enables developers to build and deploy state-of-the-art NLP applications with ease. 
Whether you are a beginner or an experienced NLP practitioner, the Transformers library has something to offer you.]]></description><link>online-vault/software-engineering/hugging-face-transformers-library.html</link><guid isPermaLink="false">Online Vault/Software Engineering/Hugging Face Transformers library.md</guid><pubDate>Tue, 29 Oct 2024 14:21:27 GMT</pubDate></item><item><title><![CDATA[LangChain 🦜]]></title><description><![CDATA[<a data-tooltip-position="top" aria-label="https://www.langchain.com/" rel="noopener nofollow" class="external-link is-unresolved" href="https://www.langchain.com/" target="_self">LangChain</a> is an <a data-tooltip-position="top" aria-label="https://github.com/langchain-ai/langchain" rel="noopener nofollow" class="external-link is-unresolved" href="https://github.com/langchain-ai/langchain" target="_self">open-source</a> (MIT License) framework designed to simplify the development of applications that leverage language models. It provides a suite of tools and abstractions that make it easier to build, test, and deploy language model-based applications.<br><img alt="langchain_framekwork.png" src="online-vault/software-engineering/langchain_framekwork.png" target="_self">
Here's a summary of what LangChain does: Modular Components: LangChain offers modular components that can be combined to create complex language model applications. These components include prompts, models, chains, agents, memory, and more. Prompt Management: LangChain helps manage prompts, which are the inputs given to language models. It provides tools to create, format, and optimize prompts for better results. Chains: LangChain allows chaining together multiple calls to language models, enabling complex, multi-step workflows. These chains can be simple sequential calls or more complex branching logic. Agents: LangChain provides agents that can use tools and make decisions based on the outputs of language models. Agents can perform tasks like web browsing, API interaction, or even using other language models. Memory: LangChain includes memory systems that allow language models to maintain context across multiple interactions. This is crucial for building conversational agents and other stateful applications. Evaluation: LangChain offers tools for evaluating and testing language model applications, helping developers to iterate and improve their systems. Integration: LangChain is designed to be model-agnostic, allowing it to integrate with various language models and APIs, including those from Hugging Face, OpenAI, and others. In essence, LangChain is a framework that helps developers harness the power of language models more effectively, enabling the creation of sophisticated AI applications with less effort.]]></description><link>online-vault/software-engineering/langchain-🦜.html</link><guid isPermaLink="false">Online Vault/Software Engineering/LangChain 🦜.md</guid><pubDate>Tue, 29 Oct 2024 13:43:06 GMT</pubDate><enclosure url="." 
length="0" type="false"/><content:encoded>&lt;figure&gt;&lt;img src=&quot;.&quot;&gt;&lt;/figure&gt;</content:encoded></item><item><title><![CDATA[langchain_framekwork]]></title><description><![CDATA[<img src="online-vault/software-engineering/langchain_framekwork.png" target="_self">]]></description><link>online-vault/software-engineering/langchain_framekwork.html</link><guid isPermaLink="false">Online Vault/Software Engineering/langchain_framekwork.png</guid><pubDate>Tue, 29 Oct 2024 13:40:12 GMT</pubDate><enclosure url="." length="0" type="false"/><content:encoded>&lt;figure&gt;&lt;img src=&quot;.&quot;&gt;&lt;/figure&gt;</content:encoded></item><item><title><![CDATA[Multi-task Learning]]></title><description><![CDATA[<img alt="multitask_general.png" src="online-vault/images/multitask_general.png" target="_self"><br>Definition
Multi-task Learning is a sub-field of <a data-href="Machine Learning" href="online-vault/ml-concepts/machine-learning.html" class="internal-link" target="_self" rel="noopener nofollow">Machine Learning</a> and, by extension, of <a data-href="deep learning" href="online-vault/ml-concepts/deep-learning.html" class="internal-link" target="_self" rel="noopener nofollow">deep learning</a>. While it is somewhat similar in concept to <a data-tooltip-position="top" aria-label="Multimodal Deep learning" data-href="Multimodal Deep learning" href="online-vault/ml-concepts/models/multimodal-deep-learning.html" class="internal-link" target="_self" rel="noopener nofollow">Multi-modal Learning</a>, there are a few differences.
The general idea of multi-task learning is to share the input, a model, or layers of a model in order to build a shared representation of the input data that can be used to solve multiple tasks or problems at once. This can be useful when data is limited or when we explicitly want to build a shared representation. It works only if the data is relevant to all the tasks, ideally with comparable amounts of labeled data for each task.There are multiple benefits to this approach, notably:
We train a single large model for multiple problems, so inference is faster and the architecture is more compact.
The label data can be sparse, meaning we can train on Task 1 with some Task 2 labels missing and vice-versa.
There is implicit regularization, since we optimize the loss over the ensemble of tasks rather than a single one, limiting overfitting to any one task.
Example
One example is multi-lingual translation, where the input text is the same, but is translated in different languages. This in turn builds a shared representation of multiple languages, if we manage to train the model successfully.
<br><img alt="soft_vs_hard_param_sharing.png" src="online-vault/images/soft_vs_hard_param_sharing.png" target="_self">There are two main ways to share parameters in multi-task learning: Hard parameter sharing and Soft parameter sharing.In Hard parameter sharing, the shared layers use exactly the same weights for all tasks. This means that if we want to add or remove a task from the training set, we need to retrain the shared network. In Soft parameter sharing, each task has its own layers, whose weights are constrained by each other (typically via a regularization term) so they stay similar but not identical. This means that if a task is added or removed from the training set, we can retrain individual task models without affecting all the others.The total loss is usually computed by weighted summation over the per-task losses, optionally with a regularization term:
L_total(w, θ) = Σ_{i=1}^{N} w_i L_i(θ_i) + λ R(θ)
where: w is the weight vector, θ_i represents the parameters of each task (e.g. weights and biases), L_i is the loss function for each individual task, N is the number of tasks, and R(θ) is a regularization term that encourages the model to generalize well across all tasks
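As a minimal illustration of this weighted summation, a plain-Python sketch (the function name is ours, not from any specific framework):

```python
# Sketch of a weighted multi-task loss: total = sum_i w_i * L_i, plus an
# optional regularization term (lambda * R already folded into reg_term).
def multitask_loss(task_losses, weights, reg_term=0.0):
    assert len(task_losses) == len(weights), "one weight per task"
    return sum(w * l for w, l in zip(weights, task_losses)) + reg_term

# Three tasks with losses L_i and weights w_i, no regularization:
total = multitask_loss([0.5, 1.2, 0.3], [1.0, 0.5, 2.0])  # 0.5 + 0.6 + 0.6 = 1.7
```

In practice the task losses would be tensors produced by each task head, and the weights either fixed hyperparameters or learned during training.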
The weight vector can be learned during training by minimizing the total weighted sum of losses over all tasks. The goal is often to find the optimal values for w and θ such that both the individual task losses (the L_i terms) and the regularization term (R(θ)) are minimized.(Diagram: neural network input, shared layers, then separate heads for Task 1, Task 2, …, Task N)]]></description><link>online-vault/ml-concepts/architecture-design/multi-task-learning.html</link><guid isPermaLink="false">Online Vault/ML concepts/Architecture Design/Multi-task Learning.md</guid><pubDate>Tue, 15 Oct 2024 15:12:15 GMT</pubDate><enclosure url="." length="0" type="false"/><content:encoded>&lt;figure&gt;&lt;img src=&quot;.&quot;&gt;&lt;/figure&gt;</content:encoded></item><item><title><![CDATA[soft_vs_hard_param_sharing]]></title><description><![CDATA[<img src="online-vault/images/soft_vs_hard_param_sharing.png" target="_self">]]></description><link>online-vault/images/soft_vs_hard_param_sharing.html</link><guid isPermaLink="false">Online Vault/Images/soft_vs_hard_param_sharing.png</guid><pubDate>Tue, 15 Oct 2024 14:13:24 GMT</pubDate><enclosure url="." length="0" type="false"/><content:encoded>&lt;figure&gt;&lt;img src=&quot;.&quot;&gt;&lt;/figure&gt;</content:encoded></item><item><title><![CDATA[multitask_general]]></title><description><![CDATA[<img src="online-vault/images/multitask_general.png" target="_self">]]></description><link>online-vault/images/multitask_general.html</link><guid isPermaLink="false">Online Vault/Images/multitask_general.png</guid><pubDate>Tue, 15 Oct 2024 14:10:22 GMT</pubDate><enclosure url="." length="0" type="false"/><content:encoded>&lt;figure&gt;&lt;img src=&quot;.&quot;&gt;&lt;/figure&gt;</content:encoded></item><item><title><![CDATA[Blockwise splitting]]></title><description><![CDATA[
It involves dividing the data into blocks or patches of a specified size and then randomly assigning blocks to either the training or test set. This ensures that the training and test sets are geographically diverse, preventing bias and improving generalization.Stratified sampling is a statistical method used to ensure that a sample is representative of a diverse population. It involves dividing the population into subgroups or strata based on specific characteristics and then randomly selecting samples from each stratum. This method helps to prevent bias and improve the accuracy of research findings.]]></description><link>online-vault/spatial-data-science/blockwise-splitting.html</link><guid isPermaLink="false">Online Vault/Spatial Data Science/Blockwise splitting.md</guid><pubDate>Wed, 02 Oct 2024 09:51:53 GMT</pubDate></item><item><title><![CDATA[Why LLMs suck at counting]]></title><description><![CDATA[Resource : <a data-tooltip-position="top" aria-label="https://www.youtube.com/watch?v=3FbJOKhLv9M" rel="noopener nofollow" class="external-link is-unresolved" href="https://www.youtube.com/watch?v=3FbJOKhLv9M" target="_self"># Why vector search is not enough and we need BM25</a>
Numbers are processed as their value in language instead of their numerical value.<br>As such, the semantic position of "100" in the <a data-href="Latent spaces" href="online-vault/ml-concepts/latent-spaces.html" class="internal-link" target="_self" rel="noopener nofollow">Latent spaces</a> of the LLM is not necessarily in between "150" and "50", as they are treated as words more than numericals.<br><img alt="Semantic_position_numbers.png" src="online-vault/images/semantic_position_numbers.png" target="_self">Another factor is that large numbers can be tokenized differently depending on the model, resulting in a different embedding.<br><img alt="number_tokenization.png" src="online-vault/images/number_tokenization.png" target="_self">]]></description><link>online-vault/ml-concepts/why-llms-suck-at-counting.html</link><guid isPermaLink="false">Online Vault/ML concepts/Why LLMs suck at counting.md</guid><pubDate>Tue, 01 Oct 2024 15:30:43 GMT</pubDate><enclosure url="." length="0" type="false"/><content:encoded>&lt;figure&gt;&lt;img src=&quot;.&quot;&gt;&lt;/figure&gt;</content:encoded></item><item><title><![CDATA[number_tokenization]]></title><description><![CDATA[<img src="online-vault/images/number_tokenization.png" target="_self">]]></description><link>online-vault/images/number_tokenization.html</link><guid isPermaLink="false">Online Vault/Images/number_tokenization.png</guid><pubDate>Tue, 01 Oct 2024 15:30:33 GMT</pubDate><enclosure url="." length="0" type="false"/><content:encoded>&lt;figure&gt;&lt;img src=&quot;.&quot;&gt;&lt;/figure&gt;</content:encoded></item><item><title><![CDATA[Semantic_position_numbers]]></title><description><![CDATA[<img src="online-vault/images/semantic_position_numbers.png" target="_self">]]></description><link>online-vault/images/semantic_position_numbers.html</link><guid isPermaLink="false">Online Vault/Images/Semantic_position_numbers.png</guid><pubDate>Tue, 01 Oct 2024 15:27:09 GMT</pubDate><enclosure url="." 
length="0" type="false"/><content:encoded>&lt;figure&gt;&lt;img src=&quot;.&quot;&gt;&lt;/figure&gt;</content:encoded></item><item><title><![CDATA[Prompt Engineering]]></title><description><![CDATA[Prompt engineering is a crucial skill in the era of <a data-href="large language models" href=".html" class="internal-link" target="_self" rel="noopener nofollow">large language models</a> (LLMs). It involves crafting effective inputs to guide AI models towards producing desired outputs. This article explores key techniques and best practices in prompt engineering.Zero-shot prompting refers to the ability of an AI model to perform a task without any specific examples or training for that task.<br><img alt="zero_shot.png" src="online-vault/ml-concepts/models/large-language-models/zero_shot.png" target="_self"><br>
<a data-tooltip-position="top" aria-label="https://www.youtube.com/watch?v=5k1zkYCuF-8" rel="noopener nofollow" class="external-link is-unresolved" href="https://www.youtube.com/watch?v=5k1zkYCuF-8" target="_self">(figures source)</a>
Relies on the model's pre-existing knowledge
Useful for straightforward tasks or when examples aren't available
Can be less accurate for complex or nuanced tasks
Example:Translate the following English text to French:
"Hello, how are you?"
Few-shot prompting involves providing the model with a small number of examples before asking it to perform a task.<br><img alt="few_shot.png" src="online-vault/ml-concepts/models/large-language-models/few_shot.png" target="_self">
Improves performance on specific or nuanced tasks
Helps the model understand the desired format or style
Typically more effective than zero-shot for complex tasks
Example:Translate English to French:
English: Good morning
French: Bonjour English: How are you?
French: Comment allez-vous? English: Have a nice day
French: [Your translation here]
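The few-shot pattern above can also be assembled programmatically; a minimal sketch in Python (the helper name and structure are illustrative, not from any library):

```python
def build_few_shot_prompt(instruction, examples, query):
    """Assemble a few-shot translation prompt from (source, target) example pairs."""
    lines = [instruction]
    for src, tgt in examples:
        lines.append(f"English: {src}")
        lines.append(f"French: {tgt}")
    lines.append(f"English: {query}")
    lines.append("French:")  # left open for the model to complete
    return "\n".join(lines)

prompt = build_few_shot_prompt(
    "Translate English to French:",
    [("Good morning", "Bonjour"), ("How are you?", "Comment allez-vous?")],
    "Have a nice day",
)
```

Keeping the example format rigidly consistent, as this helper does, is part of what makes few-shot prompting effective: the model imitates the pattern it is shown.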
Chain-of-thought prompting encourages the model to break down complex problems into steps, mimicking human reasoning.
Improves performance on multi-step or logical tasks
Helps in understanding the model's reasoning process
Useful for catching and correcting errors in logic
Example:Solve this word problem step by step:
If a train travels 120 km in 2 hours, what is its average speed in km/h? Step 1: [Model's step-by-step reasoning]
Step 2: [Continues reasoning]
...
Final Answer: [Model's conclusion] Be Specific: Clearly state the task, desired output format, and any constraints. Use Context: Provide relevant background information when necessary. Iterate and Refine: Test prompts and refine based on the outputs. Avoid Ambiguity: Use precise language to prevent misinterpretation. Leverage Model Knowledge: Phrase prompts to tap into the model's pre-existing knowledge. Consider Ethical Implications: Be mindful of potential biases and ethical concerns in prompt design. Use Role-Playing: Assign a specific role or persona to the AI for specialized tasks. Combine Techniques: Mix different prompting strategies for complex tasks. By mastering these techniques and following best practices, you can significantly improve the effectiveness of your interactions with AI language models, leading to more accurate, relevant, and useful outputs.While prompt engineering is a powerful tool for improving AI model outputs, it's important to understand its limitations and recognize some overused methods that may not always be effective. Model Capabilities: No amount of prompt engineering can make a model perform tasks beyond its fundamental capabilities or knowledge base. Consistency: Even with well-crafted prompts, model outputs can be inconsistent, especially for complex tasks. Bias Amplification: Poorly designed prompts can inadvertently amplify biases present in the model's training data. Generalization: Prompts optimized for specific tasks or datasets may not generalize well to new situations. Computational Cost: Complex prompting techniques (e.g., few-shot with many examples) can increase token usage and processing time. Prompt Sensitivity: Small changes in prompt wording can sometimes lead to significant changes in output, making it challenging to maintain reliability. Limited Context Window: The maximum input length restricts the amount of context or examples that can be included in a prompt. 
Excessive Few-Shot Examples: Overloading prompts with too many examples can be counterproductive and may not improve results proportionally. Overly Complex Instructions: Extremely detailed or convoluted instructions can confuse the model rather than guide it effectively. Reliance on Specific Phrases: Overusing phrases like "You are an expert in..." or "Respond as if you were..." may not significantly enhance performance. Ignoring Model Versions: Applying techniques optimized for one model version across different versions or models without adjustment. Neglecting Task-Specific Tuning: Over-relying on general prompt techniques without considering the unique aspects of specific tasks. Prompt Chaining Without Validation: Excessively chaining prompts without validating intermediate outputs, potentially compounding errors. Overemphasis on Formatting: Focusing too much on output formatting at the expense of content quality. Neglecting Ethical Considerations: Overlooking potential ethical implications of prompts, especially in sensitive domains. To overcome these limitations and avoid overused methods:
Understand the model's core capabilities and limitations.
Regularly test and validate prompt effectiveness.
Use the simplest effective prompt for each task.
Consider fine-tuning models for specific applications when appropriate.
Stay updated on new prompting techniques and best practices.
Prioritize ethical considerations in prompt design.
Combine prompt engineering with other AI development techniques for optimal results.
By recognizing these limitations and avoiding overreliance on certain methods, practitioners can use prompt engineering more effectively as part of a broader AI development strategy.]]></description><link>online-vault/ml-concepts/models/large-language-models/prompt-engineering.html</link><guid isPermaLink="false">Online Vault/ML concepts/Models/Large language models/Prompt Engineering.md</guid><pubDate>Tue, 01 Oct 2024 15:22:16 GMT</pubDate><enclosure url="." length="0" type="false"/><content:encoded>&lt;figure&gt;&lt;img src=&quot;.&quot;&gt;&lt;/figure&gt;</content:encoded></item><item><title><![CDATA[few_shot]]></title><description><![CDATA[<img src="online-vault/ml-concepts/models/large-language-models/few_shot.png" target="_self">]]></description><link>online-vault/ml-concepts/models/large-language-models/few_shot.html</link><guid isPermaLink="false">Online Vault/ML concepts/Models/Large language models/few_shot.png</guid><pubDate>Tue, 01 Oct 2024 15:17:23 GMT</pubDate><enclosure url="." length="0" type="false"/><content:encoded>&lt;figure&gt;&lt;img src=&quot;.&quot;&gt;&lt;/figure&gt;</content:encoded></item><item><title><![CDATA[zero_shot]]></title><description><![CDATA[<img src="online-vault/ml-concepts/models/large-language-models/zero_shot.png" target="_self">]]></description><link>online-vault/ml-concepts/models/large-language-models/zero_shot.html</link><guid isPermaLink="false">Online Vault/ML concepts/Models/Large language models/zero_shot.png</guid><pubDate>Tue, 01 Oct 2024 15:16:41 GMT</pubDate><enclosure url="." length="0" type="false"/><content:encoded>&lt;figure&gt;&lt;img src=&quot;.&quot;&gt;&lt;/figure&gt;</content:encoded></item><item><title><![CDATA[Multi-output Regression Neural Network]]></title><description><![CDATA[<img alt="NN_archi_multi_output.png" src="online-vault/ml-concepts/architecture-design/nn_archi_multi_output.png" target="_self">
An OutputRangeLayer is used to scale the network's outputs to a specific range.
Implementation:
class OutputRangeLayer(tf.keras.layers.Layer):
    def __init__(self, output_range, name=None):
        super(OutputRangeLayer, self).__init__(name=name)
        self.output_range = output_range

    def call(self, inputs):
        min_val, max_val = self.output_range
        return inputs * (max_val - min_val) + min_val

This layer applies a linear transformation to scale inputs (assumed to be in [0, 1]) to the specified range. The OutputRangeLayer correctly handles gradient flow during backpropagation.
Gradients are automatically scaled proportionally to the output range.
No additional gradient scaling is needed in the loss calculation.
The layer doesn't introduce gradient vanishing or exploding issues.

Target values should be scaled to match the range of the corresponding OutputRangeLayer. Consistency between target scaling and network output range is crucial. Scaling targets to match the output range:
Improves learning efficiency
Ensures consistent loss calculation
Helps with gradient flow and optimization stability

Scaling inputs and targets generally improves neural network training:
Enhances gradient flow
Stabilizes loss calculation
Improves optimization stability
Increases compatibility with activation functions
Reduces numerical precision issues

Common scaling ranges like [0, 1] or [-1, 1] are often effective, but the specific range is less critical than consistency across the model. Scale targets to match the OutputRangeLayer range for each output.
Ensure consistency between target scaling and network output scaling.
Consider the nature of your data and problem when choosing scaling ranges.
Monitor training process to confirm effective learning, especially in early layers.
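The scaling consistency described above can be sketched numerically. Here is a minimal sketch in plain NumPy standing in for the Keras layer, showing the layer's affine map and its inverse, which moves targets between the raw range and the network's internal [0, 1] scale (the helper names are illustrative, not part of the original note):

```python
import numpy as np

def scale_to_range(x, output_range):
    # Same affine map as OutputRangeLayer: [0, 1] -> [min_val, max_val]
    min_val, max_val = output_range
    return x * (max_val - min_val) + min_val

def scale_targets(y, output_range):
    # Inverse map: raw targets in [min_val, max_val] -> [0, 1],
    # so loss is computed on quantities living on the same scale
    min_val, max_val = output_range
    return (y - min_val) / (max_val - min_val)

output_range = (10.0, 50.0)                 # range of one output head
raw_targets = np.array([10.0, 30.0, 50.0])
scaled = scale_targets(raw_targets, output_range)   # [0.0, 0.5, 1.0]
roundtrip = scale_to_range(scaled, output_range)    # back to raw targets
```

Applying the forward map to a [0, 1] activation and the inverse map to a raw target are two sides of the same consistency requirement.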
By following these principles, you can ensure effective training and performance of your multi-output neural network, regardless of the specific output ranges chosen for each branch.]]></description><link>online-vault/ml-concepts/architecture-design/multi-output-regression-neural-network.html</link><guid isPermaLink="false">Online Vault/ML concepts/Architecture Design/Multi-output Regression Neural Network.md</guid><pubDate>Wed, 25 Sep 2024 08:29:12 GMT</pubDate><enclosure url="." length="0" type="false"/><content:encoded>&lt;figure&gt;&lt;img src=&quot;.&quot;&gt;&lt;/figure&gt;</content:encoded></item><item><title><![CDATA[NN_archi_multi_output]]></title><description><![CDATA[<img src="online-vault/ml-concepts/architecture-design/nn_archi_multi_output.png" target="_self">]]></description><link>online-vault/ml-concepts/architecture-design/nn_archi_multi_output.html</link><guid isPermaLink="false">Online Vault/ML concepts/Architecture Design/NN_archi_multi_output.png</guid><pubDate>Wed, 25 Sep 2024 08:22:00 GMT</pubDate><enclosure url="." length="0" type="false"/><content:encoded>&lt;figure&gt;&lt;img src=&quot;.&quot;&gt;&lt;/figure&gt;</content:encoded></item><item><title><![CDATA[Beyond neural scaling laws (paper)]]></title><description><![CDATA[Original paper : <a data-tooltip-position="top" aria-label="https://proceedings.neurips.cc/paper_files/paper/2022/file/7b75da9b61eda40fa35453ee5d077df6-Paper-Conference.pdf" rel="noopener nofollow" class="external-link is-unresolved" href="https://proceedings.neurips.cc/paper_files/paper/2022/file/7b75da9b61eda40fa35453ee5d077df6-Paper-Conference.pdf" target="_self">Beyond neural scaling laws</a><br>
Scaling laws : <a data-tooltip-position="top" aria-label="https://arxiv.org/abs/2001.08361" rel="noopener nofollow" class="external-link is-unresolved" href="https://arxiv.org/abs/2001.08361" target="_self">Scaling Laws for Neural language models</a><br>
Video : <a data-tooltip-position="top" aria-label="https://www.youtube.com/watch?v=5eqRuVp65eY" rel="noopener nofollow" class="external-link is-unresolved" href="https://www.youtube.com/watch?v=5eqRuVp65eY" target="_self">Welch Labs on neural scaling laws</a>Neural Scaling Laws: These are empirical observations about how neural network performance changes as you scale various factors like model size, dataset size, or compute resources.<br>Power laws for scaling neural networks =&gt; More data is better, but loss follows a power law of the form L(N) ∝ N^(-ν), with N being the number of examples and ν a problem- and model-dependent exponent.<br>The <a data-tooltip-position="top" aria-label="https://en.wikipedia.org/wiki/Pareto_front" rel="noopener nofollow" class="external-link is-unresolved" href="https://en.wikipedia.org/wiki/Pareto_front" target="_self">Pareto frontier</a> is the theoretical limit of possible optimization in a multi-objective function. In other words, a Pareto front represents the set of optimal solutions where you can't improve one aspect without worsening another. In the context of neural networks, it often involves trade-offs between different performance metrics or resources.<br>The authors prune the data to remove redundant and less informative examples and improve on the scaling law. In theory, pruning the easy examples and keeping only the hard ones makes the model learn faster. They compute a metric beforehand to tell hard from easy examples and prune efficiently, using a pre-trained model to embed the image data and then clustering the embeddings with k-means.]]></description><link>online-vault/papers/beyond-neural-scaling-laws-(paper).html</link><guid isPermaLink="false">Online Vault/Papers/Beyond neural scaling laws (paper).md</guid><pubDate>Mon, 16 Sep 2024 11:39:43 GMT</pubDate></item><item><title><![CDATA[Object-Oriented Programming in Python, A Quick Guide]]></title><description><![CDATA[Welcome to this quick guide on Object-Oriented Programming (OOP) in Python! 
If you're coming from a Java background, you'll find many similarities, but there are also some key differences. Let's dive in!
<a class="internal-link" data-href="#classes-and-objects" href="#classes-and-objects" target="_self" rel="noopener nofollow">Classes and Objects</a>
<br><a class="internal-link" data-href="#attributes-and-methods" href="#attributes-and-methods" target="_self" rel="noopener nofollow">Attributes and Methods</a>
<br><a class="internal-link" data-href="#constructors" href="#constructors" target="_self" rel="noopener nofollow">Constructors</a>
<br><a class="internal-link" data-href="#inheritance" href="#inheritance" target="_self" rel="noopener nofollow">Inheritance</a>
<br><a class="internal-link" data-href="#polymorphism" href="#polymorphism" target="_self" rel="noopener nofollow">Polymorphism</a>
<br><a class="internal-link" data-href="#encapsulation" href="#encapsulation" target="_self" rel="noopener nofollow">Encapsulation</a>
<br><a class="internal-link" data-href="#special-methods" href="#special-methods" target="_self" rel="noopener nofollow">Special Methods</a>
In Python, you define a class using the class keyword. Here's a simple example:

class Dog:
    pass
In Python, the pass statement is a null operation — when it is executed, nothing happens. It is used as a placeholder in loops, functions, classes, or in places where your code will eventually go. It allows you to create a block of code that syntactically needs to be there but doesn't do anything yet.
To create an object (instance) of the class, you simply call the class name followed by parentheses:

my_dog = Dog()

Attributes and methods are defined within the class. Attributes are variables that belong to the class, while methods are functions that belong to the class.

class Dog:
    # Attribute
    species = "Canis familiaris"

    # Method
    def bark(self):
        print("Woof!")

To access attributes or methods, you use the dot notation:

my_dog = Dog()
print(my_dog.species) # Output: Canis familiaris
my_dog.bark() # Output: Woof!
In Python, the constructor method is called __init__. It's similar to the constructor in Java.

class Dog:
    def __init__(self, name, age):
        self.name = name
        self.age = age

To create an object with the constructor:

my_dog = Dog("Fido", 3)
print(my_dog.name) # Output: Fido
print(my_dog.age) # Output: 3
Python supports inheritance, allowing one class to inherit attributes and methods from another.

class Animal:
    def __init__(self, name):
        self.name = name

    def speak(self):
        pass

class Dog(Animal):
    def speak(self):
        print("Woof!")

To create an object of the subclass:

my_dog = Dog("Fido")
my_dog.speak() # Output: Woof!
Polymorphism allows methods to do different things depending on the object they act upon. In Python, this is achieved through method overriding.

class Cat(Animal):
    def speak(self):
        print("Meow!")

To see polymorphism in action:

my_cat = Cat("Whiskers")
my_cat.speak() # Output: Meow!
Python doesn't have strict access modifiers like Java, but you can use naming conventions to indicate private attributes and methods.

class Dog:
    def __init__(self, name, age):
        self._name = name   # Protected attribute
        self.__age = age    # Private attribute

    def get_age(self):
        return self.__age

To access the private attribute:

my_dog = Dog("Fido", 3)
print(my_dog.get_age()) # Output: 3
In Python, there are no strict access modifiers like in Java, but there are conventions to indicate the intended access level of attributes and methods.

Protected attributes are intended to be accessed only within the class and its subclasses. By convention, protected attributes are prefixed with a single underscore (_).

class MyClass:
    def __init__(self):
        self._protected_attribute = 42

    def _protected_method(self):
        print("This is a protected method.")

While Python does not enforce protection, the underscore is a signal to other developers that the attribute or method is intended to be protected.

Private attributes are intended to be accessed only within the class itself. By convention, private attributes are prefixed with double underscores (__).

class MyClass:
    def __init__(self):
        self.__private_attribute = 42

    def __private_method(self):
        print("This is a private method.")

Python performs name mangling for private attributes and methods, which means it internally changes the name to include the class name. This helps to avoid name collisions when implementing inheritance.

Python has special methods (also known as magic methods) that start and end with double underscores. These methods allow you to define how your objects behave in certain situations.

class Dog:
    def __init__(self, name, age):
        self.name = name
        self.age = age

    def __str__(self):
        return f"{self.name} is {self.age} years old."

To use the special method:

my_dog = Dog("Fido", 3)
print(my_dog) # Output: Fido is 3 years old.
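To make the name mangling mentioned above concrete, here is a small illustrative sketch (a variant of the Dog example, not part of the original guide):

```python
class Dog:
    def __init__(self, name, age):
        self._name = name   # protected by convention only
        self.__age = age    # name-mangled to _Dog__age

my_dog = Dog("Fido", 3)
# my_dog.__age would raise AttributeError: the attribute was
# renamed to _Dog__age by name mangling, so this still works:
print(my_dog._Dog__age)  # Output: 3
```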
That's it! You're now ready to start coding in Python using OOP principles. Happy coding! 🚀]]></description><link>online-vault/tutoriels/object-oriented-programming-in-python,-a-quick-guide.html</link><guid isPermaLink="false">Online Vault/Tutoriels/Object-Oriented Programming in Python, A Quick Guide.md</guid><pubDate>Tue, 10 Sep 2024 10:04:02 GMT</pubDate></item><item><title><![CDATA[Multioutput Regression]]></title><description><![CDATA[In machine learning we often encounter regression, these problems involve&nbsp;predicting a continuous target variable, such as house prices, or temperature. However, in many real-world scenarios, we need to predict not only single but many variables together, this is where we use multi-output regression.<img alt="multioutput_regerssor.png" src="online-vault/ml-concepts/architecture-design/multioutput_regerssor.png" target="_self">]]></description><link>online-vault/ml-concepts/architecture-design/multioutput-regression.html</link><guid isPermaLink="false">Online Vault/ML concepts/Architecture Design/Multioutput Regression.md</guid><pubDate>Fri, 06 Sep 2024 12:52:40 GMT</pubDate><enclosure url="." length="0" type="false"/><content:encoded>&lt;figure&gt;&lt;img src=&quot;.&quot;&gt;&lt;/figure&gt;</content:encoded></item><item><title><![CDATA[multioutput_regerssor]]></title><description><![CDATA[<img src="online-vault/ml-concepts/architecture-design/multioutput_regerssor.png" target="_self">]]></description><link>online-vault/ml-concepts/architecture-design/multioutput_regerssor.html</link><guid isPermaLink="false">Online Vault/ML concepts/Architecture Design/multioutput_regerssor.png</guid><pubDate>Fri, 06 Sep 2024 12:52:32 GMT</pubDate><enclosure url="." 
length="0" type="false"/><content:encoded>&lt;figure&gt;&lt;img src=&quot;.&quot;&gt;&lt;/figure&gt;</content:encoded></item><item><title><![CDATA[Latent spaces]]></title><description><![CDATA[<a rel="noopener nofollow" class="external-link is-unresolved" href="https://digital-garden-betheve-ff4539b5328d87e722420cea05c7e2905bd94833.gitpages.huma-num.fr/lib/media/visualization-of-the-deepsdf-latent-space-using-t-sne.mp4#" target="_self">https://digital-garden-betheve-ff4539b5328d87e722420cea05c7e2905bd94833.gitpages.huma-num.fr/lib/media/visualization-of-the-deepsdf-latent-space-using-t-sne.mp4#</a> What is a latent spaceSources :<br>
<a data-tooltip-position="top" aria-label="https://www.baeldung.com/cs/latent-vs-embedding-space" rel="noopener nofollow" class="external-link is-unresolved" href="https://www.baeldung.com/cs/latent-vs-embedding-space" target="_self">Latent and Embedding Space</a><br>
<a data-tooltip-position="top" aria-label="https://en.wikipedia.org/wiki/Latent_space" rel="noopener nofollow" class="external-link is-unresolved" href="https://en.wikipedia.org/wiki/Latent_space" target="_self">Latent Space (wikipedia)</a><br>
source : <a data-tooltip-position="top" aria-label="https://www.youtube.com/watch?v=C_XNdGGs6qM" rel="noopener nofollow" class="external-link is-unresolved" href="https://www.youtube.com/watch?v=C_XNdGGs6qM" target="_self">Visualization of the DeepSDF latent space using t-SNE</a><br>A latent space is a lower-dimensional space into which high-dimensional data is transformed. Projecting a vector or matrix into a latent space aims to capture the data's essential attributes or characteristics in fewer dimensions.<br>The simplest deep learning architecture using a latent space is that of <a data-href="Autoencoders" href="online-vault/ml-concepts/models/autoencoders.html" class="internal-link" target="_self" rel="noopener nofollow">Autoencoders</a>, which follow an encoder-decoder design. The latent space is the lowest-dimensional layer, in other words, the one with the fewest neurons.<br><img alt="AE_latent.png" src="online-vault/images/ae_latent.png" target="_self">This latent space embeds the high-dimensional input vectors into a vector space of compressed representations, also known as vector embeddings. To embed here means that the model learns to reduce the size of the data while retaining as much information as possible, similar to compression.<br>In other words, the back-and-forth encoding-decoding training converges towards a summarized version of the input data, which is no longer directly human-readable, but in which similar data points are close to each other and dissimilar ones are further apart.<br>For a more formal definition :<br>Latent Space
A&nbsp;latent space, also known as a&nbsp;latent feature space&nbsp;or&nbsp;embedding space, is an&nbsp;<a data-tooltip-position="top" aria-label="https://en.wikipedia.org/wiki/Embedding" rel="noopener nofollow" class="external-link is-unresolved" title="Embedding" href="https://en.wikipedia.org/wiki/Embedding" target="_self">embedding</a>&nbsp;of a set of items within a&nbsp;<a data-tooltip-position="top" aria-label="https://en.wikipedia.org/wiki/Manifold" rel="noopener nofollow" class="external-link is-unresolved" title="Manifold" href="https://en.wikipedia.org/wiki/Manifold" target="_self">manifold</a>&nbsp;in which items resembling each other are positioned closer to one another. Position within the latent space can be viewed as being defined by a set of&nbsp;<a data-tooltip-position="top" aria-label="https://en.wikipedia.org/wiki/Latent_variable" rel="noopener nofollow" class="external-link is-unresolved" title="Latent variable" href="https://en.wikipedia.org/wiki/Latent_variable" target="_self">latent variables</a>&nbsp;that emerge from the resemblances from the objects.
Summary
In short, a latent space is a more compact representation of the data.
<br>Bonus
The idea that high-dimensionality data can be compressed while retaining most of its information and why machine learning works so well is called <a data-href="The Manifold Hypothesis" href="online-vault/ml-concepts/topology/the-manifold-hypothesis.html" class="internal-link" target="_self" rel="noopener nofollow">The Manifold Hypothesis</a>
Visual Resources :<br>
<a data-tooltip-position="top" aria-label="https://www.youtube.com/watch?v=o_cAOa5fMhE" rel="noopener nofollow" class="external-link is-unresolved" href="https://www.youtube.com/watch?v=o_cAOa5fMhE" target="_self">Latent Space Visualisation: PCA, t-SNE, UMAP | Deep Learning Animated</a><br>
<a data-tooltip-position="top" aria-label="https://www.youtube.com/watch?v=sV2FOdGqlX0" rel="noopener nofollow" class="external-link is-unresolved" href="https://www.youtube.com/watch?v=sV2FOdGqlX0" target="_self">Variational Autoencoder (VAE) Latent Space Visualization</a><br>
<a data-tooltip-position="top" aria-label="https://www.youtube.com/watch?v=wvsE8jm1GzE" rel="noopener nofollow" class="external-link is-unresolved" href="https://www.youtube.com/watch?v=wvsE8jm1GzE" target="_self">Google : A.I. Experiments: Visualizing High-Dimensional Space</a><br>On the <a data-tooltip-position="top" aria-label="https://en.wikipedia.org/wiki/Fashion_MNIST" rel="noopener nofollow" class="external-link is-unresolved" href="https://en.wikipedia.org/wiki/Fashion_MNIST" target="_self">fashion MNIST Dataset</a>, we can visualize the latent space to understand some relationships between data, where similar data is clustered together. From this we can deduce some transformation vectors that go from flip-flop images to formal shoes images.<br>
<img alt="t-sne_clothing.png" src="online-vault/images/t-sne_clothing.png" target="_self"><br>Before <a data-href="Generative Adversarial Networks" href=".html" class="internal-link" target="_self" rel="noopener nofollow">Generative Adversarial Networks</a> or <a data-href="Diffusion Models" href="online-vault/ml-concepts/models/diffusion-models.html" class="internal-link" target="_self" rel="noopener nofollow">Diffusion Models</a>, one of the ways to generate synthetic data was to project latent-space vectors back to the input space using <a data-href="inverse transforms" href=".html" class="internal-link" target="_self" rel="noopener nofollow">inverse transforms</a>.
In other words, we can reverse-engineer the input by using inverse transforms to generate new data like such :<br>
<img alt="inverse_mapping_latent_space.png" src="online-vault/images/inverse_mapping_latent_space.png" target="_self"><br>
<a data-tooltip-position="top" aria-label="https://www.google.com/url?sa=i&amp;url=https%3A%2F%2Frepository.library.carleton.ca%2Fdownloads%2Fv979v4113&amp;psig=AOvVaw1S7BJ_t2PJkfSCNwcp_gFJ&amp;ust=1724850983343000&amp;source=images&amp;cd=vfe&amp;opi=89978449&amp;ved=2ahUKEwj_5r7ioJWIAxWUTKQEHbwhC-gQjhx6BAgAEBk" rel="noopener nofollow" class="external-link is-unresolved" href="https://www.google.com/url?sa=i&amp;url=https%3A%2F%2Frepository.library.carleton.ca%2Fdownloads%2Fv979v4113&amp;psig=AOvVaw1S7BJ_t2PJkfSCNwcp_gFJ&amp;ust=1724850983343000&amp;source=images&amp;cd=vfe&amp;opi=89978449&amp;ved=2ahUKEwj_5r7ioJWIAxWUTKQEHbwhC-gQjhx6BAgAEBk" target="_self">image source</a>For example, for image generation, we would take a sample point from the final latent space, and use the decoder part of the network to generate a totally new image that is within the bounds of the latent space. In other words, if we trained a model on cats and dogs pictures, we can only generate cats and dogs pictures, or something in-between a cat and a dog.Computing the in-between of two vectors or points in a given space, is called interpolation, therefore, in latent space, we have latent interpolation.An interpolation in the latent space between multiple vectors can yield the intermediate states to go from one state to another.This is what is illustrated at the beginning of this article with this video :
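The latent interpolation just described can be sketched as plain linear interpolation between two latent codes. A minimal sketch with hypothetical vectors; in practice these come from an encoder, and each interpolated point is fed through the decoder to produce the in-between images:

```python
import numpy as np

def interpolate(z_a, z_b, steps=5):
    # Linear (Euclidean) interpolation between two latent vectors
    ts = np.linspace(0.0, 1.0, steps)
    return np.array([(1 - t) * z_a + t * z_b for t in ts])

z_chair = np.array([0.0, 1.0])  # hypothetical latent code of a chair
z_table = np.array([1.0, 0.0])  # hypothetical latent code of a table
path = interpolate(z_chair, z_table, steps=3)
# path[1] is the midpoint between the two latent codes
```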
By navigating the latent space, we can go from the chair cluster to the table and couch clusters to see how to morph one thing into another.<br>There are many ways to do that, achieving different goals depending on how you compute the distance; in other words, on which metric is used.<br>Bonus : Optimal transport
Something that achieves similar results but with a different approach :<br>
<a data-href="Optimal Transport" href=".html" class="internal-link" target="_self" rel="noopener nofollow">Optimal Transport</a> is a sub-field of Machine Learning that can compute intermediary states between statistical distributions.<br>
It can also be applied to images or any signals if they are considered to be a sample from a statistical distribution, as is assumed in most cases. It does not rely on latent spaces though.
It uses <a data-href="Wasserstein Distance" href=".html" class="internal-link" target="_self" rel="noopener nofollow">Wasserstein Distance</a> instead of Euclidean distance to compute the intermediate states, also called barycenters. This is illustrated in the following figure.
<img alt="Three-dimensional shape interpolation. The four corner shapes are represented using normalized indicator functions on a 60 × 60 × 60 volumetric grid; barycenters of the distributions are computed using bilinear weights." src="online-vault/images/wasserstein_interpolation.png" target="_self">Three-dimensional shape interpolation. The four corner shapes are represented using normalized indicator functions on a 60 × 60 × 60 volumetric grid; barycenters of the distributions are computed using bilinear weights. Once our latent space is established, we can reverse-engineer the input data to produce variations of the input given a feature.For example, here, we take a non-smiling face image as input, and apply the "smiling" transformation to the encoded input vector corresponding to the input image, and we can reconstruct a smiling face image from the baseline !<br><img alt="smile_vector_latent.png" src="online-vault/images/smile_vector_latent.png" target="_self">We can go further and chain those operations to get specific results :<br>
<img alt="latent_space_arithmetic.gif" src="online-vault/images/latent_space_arithmetic.gif" target="_self">]]></description><link>online-vault/ml-concepts/latent-spaces.html</link><guid isPermaLink="false">Online Vault/ML concepts/Latent spaces.md</guid><pubDate>Tue, 03 Sep 2024 12:59:21 GMT</pubDate><enclosure url="." length="0" type="false"/><content:encoded>&lt;figure&gt;&lt;img src=&quot;.&quot;&gt;&lt;/figure&gt;</content:encoded></item><item><title><![CDATA[digital_garden]]></title><description><![CDATA[<img src="online-vault/images/digital_garden.jpg" target="_self">]]></description><link>online-vault/images/digital_garden.html</link><guid isPermaLink="false">Online Vault/Images/digital_garden.jpg</guid><pubDate>Tue, 03 Sep 2024 12:18:04 GMT</pubDate><enclosure url="." length="0" type="false"/><content:encoded>&lt;figure&gt;&lt;img src=&quot;.&quot;&gt;&lt;/figure&gt;</content:encoded></item><item><title><![CDATA[Attention]]></title><description><![CDATA[In image processing, attention modules mimic human visual attention by selectively focusing on informative parts of the image and suppressing irrelevant ones. Imagine looking at a photo with multiple objects: your attention might naturally shift towards the main subject, subconsciously filtering out background details. Attention modules aim to replicate this behavior, improving the efficiency and accuracy of learning in deep learning models.<br>Here's a breakdown of how they generally work:<br>1. Feature Extraction: The first step is similar to other deep learning methods: extracting features from the image. This might involve using convolutional layers to identify edges, textures, and other patterns.<br>2. Attention Calculation: Here, the attention module comes into play. 
It takes the extracted features and performs operations to generate an "attention map". This map assigns weights (often between 0 and 1) to different parts of the feature map, highlighting important regions and de-emphasizing others.<br>There are various ways to calculate this attention map:
Channel Attention: This method focuses on the importance of different feature channels (think of channels as capturing different aspects of the image, like color or texture). It analyzes the interdependencies between channels and emphasizes informative ones based on their relevance to the task.
Spatial Attention: This method focuses on the importance of different spatial locations within the feature map. It analyzes the relationships between pixels and emphasizes areas carrying relevant information.
3. Re-weighting features: The attention map is then used to modulate (adjust) the original features. This can be done by:
Multiplication: Each element in the feature map is multiplied by the corresponding value from the attention map. Higher values in the attention map result in a larger impact on the original feature.
Additive Attention: The attention map is added to the original feature map, effectively adding emphasis to specific features.
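As a minimal sketch of the multiplicative re-weighting described above, here is a squeeze-and-excitation-style channel attention in plain NumPy (the pooling and sigmoid-gating choices are illustrative assumptions, not a specific published module):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def channel_attention(features, w):
    # features: (H, W, C) feature map; w: (C, C) learned projection
    # 1. Squeeze: global average pool each channel to a single value
    pooled = features.mean(axis=(0, 1))   # shape (C,)
    # 2. Excite: project and squash to per-channel weights in (0, 1)
    weights = sigmoid(pooled @ w)         # shape (C,)
    # 3. Re-weight: multiply each channel by its attention weight
    return features * weights             # broadcasts over H and W

rng = np.random.default_rng(0)
features = rng.normal(size=(8, 8, 4))
w = rng.normal(size=(4, 4))
out = channel_attention(features, w)
```

Spatial attention follows the same multiply-and-re-weight pattern, but with an (H, W) map broadcast over channels instead of a (C,) vector broadcast over positions.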
Benefits of Attention Modules:
Improved performance: By focusing on informative parts of the image, the model can learn more effectively and achieve higher accuracy in tasks like image classification, segmentation, and object detection.
Interpretability: Attention maps can be visualized, providing insights into what parts of the image the model is focusing on for its predictions.
Overall, attention modules are a powerful tool in image processing, mimicking selective human attention and enhancing the learning process in deep learning models.<br>While attention mechanisms in both NLP and image processing share the core principle of focusing on relevant parts of the input, they differ in the nature of the data they handle and in how they compute attention.<br>Nature of the Data:
NLP: Deals with sequential data like sentences. Each word carries meaning, and its relationship with surrounding words is crucial for understanding the overall context.
Image Processing: Deals with spatial data like images. Here, pixels are arranged in a grid, and their spatial relationships (e.g., proximity, forming edges) are crucial for understanding the content.
Attention Calculation:
NLP: Often uses self-attention, which compares each element in the sequence with all other elements. This helps the model understand the relationships between words, regardless of their distance in the sequence. For example, in the sentence "The quick brown fox jumps over the lazy dog," the model might use self-attention to understand the connection between "quick" and "fox" even though they are not adjacent.
Image Processing: Primarily focuses on spatial attention and channel attention. Spatial attention considers the relationship between pixels in the image, while channel attention considers the importance of different feature channels extracted from the image. For example, in an image of a cat, spatial attention might focus on areas with edges and textures to identify the cat's shape, while channel attention might emphasize channels capturing color and texture to distinguish the cat from the background.<br>Here's a table summarizing the key differences:<br>Data: sequential (sentences and tokens) for NLP vs. spatial (pixel grids) for image processing.<br>Attention type: self-attention over all sequence positions for NLP vs. spatial and channel attention for image processing.<br>What it captures: long-range dependencies between words for NLP vs. relationships between pixels and feature channels for image processing.<br>Overall, the differences stem from the fundamentally different nature of the data being processed. NLP models leverage self-attention to capture long-range dependencies in sequences, whereas image processing models utilize spatial and channel attention to understand the intricate relationships between pixels and feature channels in an image.]]></description><link>online-vault/ml-concepts/attention.html</link><guid isPermaLink="false">Online Vault/ML concepts/Attention.md</guid><pubDate>Tue, 03 Sep 2024 09:26:26 GMT</pubDate></item><item><title><![CDATA[Clustering]]></title><description><![CDATA[Clustering is a type of unsupervised learning algorithm that aims to divide a dataset of unlabeled data points into groups of similar points. It is a powerful tool for uncovering hidden patterns and relationships in data without relying on any pre-existing labels or categories. It utilizes the inherent structure and relationships within the data to generate groupings based on similarity.<br>Clustering stands in contrast to supervised learning approaches, which require labeled data to train a model to predict a target variable. Supervised algorithms learn from labeled examples and establish a relationship between the input features and the desired output. 
In contrast, clustering algorithms operate on unlabeled data and aim to identify patterns and groupings without the guidance of a predefined target variable.<br>Example 1: Customer Segmentation with Spatial Data<br>Consider a retail company with customer data that includes their home addresses, purchase history, and demographics. Using clustering, the company can identify customer segments based on their geographic proximity, purchasing habits, and demographic characteristics. This information can be used to tailor marketing campaigns, optimize store locations, and improve customer service.<br>Example 2: Species Identification with Ecological Data<br>In ecology, clustering can be used to group species based on their physical characteristics, habitat preferences, and genetic makeup. This information can be used to study biodiversity, understand ecological relationships, and identify endangered species.<br><a data-tooltip-position="top" aria-label="https://towardsdatascience.com/creating-animation-to-show-4-centroid-based-clustering-algorithms-using-python-and-sklearn-d397ade89cb3" rel="noopener nofollow" class="external-link is-unresolved" href="https://towardsdatascience.com/creating-animation-to-show-4-centroid-based-clustering-algorithms-using-python-and-sklearn-d397ade89cb3" target="_self">Centroid-based algorithms like K-Means</a> : <br><img alt="k_means_animation.gif" src="online-vault/images/k_means_animation.gif" target="_self"><br>
Hierarchical clustering (<a data-tooltip-position="top" aria-label="https://dashee87.github.io/data%20science/general/Clustering-with-Scikit-with-GIFs/" rel="noopener nofollow" class="external-link is-unresolved" href="https://dashee87.github.io/data%20science/general/Clustering-with-Scikit-with-GIFs/" target="_self">source</a>) :<br>
<img alt="hierarch.gif" src="online-vault/images/hierarch.gif" target="_self"><br>
<a data-tooltip-position="top" aria-label="https://en.wikipedia.org/wiki/DBSCAN" rel="noopener nofollow" class="external-link is-unresolved" href="https://en.wikipedia.org/wiki/DBSCAN" target="_self">Density-based spatial clustering of applications with noise</a> animation (<a data-tooltip-position="top" aria-label="https://dashee87.github.io/data%20science/general/Clustering-with-Scikit-with-GIFs/" rel="noopener nofollow" class="external-link is-unresolved" href="https://dashee87.github.io/data%20science/general/Clustering-with-Scikit-with-GIFs/" target="_self">source</a>) :<br>
<img alt="DBSCAN_tutorial.gif" src="online-vault/images/dbscan_tutorial.gif" target="_self"><br><a data-href="UMAP" href="online-vault/ml-concepts/dimensionality-reduction/umap.html" class="internal-link" target="_self" rel="noopener nofollow">UMAP</a> is not a clustering algorithm per se, but it can still be used for clustering in visualization contexts, as it indirectly generates clusters through a proximity-graph-based approach similar to k-means.<br>t-SNE also displays clusters but is less reliable as a clustering method, as it is built not for clustering but for visualizing high-dimensional data.<br>Fuzzy clustering is a type of unsupervised machine learning algorithm that partitions data points into clusters. Unlike traditional clustering algorithms, which assign each data point to a single cluster, fuzzy clustering allows each data point to belong to multiple clusters with varying degrees of membership. This makes it a more flexible and versatile approach for handling complex data sets with overlapping clusters.<br>Items in clusters should be as similar as possible to each other and as dissimilar as possible to items in other groups. Computationally,&nbsp;it's much easier to create fuzzy boundaries than it is to settle on one cluster for one point. 
Fuzzy clustering uses&nbsp;<a data-tooltip-position="top" aria-label="https://www.statisticshowto.com/probability-and-statistics/statistics-definitions/least-squares-regression-line/" rel="noopener nofollow" class="external-link is-unresolved" href="https://www.statisticshowto.com/probability-and-statistics/statistics-definitions/least-squares-regression-line/" target="_self">least-squares</a>&nbsp;solutions to find the optimal location for any data point.&nbsp;This optimal location may be in a&nbsp;<a data-tooltip-position="top" aria-label="https://www.statisticshowto.com/probability-space/" rel="noopener nofollow" class="external-link is-unresolved" href="https://www.statisticshowto.com/probability-space/" target="_self">probability space</a>&nbsp;between two (or more) clusters. <br><img alt="fuzzy_clustering.png" src="online-vault/images/fuzzy_clustering.png" target="_self">
An example of fuzzy clustering, where the middle point can belong to either group A or B.<br>Key Characteristics of Fuzzy Clustering
Degrees of Membership: Each data point is assigned a membership value for each cluster, indicating the strength of its association with that cluster. These membership values can range from 0 to 1, where 0 represents no membership and 1 represents full membership.
Soft Boundaries: Fuzzy clustering does not have distinct boundaries between clusters. Instead, there are gradual transitions between membership values, allowing data points to belong to multiple clusters simultaneously.
Flexibility: Fuzzy clustering is well-suited for data sets with overlapping or arbitrarily shaped clusters. It can capture the nuances of complex data relationships more effectively than traditional clustering methods.
Applications of Fuzzy Clustering
Image Segmentation: Fuzzy clustering is used to partition images into regions with similar characteristics, such as color, texture, or intensity. It is often employed in image processing and analysis tasks.
Customer Segmentation: Fuzzy clustering can be used to group customers based on their preferences, demographics, or purchase behavior. This information can be valuable for targeted marketing campaigns and personalized customer experiences.
Pattern Recognition: Fuzzy clustering is applied in pattern recognition tasks to classify objects or data points based on their features. It can handle ambiguous or overlapping patterns more effectively than traditional methods.
Example of Fuzzy Clustering<br>Consider a data set representing students' scores in three subjects: mathematics, science, and English. A traditional clustering algorithm must assign each student to a single category, such as "strong math, weak science, average English." Fuzzy clustering instead allows partial membership: a student can belong strongly to the "strong in math" cluster while also belonging, with a lower degree of membership, to the "strong in science" cluster. This more flexible representation captures the fact that students may possess varying strengths in different subjects.<br>Popular Fuzzy Clustering Algorithms
Fuzzy C-Means (FCM): FCM is a widely used fuzzy clustering algorithm that minimizes the within-cluster variance to find optimal cluster centers and membership values.
Fuzzy K-Means: Fuzzy K-Means is a variant of FCM that allows for a predefined number of clusters (K). It aims to minimize a fuzzy objective function that considers both cluster compactness and data points' overall membership distribution.
<br>Fuzzy ARTMAP: Fuzzy ARTMAP is a <a data-href="self-organizing neural network" href=".html" class="internal-link" target="_self" rel="noopener nofollow">self-organizing neural network</a> that combines fuzzy logic and <a data-href="reinforcement learning" href=".html" class="internal-link" target="_self" rel="noopener nofollow">reinforcement learning</a> for adaptive clustering. It continuously learns and adapts to new data, making it suitable for dynamic environments.
<br>Gustafson-Kessel (GK) algorithm: associates each cluster with both a center and a covariance&nbsp;<a data-tooltip-position="top" aria-label="https://www.statisticshowto.com/matrices-and-matrix-algebra/" rel="noopener nofollow" class="external-link is-unresolved" href="https://www.statisticshowto.com/matrices-and-matrix-algebra/" target="_self">matrix</a>. While fuzzy C-means assumes the clusters are spherical, GK can fit elliptical-shaped clusters.
Gath-Geva algorithm&nbsp;(also called Gaussian Mixture Decomposition): similar to FCM, but clusters can have&nbsp;any&nbsp;shape.
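The membership-update loop at the heart of fuzzy C-means can be sketched in plain NumPy. This is a minimal illustration under simplifying assumptions (random initialization, fixed iteration count, toy data invented here), not a production implementation:

```python
import numpy as np

def fuzzy_c_means(X, c=2, m=2.0, n_iter=100, seed=0):
    """Minimal fuzzy C-means: returns cluster centers and the
    membership matrix U, where U[i, k] is point i's degree of
    membership in cluster k (rows sum to 1)."""
    rng = np.random.default_rng(seed)
    n = X.shape[0]
    U = rng.random((n, c))
    U /= U.sum(axis=1, keepdims=True)          # memberships sum to 1 per point
    for _ in range(n_iter):
        Um = U ** m                            # fuzzified memberships
        centers = (Um.T @ X) / Um.sum(axis=0)[:, None]
        # distance of every point to every center
        d = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2) + 1e-10
        # standard FCM update: u ~ d^(-2/(m-1)), then normalize rows
        U = 1.0 / (d ** (2.0 / (m - 1.0)))
        U /= U.sum(axis=1, keepdims=True)
    return centers, U

# two tight blobs plus one point roughly midway between them
X = np.array([[0.0, 0.0], [0.1, 0.2], [5.0, 5.0], [5.1, 4.9], [2.5, 2.5]])
centers, U = fuzzy_c_means(X, c=2)
```

The midway point ends up with roughly 0.5 membership in each cluster, while the blob points get near-full membership in their own cluster, which is the "soft boundary" behavior described above.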
Fuzzy clustering has proven to be a valuable tool in various machine learning applications due to its ability to handle overlapping clusters and provide more flexible representations of data relationships. Its flexibility and versatility make it a suitable choice for complex data analysis tasks, particularly in areas like image processing, pattern recognition, and customer segmentation.]]></description><link>online-vault/ml-concepts/clustering.html</link><guid isPermaLink="false">Online Vault/ML concepts/Clustering.md</guid><pubDate>Tue, 03 Sep 2024 09:25:51 GMT</pubDate><enclosure url="." length="0" type="false"/><content:encoded>&lt;figure&gt;&lt;img src=&quot;.&quot;&gt;&lt;/figure&gt;</content:encoded></item><item><title><![CDATA[Machine Learning]]></title><description><![CDATA[The term "machine" in machine learning refers to the computational system or algorithm that is capable of learning from data and improving its performance over time. This distinction between "learning" and "machines" is crucial to understanding the core concept of machine learning.The Origin of "Machine" in Machine LearningThe term "machine learning" was coined by Arthur Samuel, an IBM researcher, in 1959. He used it to describe his work on a checkers program that could improve its performance by analyzing its own games and learning from its mistakes.Prior to Samuel's use of the term, there were already a number of researchers working on similar concepts. However, they often used different terminology, such as "learning machines" or "self-teaching computers." Samuel's contribution was to coin the term "machine learning" and popularize its use.What Does "Machine" Mean in Machine Learning?In the context of machine learning, the term "machine" does not refer to a specific physical device or hardware platform. Instead, it refers to the computational system or algorithm that is responsible for learning from data. 
This system could be implemented on a variety of hardware platforms, including computers, laptops, smartphones, or even specialized hardware accelerators.The key property that distinguishes a "machine" in machine learning is its ability to learn from data. This means that the machine can extract patterns and insights from data that can be used to improve its performance on a given task. For example, a machine learning model for image classification can learn to identify different objects by analyzing a large dataset of labeled images.The ability to learn from data is what makes machine learning so powerful and versatile. It allows machines to adapt to new situations and environments, without the need for explicit programming. This makes machine learning a promising tool for a wide range of applications, including:
Predictive modeling: Machine learning can be used to predict future events or outcomes based on historical data. For example, it can be used to predict customer behavior, financial trends, or natural disasters.<br>Pattern recognition: Machine learning can be used to identify patterns and anomalies in data. For example, it can be used to detect fraud, spam, or disease outbreaks.<br>Optimization: Machine learning can be used to optimize systems by finding the best possible solutions to complex problems. For example, it can be used to optimize the placement of servers in a data center, or the design of a drug molecule.<br>The term "machine learning" is a powerful and evocative one, and it accurately captures the essence of this field of computer science. Machine learning algorithms are like little machines that can learn and improve over time, just like humans do. This ability to learn makes machine learning a powerful tool for solving a wide range of problems in a variety of domains.]]></description><link>online-vault/ml-concepts/machine-learning.html</link><guid isPermaLink="false">Online Vault/ML concepts/Machine Learning.md</guid><pubDate>Tue, 03 Sep 2024 09:24:30 GMT</pubDate></item><item><title><![CDATA[decision tree]]></title><description><![CDATA[Definition: Decision trees are supervised learning algorithms that use a tree-like structure to classify data or make predictions. Each node represents a feature in the data, and each branch represents a possible value of that feature. By asking a series of binary questions at each node, the tree guides new data instances to a "leaf" node containing the predicted outcome.Main Ideas:
Splitting: Decision trees recursively split the data based on the feature that best separates the target variable. This is often done using measures like Gini impurity or information gain.
Leaf Nodes: Each leaf node represents a final prediction or classification for a specific combination of feature values.
Pruning: To avoid overfitting, branches with low predictive power can be pruned, simplifying the tree.
Pros:
Interpretability: Easy to understand the logic behind predictions due to the clear decision hierarchy.
No feature scaling: Does not require complex data preprocessing for numerical features.
Handles diverse data types: Can work with both categorical and numerical data.
Cons:
Prone to overfitting: Can become too complex and lose accuracy on unseen data.
Sensitive to missing values: Imputation or alternative handling strategies are needed.
May not capture complex relationships: Not always suitable for highly non-linear problems.
Related Popular Algorithms:
<a data-href="Random forest" href="online-vault/ml-concepts/models/random-forest.html" class="internal-link" target="_self" rel="noopener nofollow">Random forest</a>: Combines multiple decision trees by randomly sampling features and data points during training, leading to improved accuracy and robustness.
<br><a data-href="Gradient Boosting" href="online-vault/ml-concepts/models/gradient-boosting.html" class="internal-link" target="_self" rel="noopener nofollow">Gradient Boosting</a>: Builds an ensemble of trees sequentially, focusing on correcting the errors of previous trees in the ensemble.
<br><a data-href="XGBoost" href="online-vault/ml-concepts/models/xgboost.html" class="internal-link" target="_self" rel="noopener nofollow">XGBoost</a>: An optimized implementation of gradient boosting known for its speed and efficiency.
Additional Notes:
Decision trees are powerful tools for initial exploration and understanding of data.
Combining decision trees with other algorithms can leverage their strengths while mitigating their weaknesses.
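The splitting criterion described above can be made concrete with a small, self-contained sketch in pure Python (toy feature values and labels invented here for illustration): it computes Gini impurity and picks the threshold that minimizes the weighted impurity of the two child nodes.

```python
from collections import Counter

def gini(labels):
    """Gini impurity: 1 minus the sum of squared class proportions.
    0 means a pure node; higher means more class mixing."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def best_split(values, labels):
    """Find the threshold on one numeric feature that minimizes the
    weighted Gini impurity of the two resulting child nodes."""
    best = (None, float("inf"))
    for t in sorted(set(values))[:-1]:
        left = [l for v, l in zip(values, labels) if v <= t]
        right = [l for v, l in zip(values, labels) if v > t]
        score = (len(left) * gini(left) + len(right) * gini(right)) / len(labels)
        if score < best[1]:
            best = (t, score)
    return best

# a feature that separates the two classes perfectly at value 3
values = [1, 2, 3, 7, 8, 9]
labels = ["a", "a", "a", "b", "b", "b"]
threshold, impurity = best_split(values, labels)  # threshold 3, impurity 0.0
```

A real decision tree repeats this search over every feature at every node, recursing until the leaves are pure enough or a depth limit is hit.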
]]></description><link>online-vault/ml-concepts/models/decision-tree.html</link><guid isPermaLink="false">Online Vault/ML concepts/Models/decision tree.md</guid><pubDate>Tue, 03 Sep 2024 09:21:57 GMT</pubDate></item><item><title><![CDATA[Data Analysis]]></title><description><![CDATA[Data analysis is the core of data science, leveraging computer science and statistics to extract, interpret, and understand information from data. This ranges from data visualization to <a data-href="Machine Learning" href="online-vault/ml-concepts/machine-learning.html" class="internal-link" target="_self" rel="noopener nofollow">Machine Learning</a>, <a data-href="deep learning" href="online-vault/ml-concepts/deep-learning.html" class="internal-link" target="_self" rel="noopener nofollow">deep learning</a>, <a data-href="Graph Theory" href="online-vault/ml-concepts/graph-theory.html" class="internal-link" target="_self" rel="noopener nofollow">Graph Theory</a>, <a data-href="Spatial Data Science" href="online-vault/spatial-data-science/spatial-data-science.html" class="internal-link" target="_self" rel="noopener nofollow">Spatial Data Science</a> .Tools and Techniques:
Python libraries:&nbsp;Powerful tools like Pandas, NumPy, Matplotlib, and Seaborn are commonly used for data manipulation, analysis, and visualization.
R packages:&nbsp;R offers numerous packages for statistical analysis, data visualization, and machine learning.
Common Challenges:
Big data:&nbsp;Dealing with large and complex datasets requires efficient tools and techniques.
Data quality:&nbsp;Ensuring data is accurate, complete, and relevant for analysis is crucial.
Model interpretation:&nbsp;Understanding how models work and explaining their results effectively.
<br>Before analysis, it's crucial to clean and <a data-tooltip-position="top" aria-label="Data preprocessing" data-href="Data preprocessing" href="online-vault/ml-concepts/data-analysis/data-preprocessing.html" class="internal-link" target="_self" rel="noopener nofollow">preprocess data</a>, handling missing values, outliers, and transforming data to ensure its quality and relevance for analysis. Ethical Considerations:
As data scientists, we have a responsibility to use data ethically, considering: Privacy concerns:&nbsp;Protecting sensitive information and respecting user data rights.
Bias in data:&nbsp;Understanding and mitigating potential biases present in datasets.<br>Continuous data refers to measurements that can take any value within a given range. Examples include height, weight, and temperature.<br>Discrete data consists of distinct values, often integers. Examples include the number of students in a class or the number of items sold.<br>Linear correlation measures the degree to which two variables move in relation to each other. The <a data-href="Pearson correlation" href=".html" class="internal-link" target="_self" rel="noopener nofollow">Pearson correlation</a> coefficient is a common measure of linear correlation.<br>Non-linear correlation is a relationship between two variables that is not linear. Techniques such as <a data-href="Distance Correlation" href=".html" class="internal-link" target="_self" rel="noopener nofollow">Distance Correlation</a>, <a data-href="Maximal Information Coefficient" href=".html" class="internal-link" target="_self" rel="noopener nofollow">Maximal Information Coefficient</a> (MIC), and <a data-href="Kullback-Leibler Divergence" href=".html" class="internal-link" target="_self" rel="noopener nofollow">Kullback-Leibler Divergence</a> (KL) can be used to measure non-linear correlations.<br>Statistical analysis involves using statistical techniques to analyze data. This includes descriptive statistics, inferential statistics, and hypothesis testing. Descriptive statistics summarize and describe the data, while inferential statistics make inferences about populations from samples.
Descriptive Statistics:&nbsp;Summarizes key features of data (e.g., mean, median, standard deviation).
Inferential Statistics:&nbsp;Allows drawing conclusions about a population based on a sample (e.g., hypothesis testing).
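The descriptive statistics and Pearson correlation mentioned above can be illustrated with only the standard library (the height/weight sample below is invented for illustration):

```python
import statistics as st

heights = [160.0, 165.0, 170.0, 175.0, 180.0]   # continuous data (cm)
weights = [55.0, 60.0, 66.0, 72.0, 80.0]        # continuous data (kg)

# descriptive statistics: summarize the sample itself
mean_h = st.mean(heights)       # 170.0
median_h = st.median(heights)   # 170.0
sd_h = st.stdev(heights)        # sample standard deviation

def pearson(xs, ys):
    """Pearson correlation: sample covariance normalized by the
    product of both sample standard deviations, in [-1, 1]."""
    mx, my = st.mean(xs), st.mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / (len(xs) - 1)
    return cov / (st.stdev(xs) * st.stdev(ys))

r = pearson(heights, weights)   # close to 1: strong positive linear relation
```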
<br>Machine learning is a subset of artificial intelligence that involves the use of algorithms to learn from data and make predictions or decisions without being explicitly programmed. It includes <a data-href="supervised learning" href=".html" class="internal-link" target="_self" rel="noopener nofollow">supervised learning</a>, <a data-href="unsupervised learning" href="online-vault/ml-concepts/unsupervised-learning.html" class="internal-link" target="_self" rel="noopener nofollow">unsupervised learning</a>, and <a data-href="reinforcement learning" href=".html" class="internal-link" target="_self" rel="noopener nofollow">reinforcement learning</a>. Machine learning models can be used for <a data-href="regression" href=".html" class="internal-link" target="_self" rel="noopener nofollow">regression</a>, <a data-href="classification" href=".html" class="internal-link" target="_self" rel="noopener nofollow">classification</a>, <a data-href="Clustering" href="online-vault/ml-concepts/clustering.html" class="internal-link" target="_self" rel="noopener nofollow">Clustering</a>, and anomaly detection among other tasks.
Machine learning:&nbsp;Algorithms that learn from data to make predictions or decisions.
Supervised Learning:&nbsp;Trains models to learn from labeled data (e.g., predicting customer churn).
Unsupervised Learning:&nbsp;Discovers patterns and relationships in unlabeled data (e.g., grouping customers based on behavior).
]]></description><link>online-vault/ml-concepts/data-analysis/data-analysis.html</link><guid isPermaLink="false">Online Vault/ML concepts/Data analysis/Data Analysis.md</guid><pubDate>Tue, 03 Sep 2024 09:19:25 GMT</pubDate></item><item><title><![CDATA[Decision Tree vs Hierarchical clustering]]></title><description><![CDATA[Hierarchical <a data-href="Clustering" href="online-vault/ml-concepts/clustering.html" class="internal-link" target="_self" rel="noopener nofollow">Clustering</a> and <a data-href="decision trees" href=".html" class="internal-link" target="_self" rel="noopener nofollow">decision trees</a> are both learning algorithms that aim to uncover patterns and groupings within a dataset. Clustering is unsupervised and decision trees are supervised. However, they differ in their fundamental goals, approaches, and outcomes.
Goal: Hierarchical clustering seeks to group data points into a nested hierarchy of clusters based on their similarity. The goal is to discover the underlying structure of the data without any preconceived notions about the number or nature of clusters. <br>Approach: Hierarchical clustering algorithms iteratively merge or split data points based on their similarity measures, such as Euclidean distance or Jaccard index. This iterative process results in a <a data-href="dendrogram" href="online-vault/ml-concepts/models/dendrogram.html" class="internal-link" target="_self" rel="noopener nofollow">dendrogram</a>, a tree-like representation of the hierarchical relationships between clusters. <br>Prior Knowledge of Number of Clusters: Hierarchical clustering does not require prior knowledge of the number of clusters in the data. The <a data-href="dendrogram" href="online-vault/ml-concepts/models/dendrogram.html" class="internal-link" target="_self" rel="noopener nofollow">dendrogram</a> provides a visual representation of the cluster hierarchy, allowing analysts to identify the appropriate number of clusters based on their understanding of the data and the desired level of granularity. <br>Outcome: Hierarchical clustering produces a hierarchical structure of clusters, where each level in the hierarchy represents a progressively finer level of granularity. The <a data-href="dendrogram" href="online-vault/ml-concepts/models/dendrogram.html" class="internal-link" target="_self" rel="noopener nofollow">dendrogram</a> provides a visual representation of the cluster relationships and their evolutionary process.<br>
<img alt="dendrogram_example_1.png" src="online-vault/images/dendrogram_example_1.png" target="_self"><br>
A dendrogram (right) representing nested clusters (left). (<a rel="noopener nofollow" class="external-link is-unresolved" href="https://www.statisticshowto.com/hierarchical-clustering/" target="_self">https://www.statisticshowto.com/hierarchical-clustering/</a>) Decision Trees, on the other hand, are supervised learning algorithms that use a tree-like structure to classify data points into predefined categories or classes. They make a series of decisions based on the values of certain features, ultimately reaching a leaf node that represents the predicted class label. Goal: Decision trees are primarily used for classification tasks, where the objective is to assign new data points to one of a known set of categories. They learn from labeled data, where each data point is associated with a class label. Approach: Decision trees recursively partition the data based on feature values, following a top-down approach. At each node, the algorithm selects the feature that best separates the data according to a chosen criterion, such as information gain or Gini impurity. This process continues until all data points belong to leaf nodes, each representing a distinct class. Prior Knowledge of Number of Classes: Decision trees require prior knowledge of the number of classes in the data. This information is essential for the decision-making process at each node. Outcome: Decision trees produce a tree-like structure with labeled leaf nodes, representing the predicted class labels for each data point. The tree can be used to classify new data points by traversing the tree and following the appropriate decision paths based on their feature values. <br><img alt="example_decision_tree.png" src="online-vault/images/example_decision_tree.png" target="_self">Hierarchical clustering excels at exploratory data analysis, revealing the hidden structure and relationships within a dataset. 
It is particularly useful when the number of clusters is unknown or when the data is highly complex and intricate.Decision trees shine in classification tasks, accurately assigning data points to predefined categories. They are widely used in various domains, including fraud detection, medical diagnosis, and recommender systems.In summary, hierarchical clustering and decision trees are both powerful tools for analyzing and understanding data. The choice between the two depends on the specific task at hand. For exploratory data analysis and uncovering hidden patterns, hierarchical clustering is a preferred choice. For classification tasks and assigning new data points to known categories, decision trees are the better option.]]></description><link>online-vault/ml-concepts/decision-tree-vs-hierarchical-clustering.html</link><guid isPermaLink="false">Online Vault/ML concepts/Decision Tree vs Hierarchical clustering.md</guid><pubDate>Tue, 03 Sep 2024 09:19:25 GMT</pubDate><enclosure url="." length="0" type="false"/><content:encoded>&lt;figure&gt;&lt;img src=&quot;.&quot;&gt;&lt;/figure&gt;</content:encoded></item><item><title><![CDATA[UMAP]]></title><description><![CDATA[Original source : <a rel="noopener nofollow" class="external-link is-unresolved" href="https://github.com/lmcinnes/umap" target="_self">https://github.com/lmcinnes/umap</a> UMAP stands for Uniform Manifold Approximation and Projection. It is a dimensionality reduction technique that aims to preserve distance in high dimensional space unto the lower dimension projected space, in general 2D for visualisation.<br>It is similar to <a data-href="T-SNE" href="online-vault/ml-concepts/dimensionality-reduction/t-sne.html" class="internal-link" target="_self" rel="noopener nofollow">T-SNE</a>, so it is a non linear method for dimension reduction, it aims to represent high-dimensional data in a lower-dimensional space while preserving both local and global structure. 
However, UMAP utilizes a different mathematical approach than t-SNE, which can lead to different trade-offs and results.<br>UMAP is based on the concept of constructing a fuzzy topological representation of the high-dimensional data and then optimizing the low-dimensional representation to be as close as possible to this fuzzy topological structure. It leverages ideas from manifold learning, <a data-href="Graph Theory" href="online-vault/ml-concepts/graph-theory.html" class="internal-link" target="_self" rel="noopener nofollow">Graph Theory</a>, and Riemannian geometry. In particular, it uses a <a data-href="Riemannian manifold" href="online-vault/ml-concepts/topology/riemannian-manifold.html" class="internal-link" target="_self" rel="noopener nofollow">Riemannian manifold</a> as a hypothetical underlying structure.<br><img alt="Umap.png" src="online-vault/images/umap.png" target="_self">Math details: Uniform Manifold Approximation and Projection (UMAP) is a dimension reduction technique that can be used for visualisation similarly to t-SNE, but also for general non-linear dimension reduction. The algorithm is founded on three assumptions about the data:<br>1. The data is uniformly distributed on a Riemannian manifold;
2. The Riemannian metric is locally constant (or can be approximated as such);
3. The manifold is locally connected.
From these assumptions it is possible to model the manifold with a fuzzy topological structure. The embedding is found by searching for a low dimensional projection of the data that has the closest possible equivalent fuzzy topological structure.<br>To use UMAP in Python, we have two options:
<br>Using the CPU only with <a data-tooltip-position="top" aria-label="https://umap-learn.readthedocs.io/en/latest/api.html" rel="noopener nofollow" class="external-link is-unresolved" href="https://umap-learn.readthedocs.io/en/latest/api.html" target="_self">umap-learn</a> , the original library with all the API details and examples.
<br>The <a data-tooltip-position="top" aria-label="https://docs.rapids.ai/api/cuml/stable/api/#dimensionality-reduction-and-manifold-learning" rel="noopener nofollow" class="external-link is-unresolved" href="https://docs.rapids.ai/api/cuml/stable/api/#dimensionality-reduction-and-manifold-learning" target="_self">Nvidia RapidsAI GPU implementation </a> using GPU only (faster if we have enough memory).
]]></description><link>online-vault/ml-concepts/dimensionality-reduction/umap.html</link><guid isPermaLink="false">Online Vault/ML concepts/Dimensionality Reduction/UMAP.md</guid><pubDate>Tue, 03 Sep 2024 09:18:35 GMT</pubDate><enclosure url="." length="0" type="false"/><content:encoded>&lt;figure&gt;&lt;img src=&quot;.&quot;&gt;&lt;/figure&gt;</content:encoded></item><item><title><![CDATA[Residual Networks]]></title><description><![CDATA[Visual explanation :
<a data-tooltip-position="top" aria-label="https://www.youtube.com/watch?v=o_3mboe1jYI" rel="noopener nofollow" class="external-link is-unresolved" href="https://www.youtube.com/watch?v=o_3mboe1jYI" target="_self">ResNet (actually) explained in under 10 minutes</a><br><img alt="residuals_image.png" src="online-vault/images/residuals_image.png" target="_self">
Here we have a toy example of image super-resolution: the ResNet only needs to learn the added detail, rather than retain the entire input signal.<br>Residual connections, also known as skip connections, address the problems of vanishing gradients and degradation in very deep neural networks.<br>Vanishing gradient problem
The vanishing gradient problem, in simple terms, is when the neural network struggles to learn and update its parameters, especially in the earlier layers, due to extremely small gradients. Basic concept: Neural networks learn by adjusting their weights based on the error gradient.
This gradient is calculated through backpropagation, flowing backwards from the output layer to the input layer. The problem: In deep networks (those with many layers), as the gradient flows backward, it can become extremely small.
By the time it reaches the earlier layers, it's often too tiny to cause any significant updates. As networks become deeper, they can struggle to learn identity mappings and may paradoxically perform worse than shallower networks. Residuals allow the network to learn residual functions with reference to the layer inputs, rather than learning unreferenced functions.The term "unreferenced functions" refers to the traditional approach where the network tries to learn the entire mapping H(x) without explicitly preserving the input information.When we say "residual functions with reference to the layer inputs", we're describing the core idea behind residual networks. Here's a more detailed explanation:
In a traditional neural network layer, we try to learn a function H(x) that maps input x to some desired output.
In a residual network, instead of trying to learn H(x) directly, we reformulate the problem. We try to learn a function F(x) such that: H(x) = F(x) + x Here, F(x) is called the residual function.
The "reference" in this case is the input x. We're learning F(x) with respect to (or in reference to) the input x, rather than trying to learn the entire transformation H(x) from scratch.
This means that the network only needs to learn the difference (or residual) between the input and the desired output, rather than the full transformation.
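The reformulation H(x) = F(x) + x can be sketched in a few lines of NumPy. This is a toy forward pass, not a real training setup; the weights and shapes are invented for illustration. Note that when the weights are zero, F(x) = 0 and the block reduces exactly to the identity, which is why adding such blocks cannot force the network away from an identity mapping:

```python
import numpy as np

def residual_block(x, W1, W2):
    """y = F(x) + x, with F a small two-layer transformation.
    The block only has to learn the residual F(x) = H(x) - x."""
    h = np.maximum(0, x @ W1)   # first layer + ReLU
    f = h @ W2                  # residual function F(x)
    return f + x                # skip connection adds the input back

d = 8
rng = np.random.default_rng(0)
x = rng.normal(size=(4, d))

# with zero weights, F(x) = 0 and the block is exactly the identity
W1 = np.zeros((d, d))
W2 = np.zeros((d, d))
y = residual_block(x, W1, W2)   # y == x
```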
<br>Residuals work because they provide a direct path for gradients to flow backward through the network during <a data-href="backpropagation" href="online-vault/ml-concepts/backpropagation.html" class="internal-link" target="_self" rel="noopener nofollow">backpropagation</a>. This helps mitigate the vanishing gradient problem. Additionally, they allow the network to easily learn identity mappings by "skipping" layers if needed. This means that adding more layers doesn't hurt performance, as the network can choose to use or ignore these extra layers.To use residual connections correctly:
Add them between layers or blocks of layers in your network.
Ensure that the dimensions of the input and output match. If they don't, use a 1x1 convolution to adjust the dimensions.
Use them in combination with batch normalization and ReLU activations for best results.
Be careful not to overuse them, as this can lead to diminishing returns.
Residuals are used in what is called a ResNet Block, which architecture is as follows :<br><img alt="resnet_block.png" src="online-vault/images/resnet_block.png" target="_self"><br>To match dimensions, <a data-href="1x1 Convolutions" href="online-vault/ml-concepts/dimensionality-reduction/1x1-convolutions.html" class="internal-link" target="_self" rel="noopener nofollow">1x1 Convolutions</a> are used to increase or decrease channels.Let's dive deeper into how 1x1 convolutions are used to adjust the number of channels in residual blocks.
Basic concept of 1x1 convolutions: A 1x1 convolution is essentially a linear transformation applied to each spatial location independently.
It operates on a single pixel at a time across all input channels. Channel adjustment: The number of filters in the 1x1 convolution determines the number of output channels.
If you want to increase channels: use more filters than input channels.
If you want to decrease channels: use fewer filters than input channels. How it works in residual blocks:
Let's say we have an input with C_in channels.
Our residual function outputs C_out channels.
If C_in ≠ C_out, we use a 1x1 convolution with C_out filters on the shortcut path.
If spatial dimensions also need to change, you can use a stride &gt; 1 in the 1x1 convolution.
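A 1x1 convolution reduces to a per-pixel matrix multiplication across channels, which a short NumPy sketch makes concrete (shapes chosen arbitrarily for illustration):

```python
import numpy as np

def conv1x1(x, W):
    """1x1 convolution as a per-pixel linear map across channels.
    x: (H, W, C_in) feature map; W: (C_in, C_out) filter bank.
    Each spatial location is transformed independently."""
    return np.einsum("hwc,cd->hwd", x, W)

rng = np.random.default_rng(0)
x = rng.normal(size=(16, 16, 64))    # feature map with 64 input channels
W = rng.normal(size=(64, 128))       # 128 filters -> 128 output channels
y = conv1x1(x, W)                    # spatial size unchanged: (16, 16, 128)
```

Equivalently, `y[i, j] == x[i, j] @ W` at every pixel, which is why the filter count alone decides the number of output channels.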
<br>Practical example : <a data-href="U-net architecture" href="online-vault/ml-concepts/models/u-net-architecture.html" class="internal-link" target="_self" rel="noopener nofollow">U-net architecture</a>]]></description><link>online-vault/ml-concepts/models/residual-networks.html</link><guid isPermaLink="false">Online Vault/ML concepts/Models/Residual Networks.md</guid><pubDate>Tue, 03 Sep 2024 09:10:04 GMT</pubDate><enclosure url="." length="0" type="false"/><content:encoded>&lt;figure&gt;&lt;img src=&quot;.&quot;&gt;&lt;/figure&gt;</content:encoded></item><item><title><![CDATA[Segment Anything Model 2]]></title><link>online-vault/ml-concepts/models/segment-anything-model-2.html</link><guid isPermaLink="false">Online Vault/ML concepts/Models/Segment Anything Model 2.md</guid><pubDate>Mon, 02 Sep 2024 08:23:37 GMT</pubDate></item><item><title><![CDATA[inverse_mapping_latent_space]]></title><description><![CDATA[<img src="online-vault/images/inverse_mapping_latent_space.png" target="_self">]]></description><link>online-vault/images/inverse_mapping_latent_space.html</link><guid isPermaLink="false">Online Vault/Images/inverse_mapping_latent_space.png</guid><pubDate>Tue, 27 Aug 2024 13:16:09 GMT</pubDate><enclosure url="." length="0" type="false"/><content:encoded>&lt;figure&gt;&lt;img src=&quot;.&quot;&gt;&lt;/figure&gt;</content:encoded></item><item><title><![CDATA[latent_space_arithmetic]]></title><description><![CDATA[<img src="online-vault/images/latent_space_arithmetic.gif" target="_self">]]></description><link>online-vault/images/latent_space_arithmetic.html</link><guid isPermaLink="false">Online Vault/Images/latent_space_arithmetic.gif</guid><pubDate>Tue, 27 Aug 2024 13:07:12 GMT</pubDate><enclosure url="." 
length="0" type="false"/><content:encoded>&lt;figure&gt;&lt;img src=&quot;.&quot;&gt;&lt;/figure&gt;</content:encoded></item><item><title><![CDATA[Pasted image 20240827150646]]></title><description><![CDATA[<img src="online-vault/images/pasted-image-20240827150646.png" target="_self">]]></description><link>online-vault/images/pasted-image-20240827150646.html</link><guid isPermaLink="false">Online Vault/Images/Pasted image 20240827150646.png</guid><pubDate>Tue, 27 Aug 2024 13:06:46 GMT</pubDate><enclosure url="." length="0" type="false"/><content:encoded>&lt;figure&gt;&lt;img src=&quot;.&quot;&gt;&lt;/figure&gt;</content:encoded></item><item><title><![CDATA[Pasted image 20240827150554]]></title><description><![CDATA[<img src="online-vault/images/pasted-image-20240827150554.png" target="_self">]]></description><link>online-vault/images/pasted-image-20240827150554.html</link><guid isPermaLink="false">Online Vault/Images/Pasted image 20240827150554.png</guid><pubDate>Tue, 27 Aug 2024 13:05:54 GMT</pubDate><enclosure url="." length="0" type="false"/><content:encoded>&lt;figure&gt;&lt;img src=&quot;.&quot;&gt;&lt;/figure&gt;</content:encoded></item><item><title><![CDATA[Wasserstein_interpolation]]></title><description><![CDATA[<img src="online-vault/images/wasserstein_interpolation.png" target="_self">]]></description><link>online-vault/images/wasserstein_interpolation.html</link><guid isPermaLink="false">Online Vault/Images/Wasserstein_interpolation.png</guid><pubDate>Tue, 27 Aug 2024 12:33:49 GMT</pubDate><enclosure url="." 
length="0" type="false"/><content:encoded>&lt;figure&gt;&lt;img src=&quot;.&quot;&gt;&lt;/figure&gt;</content:encoded></item><item><title><![CDATA[smile_vector_latent]]></title><description><![CDATA[<img src="online-vault/images/smile_vector_latent.png" target="_self">]]></description><link>online-vault/images/smile_vector_latent.html</link><guid isPermaLink="false">Online Vault/Images/smile_vector_latent.png</guid><pubDate>Tue, 27 Aug 2024 12:06:11 GMT</pubDate><enclosure url="." length="0" type="false"/><content:encoded>&lt;figure&gt;&lt;img src=&quot;.&quot;&gt;&lt;/figure&gt;</content:encoded></item><item><title><![CDATA[Visualization of the DeepSDF latent space using t-SNE]]></title><link>online-vault/ml-concepts/visualization-of-the-deepsdf-latent-space-using-t-sne.html</link><guid isPermaLink="false">Online Vault/ML concepts/Visualization of the DeepSDF latent space using t-SNE.mp4</guid><pubDate>Tue, 27 Aug 2024 09:51:44 GMT</pubDate></item><item><title><![CDATA[Pasted image 20240827114518]]></title><description><![CDATA[<img src="online-vault/images/pasted-image-20240827114518.png" target="_self">]]></description><link>online-vault/images/pasted-image-20240827114518.html</link><guid isPermaLink="false">Online Vault/Images/Pasted image 20240827114518.png</guid><pubDate>Tue, 27 Aug 2024 09:45:18 GMT</pubDate><enclosure url="." length="0" type="false"/><content:encoded>&lt;figure&gt;&lt;img src=&quot;.&quot;&gt;&lt;/figure&gt;</content:encoded></item><item><title><![CDATA[t-sne_clothing]]></title><description><![CDATA[<img src="online-vault/images/t-sne_clothing.png" target="_self">]]></description><link>online-vault/images/t-sne_clothing.html</link><guid isPermaLink="false">Online Vault/Images/t-sne_clothing.png</guid><pubDate>Tue, 27 Aug 2024 09:34:06 GMT</pubDate><enclosure url="." 
length="0" type="false"/><content:encoded>&lt;figure&gt;&lt;img src=&quot;.&quot;&gt;&lt;/figure&gt;</content:encoded></item><item><title><![CDATA[Autoencoders]]></title><link>online-vault/ml-concepts/models/autoencoders.html</link><guid isPermaLink="false">Online Vault/ML concepts/Models/Autoencoders.md</guid><pubDate>Tue, 27 Aug 2024 09:07:02 GMT</pubDate></item><item><title><![CDATA[AE_latent]]></title><description><![CDATA[<img src="online-vault/images/ae_latent.png" target="_self">]]></description><link>online-vault/images/ae_latent.html</link><guid isPermaLink="false">Online Vault/Images/AE_latent.png</guid><pubDate>Tue, 27 Aug 2024 09:03:28 GMT</pubDate><enclosure url="." length="0" type="false"/><content:encoded>&lt;figure&gt;&lt;img src=&quot;.&quot;&gt;&lt;/figure&gt;</content:encoded></item><item><title><![CDATA[U-net architecture]]></title><description><![CDATA[Visual explanation : <a data-tooltip-position="top" aria-label="https://www.youtube.com/watch?v=NhdzGfB1q74" rel="noopener nofollow" class="external-link is-unresolved" href="https://www.youtube.com/watch?v=NhdzGfB1q74" target="_self">The U-Net (actually) explained in 10 minutes</a><br>Use cases : <a data-href="Semantic Segmentation" href=".html" class="internal-link" target="_self" rel="noopener nofollow">Semantic Segmentation</a>, <a data-href="Diffusion Models" href="online-vault/ml-concepts/models/diffusion-models.html" class="internal-link" target="_self" rel="noopener nofollow">Diffusion Models</a>, <a data-href="Super-Resolution" href=".html" class="internal-link" target="_self" rel="noopener nofollow">Super-Resolution</a><br><img alt="U-net.png" src="online-vault/images/u-net.png" target="_self">The U-Net architecture is in essence an Encoder-Decoder structure similar to an AutoEncoder, but using convolutional layers. 
The green lines are the residuals linking each encoder block to its decoder counterpart, which are simply concatenated to the decoder feature maps before upscaling. This is essential to get a crisp reference to the original pixel locations and values, greatly improving on the results of a simpler convolutional Encoder-Decoder architecture.]]></description><link>online-vault/ml-concepts/models/u-net-architecture.html</link><guid isPermaLink="false">Online Vault/ML concepts/Models/U-net architecture.md</guid><pubDate>Tue, 27 Aug 2024 08:33:00 GMT</pubDate><enclosure url="." length="0" type="false"/><content:encoded>&lt;figure&gt;&lt;img src=&quot;.&quot;&gt;&lt;/figure&gt;</content:encoded></item><item><title><![CDATA[U-net]]></title><description><![CDATA[<img src="online-vault/images/u-net.png" target="_self">]]></description><link>online-vault/images/u-net.html</link><guid isPermaLink="false">Online Vault/Images/U-net.png</guid><pubDate>Mon, 26 Aug 2024 14:48:43 GMT</pubDate><enclosure url="." length="0" type="false"/><content:encoded>&lt;figure&gt;&lt;img src=&quot;.&quot;&gt;&lt;/figure&gt;</content:encoded></item><item><title><![CDATA[1x1 Convolutions]]></title><description><![CDATA[What is a 1x1 convolution?A 1x1 convolution, also known as a pointwise convolution, is a type of convolutional layer in a neural network. It is a basic building block of convolutional neural networks (CNNs) and is used to perform feature extraction and transformation.In a standard convolutional layer, the kernel (or filter) slides over the input data with a specified stride and padding, performing a dot product between the kernel and the input data to produce an output feature map. The kernel size determines the number of input data elements that are involved in the computation.In a 1x1 convolution, the kernel size is 1x1, which means that the kernel covers only a single spatial position at a time. 
At each spatial position, this convolution is essentially a learnable linear combination of the input channels, optionally followed by an activation function. The 1x1 convolution has several key properties:
Spatially shared: the same weights are applied independently at every spatial location.
Channel-mixing: each output channel is a weighted sum over all input channels, so the layer can project the channel dimension up or down.
Point-wise: the kernel covers a single spatial position at a time, so spatial structure is left untouched.
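These properties boil down to one fact: a 1x1 convolution applies the same weight matrix to every pixel's channel vector. A small NumPy check of that equivalence (illustrative only; shapes are arbitrary):

```python
import numpy as np

h, w, c_in, c_out = 4, 4, 3, 8
x = np.random.randn(c_in, h, w)
kernel = np.random.randn(c_out, c_in)   # a 1x1 kernel is just a C_out x C_in matrix

# 1x1 convolution: mix channels independently at every spatial position
y = np.einsum('oc,chw->ohw', kernel, x)

# Equivalent view: the same matrix applied to each pixel's channel vector
pixels = x.reshape(c_in, h * w).T       # (H*W, C_in): one row per pixel
y2 = (pixels @ kernel.T).T.reshape(c_out, h, w)

assert np.allclose(y, y2)               # the two computations agree
```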
<img alt="1x1_conv.png" src="online-vault/images/1x1_conv.png" target="_self">]]></description><link>online-vault/ml-concepts/dimensionality-reduction/1x1-convolutions.html</link><guid isPermaLink="false">Online Vault/ML concepts/Dimensionality Reduction/1x1 Convolutions.md</guid><pubDate>Mon, 26 Aug 2024 14:41:36 GMT</pubDate><enclosure url="." length="0" type="false"/><content:encoded>&lt;figure&gt;&lt;img src=&quot;.&quot;&gt;&lt;/figure&gt;</content:encoded></item><item><title><![CDATA[resnet_block]]></title><description><![CDATA[<img src="online-vault/images/resnet_block.png" target="_self">]]></description><link>online-vault/images/resnet_block.html</link><guid isPermaLink="false">Online Vault/Images/resnet_block.png</guid><pubDate>Mon, 26 Aug 2024 14:17:30 GMT</pubDate><enclosure url="." length="0" type="false"/><content:encoded>&lt;figure&gt;&lt;img src=&quot;.&quot;&gt;&lt;/figure&gt;</content:encoded></item><item><title><![CDATA[residuals_image]]></title><description><![CDATA[<img src="online-vault/images/residuals_image.png" target="_self">]]></description><link>online-vault/images/residuals_image.html</link><guid isPermaLink="false">Online Vault/Images/residuals_image.png</guid><pubDate>Mon, 26 Aug 2024 14:12:50 GMT</pubDate><enclosure url="." length="0" type="false"/><content:encoded>&lt;figure&gt;&lt;img src=&quot;.&quot;&gt;&lt;/figure&gt;</content:encoded></item><item><title><![CDATA[RAG simple, local et Open-Source avec GPT4All]]></title><description><![CDATA[This guide follows on from <a data-href="Installer une LLM en local pour un humain local" href="online-vault/tutoriels/installer-une-llm-en-local-pour-un-humain-local.html" class="internal-link" target="_self" rel="noopener nofollow">Installer une LLM en local pour un humain local</a>. 
At the time of writing, Jan's <a data-href="Retrieval-Augmented Generation" href="online-vault/ml-concepts/retrieval-augmented-generation.html" class="internal-link" target="_self" rel="noopener nofollow">Retrieval-Augmented Generation</a> features are less developed than those of <a data-tooltip-position="top" aria-label="https://www.nomic.ai/gpt4all" rel="noopener nofollow" class="external-link is-unresolved" href="https://www.nomic.ai/gpt4all" target="_self">GPT4All</a>, which is therefore what we will use.<br>In short, RAG is a method for improving the result of a query to a <a data-href="Large Language Model" href="online-vault/ml-concepts/large-language-model.html" class="internal-link" target="_self" rel="noopener nofollow">Large Language Model</a> using supplementary documents stored in a database (for example a scientific bibliography), without having to fine-tune the model.<br><img alt="rag_schema.png" src="online-vault/tutoriels/rag_schema.png" target="_self">If you do not have an Nvidia GPU, skip this step; the model runs fine on CPU alone. Otherwise,<br>
Install the <a data-tooltip-position="top" aria-label="https://developer.nvidia.com/cuda-toolkit" rel="noopener nofollow" class="external-link is-unresolved" href="https://developer.nvidia.com/cuda-toolkit" target="_self">CUDA Toolkit</a> so you can use the GPU through CUDA, following these steps:<br><img alt="cuda_toolkit.png" src="online-vault/tutoriels/cuda_toolkit.png" target="_self">
Follow the .exe installer's instructions after entering an administrator password. Click Next until you reach this step, then tick the box:<br>
<img alt="visualstudio_cuda.png" src="online-vault/tutoriels/visualstudio_cuda.png" target="_self">
Follow the instructions until the installation completes.
Done: you can now use your graphics card's capabilities for deep learning models!
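One quick way to verify the install from Python (a rough sketch; it only checks that the CUDA tools are on PATH, not that the GPU itself is usable):

```python
import shutil

def cuda_toolkit_on_path():
    """Rough check that the CUDA toolkit and driver utilities are installed:
    looks for the nvcc compiler and the nvidia-smi tool on PATH."""
    return {tool: shutil.which(tool) is not None
            for tool in ("nvcc", "nvidia-smi")}

print(cuda_toolkit_on_path())  # both values should be True on a working install
```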
<br>Go to <a rel="noopener nofollow" class="external-link is-unresolved" href="https://www.nomic.ai/gpt4all" target="_self">https://www.nomic.ai/gpt4all</a> and download the installer suited to your operating system. When the first window appears, click Settings at the bottom left:<br>
<img alt="config_gpt4all.png" src="online-vault/tutoriels/config_gpt4all.png" target="_self">
in order to configure the HTTP proxy: cache.univ-st-etienne.fr.
Replace port 0 with the port actually used on your network.<br><img alt="proxy_gpt4all.png" src="online-vault/tutoriels/proxy_gpt4all.png" target="_self">The default installation folder is inside your User directory:
C:\Users\username\gpt4all ; change it if you want to install somewhere other than C:\. Follow the steps through to the end:
If the proxy is configured correctly, the data is downloaded over the internet.
Otherwise, the installation can also be done entirely locally. Once the installation is finished, you can start using GPT4All! 🎉<br><a data-tooltip-position="top" aria-label="https://docs.gpt4all.io/gpt4all_desktop/quickstart.html" rel="noopener nofollow" class="external-link is-unresolved" href="https://docs.gpt4all.io/gpt4all_desktop/quickstart.html" target="_self">Official quickstart documentation</a> The home screen looks like this:<br><img alt="gpt4all_interface.png" src="online-vault/tutoriels/gpt4all_interface.png" target="_self">The Chats tab lets you talk to locally installed LLMs, or to remote models by providing an API key. LocalDocs computes vector embeddings of text documents for RAG.<br>Find Models gives access to <a data-tooltip-position="top" aria-label="https://huggingface.co/" rel="noopener nofollow" class="external-link is-unresolved" href="https://huggingface.co/" target="_self">HuggingFace</a> repositories for downloading new models. On first use, no model is installed, so you need to install one by clicking Install a Model:<br><img alt="Install_model_first.png" src="online-vault/tutoriels/install_model_first.png" target="_self">
By default, it is recommended to install Meta's Llama 3 or Nous Hermes 2, which work well in English. Note This HuggingFace download interface tends to stop mid-download; in that case, restart GPT4All. I do not know exactly where the problem comes from, since the proxy is configured correctly. Note Installing a model downloaded directly from HuggingFace (e.g. multilingual models) will be covered later; it is more reliable in terms of download stability.
<br><img alt="llama_3install.png" src="online-vault/tutoriels/llama_3install.png" target="_self">Once an LLM is installed, you can start conversing with it in the Chat tab. The chat interface looks like this:<br><img alt="chat_demo.png" src="online-vault/tutoriels/chat_demo.png" target="_self">On the left is the conversation history, saved locally on your machine in the folder chosen during installation (changeable in the settings), here by default: C:\Users\username\AppData\Local\nomic.ai\GPT4All
in .chat files. You can rename or delete conversations as you wish.
The New Chat button starts a new conversation.
To its right, a button toggles the chat history panel. The most important control here is the model selector, a drop-down list for choosing which model you are conversing with:<br><img alt="choix_modèle_gpt4all.png" src="online-vault/tutoriels/choix_modèle_gpt4all.png" target="_self">The two buttons to the left of the model name let you reload the model (for example after changing model-specific settings) and eject it from GPU or CPU memory.<br>What interests us here, though, is mainly the <a data-href="Retrieval-Augmented Generation" href="online-vault/ml-concepts/retrieval-augmented-generation.html" class="internal-link" target="_self" rel="noopener nofollow">Retrieval-Augmented Generation</a> feature, implemented in the built-in <a data-tooltip-position="top" aria-label="https://docs.gpt4all.io/gpt4all_desktop/localdocs.html" rel="noopener nofollow" class="external-link is-unresolved" href="https://docs.gpt4all.io/gpt4all_desktop/localdocs.html" target="_self">LocalDocs</a> plugin, which makes it easy to turn a text knowledge base into a semantic vector database. The simple interface lets you add new folders of text files for automatic embedding with the nomic-embed-text-v1.5 model shipped with the software.<br><img alt="localdocs_param.png" src="online-vault/tutoriels/localdocs_param.png" target="_self">By clicking Add Collection:<br><img alt="add_document_localdocs.png" src="online-vault/tutoriels/add_document_localdocs.png" target="_self">You can add a folder containing .txt, .pdf, or markdown .md files. 
The folder tree is traversed recursively in full, so you can use as many subfolders as you like. Once the collection is loaded, the embedding model starts building the vector database, which updates automatically whenever the folder changes. Important The embedding process is much faster with a graphics card, but also works on CPU. We will see below how to configure the GPU.
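Under the hood, retrieval over such a vector database amounts to comparing embeddings by similarity. A toy NumPy sketch of the idea (illustrative only; LocalDocs uses nomic-embed-text embeddings rather than these random vectors):

```python
import numpy as np

def cosine_sim(a, b):
    """Cosine similarity between two embedding vectors."""
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

# Toy "database": one embedding vector per document snippet
rng = np.random.default_rng(0)
doc_embeddings = {f"doc_{i}": rng.normal(size=64) for i in range(5)}

def retrieve(query_vec, k=2):
    """Return the k snippet names whose embeddings are closest to the query."""
    scored = sorted(doc_embeddings.items(),
                    key=lambda kv: cosine_sim(query_vec, kv[1]),
                    reverse=True)
    return [name for name, _ in scored[:k]]

# A query embedded close to doc_3 should retrieve doc_3 first
query = doc_embeddings["doc_3"] + rng.normal(scale=0.01, size=64)
print(retrieve(query))
```

The retrieved snippets are then prepended to the prompt so the LLM can answer from them.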
And that's it! Once the embedding is finished, you can chat with your text base, selecting the folder you are interested in and asking questions about the documents, here my local scientific bibliography:<br><img alt="Pasted image 20240717112831.png" src="online-vault/tutoriels/pasted-image-20240717112831.png" target="_self">Let us now go through the important configuration items: ⚡ Application tab:<br><img alt="settings_gpt4all.png" src="online-vault/tutoriels/settings_gpt4all.png" target="_self">Under Device, select your CUDA graphics card if you have one; otherwise leave it on auto.
You can also contribute to the open-source Datalake of Nomic.ai, the developer of GPT4All. This Datalake is used to train future models and is accessible to everyone for personal use.
This is also where you can change GPT4All's root folder. Huggingface 🤗
It is also in this folder that you can copy any model in .GGUF format downloaded from HuggingFace to import it into GPT4All.
🛠In the Model tab: you can adjust each model's specific parameters. Changes can make the model's answers unstable, so modify them with caution.📄 In the LocalDocs tab:<br><img alt="localdocs_settings.png" src="online-vault/tutoriels/localdocs_settings.png" target="_self">Allowed File Extensions lets you add extensions readable by the embedding model. (I do not know which other formats are accepted.) Embedding can also be done online with a Nomic AI model through their API rather than with the local model, to be saved to the cloud of your Nomic user account. The important setting here:
Embeddings Device -&gt; select your CUDA GPU to compute embeddings on the GPU rather than the CPU. By default, the CPU is used.
📄Tick Show Sources if it is not already checked; this lets the model cite which document each snippet is taken from.<br>For more information, the <a data-tooltip-position="top" aria-label="https://docs.gpt4all.io/index.html" rel="noopener nofollow" class="external-link is-unresolved" href="https://docs.gpt4all.io/index.html" target="_self">GPT4All documentation</a> will be more exhaustive.<br>Step 0: go to <a rel="noopener nofollow" class="external-link is-unresolved" href="https://huggingface.co/" target="_self">https://huggingface.co/</a>
Step 1: type the name of your model (+ gguf if it does not appear)
, for example for a multilingual model such as Aya-23:<br><img alt="HF_tuto.png" src="online-vault/tutoriels/hf_tuto.png" target="_self">Choose the desired size; here 8B suits my GPU, with the GGUF suffix from the user bartowski.<br>On the page there is a summary table of model quality by level of weight-quantization compression (<a data-tooltip-position="top" aria-label="https://huggingface.co/docs/optimum/en/concept_guides/quantization" rel="noopener nofollow" class="external-link is-unresolved" href="https://huggingface.co/docs/optimum/en/concept_guides/quantization" target="_self">Quantization</a>). You can download whichever model seems appropriate for your available GPU memory.Memory footprint
The model's file size does not necessarily match the GPU VRAM it occupies; this varies with the number of layers loaded into memory and the Python kernel used for inference. In general, 4-bit Q4_K_M quantization is the most widespread.
Download a file (not the whole branch) from below: click the model's link, then click download:<br>
<img alt="dl_hugging_face_gguf.png" src="online-vault/tutoriels/dl_hugging_face_gguf.png" target="_self">Once the download is finished, copy the file into GPT4All's root folder:
C:/Users/&lt;username&gt;/AppData/Local/nomic.ai/GPT4All/Then simply restart the application to use this new model! It can be seen in the model list (here my version differs from the one above):<br><img alt="aya_loaded.png" src="online-vault/tutoriels/aya_loaded.png" target="_self"><br><img alt="gpu_mem.png" src="online-vault/tutoriels/gpu_mem.png" target="_self">GPU memory
Here the 6-bit quantized model, which weighs 6.14 GB on disk, actually occupies 9.3 GB during inference. This is why lighter models are often used, so that they stay under 8 GB, the usual GPU memory capacity for an average consumer.
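As a rough rule of thumb (an assumption, not an official formula), a GGUF file weighs about parameters × bits-per-weight / 8, and the runtime footprint adds the KV cache and activations on top:

```python
def gguf_size_gb(n_params, bits_per_weight):
    """Approximate on-disk size of a quantized model in decimal GB.
    Runtime VRAM use is higher: KV cache and activations come on top,
    which is how a ~6 GB file can occupy ~9 GB during inference."""
    return n_params * bits_per_weight / 8 / 1e9

# 8B-parameter model at Q4_K_M (~4.5 effective bits per weight, an estimate)
print(round(gguf_size_gb(8e9, 4.5), 2))  # ≈ 4.5 GB
```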
We can then test the model's multilingual ability:<br><img alt="Aya-23_demo.png" src="online-vault/tutoriels/aya-23_demo.png" target="_self"><br>
<img alt="aya_banana.png" src="online-vault/tutoriels/aya_banana.png" target="_self">You have reached the end of this guide! Well done: you can now make full use of your local open-source LLM to chat with your documents! GPT4All's features are not limited to this; for more information, refer to their official documentation.]]></description><link>online-vault/tutoriels/rag-simple,-local-et-open-source-avec-gpt4all.html</link><guid isPermaLink="false">Online Vault/Tutoriels/RAG simple, local et Open-Source avec GPT4All.md</guid><pubDate>Wed, 17 Jul 2024 12:21:07 GMT</pubDate><enclosure url="." length="0" type="false"/><content:encoded>&lt;figure&gt;&lt;img src=&quot;.&quot;&gt;&lt;/figure&gt;</content:encoded></item><item><title><![CDATA[aya_banana]]></title><description><![CDATA[<img src="online-vault/tutoriels/aya_banana.png" target="_self">]]></description><link>online-vault/tutoriels/aya_banana.html</link><guid isPermaLink="false">Online Vault/Tutoriels/aya_banana.png</guid><pubDate>Wed, 17 Jul 2024 12:19:00 GMT</pubDate><enclosure url="." length="0" type="false"/><content:encoded>&lt;figure&gt;&lt;img src=&quot;.&quot;&gt;&lt;/figure&gt;</content:encoded></item><item><title><![CDATA[Aya-23_demo]]></title><description><![CDATA[<img src="online-vault/tutoriels/aya-23_demo.png" target="_self">]]></description><link>online-vault/tutoriels/aya-23_demo.html</link><guid isPermaLink="false">Online Vault/Tutoriels/Aya-23_demo.png</guid><pubDate>Wed, 17 Jul 2024 12:16:10 GMT</pubDate><enclosure url="." 
length="0" type="false"/><content:encoded>&lt;figure&gt;&lt;img src=&quot;.&quot;&gt;&lt;/figure&gt;</content:encoded></item><item><title><![CDATA[gpu_mem]]></title><description><![CDATA[<img src="online-vault/tutoriels/gpu_mem.png" target="_self">]]></description><link>online-vault/tutoriels/gpu_mem.html</link><guid isPermaLink="false">Online Vault/Tutoriels/gpu_mem.png</guid><pubDate>Wed, 17 Jul 2024 12:11:26 GMT</pubDate><enclosure url="." length="0" type="false"/><content:encoded>&lt;figure&gt;&lt;img src=&quot;.&quot;&gt;&lt;/figure&gt;</content:encoded></item><item><title><![CDATA[aya_loaded]]></title><description><![CDATA[<img src="online-vault/tutoriels/aya_loaded.png" target="_self">]]></description><link>online-vault/tutoriels/aya_loaded.html</link><guid isPermaLink="false">Online Vault/Tutoriels/aya_loaded.png</guid><pubDate>Wed, 17 Jul 2024 12:10:17 GMT</pubDate><enclosure url="." length="0" type="false"/><content:encoded>&lt;figure&gt;&lt;img src=&quot;.&quot;&gt;&lt;/figure&gt;</content:encoded></item><item><title><![CDATA[dl_hugging_face_gguf]]></title><description><![CDATA[<img src="online-vault/tutoriels/dl_hugging_face_gguf.png" target="_self">]]></description><link>online-vault/tutoriels/dl_hugging_face_gguf.html</link><guid isPermaLink="false">Online Vault/Tutoriels/dl_hugging_face_gguf.png</guid><pubDate>Wed, 17 Jul 2024 12:03:26 GMT</pubDate><enclosure url="." length="0" type="false"/><content:encoded>&lt;figure&gt;&lt;img src=&quot;.&quot;&gt;&lt;/figure&gt;</content:encoded></item><item><title><![CDATA[HF_tuto]]></title><description><![CDATA[<img src="online-vault/tutoriels/hf_tuto.png" target="_self">]]></description><link>online-vault/tutoriels/hf_tuto.html</link><guid isPermaLink="false">Online Vault/Tutoriels/HF_tuto.png</guid><pubDate>Wed, 17 Jul 2024 10:01:53 GMT</pubDate><enclosure url="." 
length="0" type="false"/><content:encoded>&lt;figure&gt;&lt;img src=&quot;.&quot;&gt;&lt;/figure&gt;</content:encoded></item><item><title><![CDATA[localdocs_settings]]></title><description><![CDATA[<img src="online-vault/tutoriels/localdocs_settings.png" target="_self">]]></description><link>online-vault/tutoriels/localdocs_settings.html</link><guid isPermaLink="false">Online Vault/Tutoriels/localdocs_settings.png</guid><pubDate>Wed, 17 Jul 2024 09:47:40 GMT</pubDate><enclosure url="." length="0" type="false"/><content:encoded>&lt;figure&gt;&lt;img src=&quot;.&quot;&gt;&lt;/figure&gt;</content:encoded></item><item><title><![CDATA[settings_gpt4all]]></title><description><![CDATA[<img src="online-vault/tutoriels/settings_gpt4all.png" target="_self">]]></description><link>online-vault/tutoriels/settings_gpt4all.html</link><guid isPermaLink="false">Online Vault/Tutoriels/settings_gpt4all.png</guid><pubDate>Wed, 17 Jul 2024 09:36:32 GMT</pubDate><enclosure url="." length="0" type="false"/><content:encoded>&lt;figure&gt;&lt;img src=&quot;.&quot;&gt;&lt;/figure&gt;</content:encoded></item><item><title><![CDATA[Pasted image 20240717112831]]></title><description><![CDATA[<img src="online-vault/tutoriels/pasted-image-20240717112831.png" target="_self">]]></description><link>online-vault/tutoriels/pasted-image-20240717112831.html</link><guid isPermaLink="false">Online Vault/Tutoriels/Pasted image 20240717112831.png</guid><pubDate>Wed, 17 Jul 2024 09:28:31 GMT</pubDate><enclosure url="." 
length="0" type="false"/><content:encoded>&lt;figure&gt;&lt;img src=&quot;.&quot;&gt;&lt;/figure&gt;</content:encoded></item><item><title><![CDATA[add_document_localdocs]]></title><description><![CDATA[<img src="online-vault/tutoriels/add_document_localdocs.png" target="_self">]]></description><link>online-vault/tutoriels/add_document_localdocs.html</link><guid isPermaLink="false">Online Vault/Tutoriels/add_document_localdocs.png</guid><pubDate>Wed, 17 Jul 2024 09:22:15 GMT</pubDate><enclosure url="." length="0" type="false"/><content:encoded>&lt;figure&gt;&lt;img src=&quot;.&quot;&gt;&lt;/figure&gt;</content:encoded></item><item><title><![CDATA[localdocs_param]]></title><description><![CDATA[<img src="online-vault/tutoriels/localdocs_param.png" target="_self">]]></description><link>online-vault/tutoriels/localdocs_param.html</link><guid isPermaLink="false">Online Vault/Tutoriels/localdocs_param.png</guid><pubDate>Wed, 17 Jul 2024 09:19:41 GMT</pubDate><enclosure url="." length="0" type="false"/><content:encoded>&lt;figure&gt;&lt;img src=&quot;.&quot;&gt;&lt;/figure&gt;</content:encoded></item><item><title><![CDATA[choix_modèle_gpt4all]]></title><description><![CDATA[<img src="online-vault/tutoriels/choix_modèle_gpt4all.png" target="_self">]]></description><link>online-vault/tutoriels/choix_modèle_gpt4all.html</link><guid isPermaLink="false">Online Vault/Tutoriels/choix_modèle_gpt4all.png</guid><pubDate>Wed, 17 Jul 2024 09:13:28 GMT</pubDate><enclosure url="." length="0" type="false"/><content:encoded>&lt;figure&gt;&lt;img src=&quot;.&quot;&gt;&lt;/figure&gt;</content:encoded></item><item><title><![CDATA[chat_demo]]></title><description><![CDATA[<img src="online-vault/tutoriels/chat_demo.png" target="_self">]]></description><link>online-vault/tutoriels/chat_demo.html</link><guid isPermaLink="false">Online Vault/Tutoriels/chat_demo.png</guid><pubDate>Wed, 17 Jul 2024 09:08:00 GMT</pubDate><enclosure url="." 
length="0" type="false"/><content:encoded>&lt;figure&gt;&lt;img src=&quot;.&quot;&gt;&lt;/figure&gt;</content:encoded></item><item><title><![CDATA[llama_3install]]></title><description><![CDATA[<img src="online-vault/tutoriels/llama_3install.png" target="_self">]]></description><link>online-vault/tutoriels/llama_3install.html</link><guid isPermaLink="false">Online Vault/Tutoriels/llama_3install.png</guid><pubDate>Tue, 16 Jul 2024 09:52:14 GMT</pubDate><enclosure url="." length="0" type="false"/><content:encoded>&lt;figure&gt;&lt;img src=&quot;.&quot;&gt;&lt;/figure&gt;</content:encoded></item><item><title><![CDATA[Install_model_first]]></title><description><![CDATA[<img src="online-vault/tutoriels/install_model_first.png" target="_self">]]></description><link>online-vault/tutoriels/install_model_first.html</link><guid isPermaLink="false">Online Vault/Tutoriels/Install_model_first.png</guid><pubDate>Tue, 16 Jul 2024 09:48:11 GMT</pubDate><enclosure url="." length="0" type="false"/><content:encoded>&lt;figure&gt;&lt;img src=&quot;.&quot;&gt;&lt;/figure&gt;</content:encoded></item><item><title><![CDATA[gpt4all_interface]]></title><description><![CDATA[<img src="online-vault/tutoriels/gpt4all_interface.png" target="_self">]]></description><link>online-vault/tutoriels/gpt4all_interface.html</link><guid isPermaLink="false">Online Vault/Tutoriels/gpt4all_interface.png</guid><pubDate>Tue, 16 Jul 2024 09:40:53 GMT</pubDate><enclosure url="." length="0" type="false"/><content:encoded>&lt;figure&gt;&lt;img src=&quot;.&quot;&gt;&lt;/figure&gt;</content:encoded></item><item><title><![CDATA[proxy_gpt4all]]></title><description><![CDATA[<img src="online-vault/tutoriels/proxy_gpt4all.png" target="_self">]]></description><link>online-vault/tutoriels/proxy_gpt4all.html</link><guid isPermaLink="false">Online Vault/Tutoriels/proxy_gpt4all.png</guid><pubDate>Tue, 16 Jul 2024 09:35:09 GMT</pubDate><enclosure url="." 
length="0" type="false"/><content:encoded>&lt;figure&gt;&lt;img src=&quot;.&quot;&gt;&lt;/figure&gt;</content:encoded></item><item><title><![CDATA[config_gpt4all]]></title><description><![CDATA[<img src="online-vault/tutoriels/config_gpt4all.png" target="_self">]]></description><link>online-vault/tutoriels/config_gpt4all.html</link><guid isPermaLink="false">Online Vault/Tutoriels/config_gpt4all.png</guid><pubDate>Tue, 16 Jul 2024 09:19:30 GMT</pubDate><enclosure url="." length="0" type="false"/><content:encoded>&lt;figure&gt;&lt;img src=&quot;.&quot;&gt;&lt;/figure&gt;</content:encoded></item><item><title><![CDATA[rag_schema]]></title><description><![CDATA[<img src="online-vault/tutoriels/rag_schema.png" target="_self">]]></description><link>online-vault/tutoriels/rag_schema.html</link><guid isPermaLink="false">Online Vault/Tutoriels/rag_schema.png</guid><pubDate>Tue, 16 Jul 2024 09:14:57 GMT</pubDate><enclosure url="." length="0" type="false"/><content:encoded>&lt;figure&gt;&lt;img src=&quot;.&quot;&gt;&lt;/figure&gt;</content:encoded></item><item><title><![CDATA[Approximating XGBoost with an interpretable decision tree]]></title><description><![CDATA[This paper describes how to simplify a forest of decision trees into a single decision tree without much loss. 
In particular, it explores this application to <a data-href="XGBoost" href="online-vault/ml-concepts/models/xgboost.html" class="internal-link" target="_self" rel="noopener nofollow">XGBoost</a>.<br><a data-tooltip-position="top" aria-label="https://cris.bgu.ac.il/en/publications/approximating-xgboost-with-an-interpretable-decision-tree" rel="noopener nofollow" class="external-link is-unresolved" href="https://cris.bgu.ac.il/en/publications/approximating-xgboost-with-an-interpretable-decision-tree" target="_self">The paper “Approximating XGBoost with an interpretable decision tree” presents a novel method for transforming a decision forest of any kind into an interpretable decision tree</a><a data-tooltip-position="top" aria-label="https://cris.bgu.ac.il/en/publications/approximating-xgboost-with-an-interpretable-decision-tree" rel="noopener nofollow" class="external-link is-unresolved" href="https://cris.bgu.ac.il/en/publications/approximating-xgboost-with-an-interpretable-decision-tree" target="_self">1</a><a data-tooltip-position="top" aria-label="https://www.scinapse.io/papers/3169203486" rel="noopener nofollow" class="external-link is-unresolved" href="https://www.scinapse.io/papers/3169203486" target="_self">2</a>. 
<a data-tooltip-position="top" aria-label="https://cris.bgu.ac.il/en/publications/approximating-xgboost-with-an-interpretable-decision-tree" rel="noopener nofollow" class="external-link is-unresolved" href="https://cris.bgu.ac.il/en/publications/approximating-xgboost-with-an-interpretable-decision-tree" target="_self">This method is particularly useful for machine learning practitioners who want to exploit the interpretability of decision trees without significantly impairing the predictive performance gained by Gradient Boosting Decision Trees (GBDT) models like XGBoost</a><a data-tooltip-position="top" aria-label="https://cris.bgu.ac.il/en/publications/approximating-xgboost-with-an-interpretable-decision-tree" rel="noopener nofollow" class="external-link is-unresolved" href="https://cris.bgu.ac.il/en/publications/approximating-xgboost-with-an-interpretable-decision-tree" target="_self">1</a><a data-tooltip-position="top" aria-label="https://www.scinapse.io/papers/3169203486" rel="noopener nofollow" class="external-link is-unresolved" href="https://www.scinapse.io/papers/3169203486" target="_self">2</a>.Main Ideas:
<br><a data-tooltip-position="top" aria-label="https://cris.bgu.ac.il/en/publications/approximating-xgboost-with-an-interpretable-decision-tree" rel="noopener nofollow" class="external-link is-unresolved" href="https://cris.bgu.ac.il/en/publications/approximating-xgboost-with-an-interpretable-decision-tree" target="_self">The paper addresses the increasing need for interpretable machine-learning models, especially in critical domains like healthcare and finance where the model consumer must understand the rationale behind the model output to make a decision</a><a data-tooltip-position="top" aria-label="https://cris.bgu.ac.il/en/publications/approximating-xgboost-with-an-interpretable-decision-tree" rel="noopener nofollow" class="external-link is-unresolved" href="https://cris.bgu.ac.il/en/publications/approximating-xgboost-with-an-interpretable-decision-tree" target="_self">1</a>.
<br><a data-tooltip-position="top" aria-label="https://cris.bgu.ac.il/en/publications/approximating-xgboost-with-an-interpretable-decision-tree" rel="noopener nofollow" class="external-link is-unresolved" href="https://cris.bgu.ac.il/en/publications/approximating-xgboost-with-an-interpretable-decision-tree" target="_self">Despite the superior predictive performance of GBDT models, they cannot be used in tasks that require transparency</a><a data-tooltip-position="top" aria-label="https://cris.bgu.ac.il/en/publications/approximating-xgboost-with-an-interpretable-decision-tree" rel="noopener nofollow" class="external-link is-unresolved" href="https://cris.bgu.ac.il/en/publications/approximating-xgboost-with-an-interpretable-decision-tree" target="_self">1</a>.
<br><a data-tooltip-position="top" aria-label="https://cris.bgu.ac.il/en/publications/approximating-xgboost-with-an-interpretable-decision-tree" rel="noopener nofollow" class="external-link is-unresolved" href="https://cris.bgu.ac.il/en/publications/approximating-xgboost-with-an-interpretable-decision-tree" target="_self">The authors propose a method to convert a GBDT model into a single decision tree</a><a data-tooltip-position="top" aria-label="https://www.scinapse.io/papers/3169203486" rel="noopener nofollow" class="external-link is-unresolved" href="https://www.scinapse.io/papers/3169203486" target="_self">2</a>. <a data-tooltip-position="top" aria-label="https://cris.bgu.ac.il/en/publications/approximating-xgboost-with-an-interpretable-decision-tree" rel="noopener nofollow" class="external-link is-unresolved" href="https://cris.bgu.ac.il/en/publications/approximating-xgboost-with-an-interpretable-decision-tree" target="_self">This allows for better transparency of the outputs while approximating the predictive performance of an XGBoost model</a><a data-tooltip-position="top" aria-label="https://cris.bgu.ac.il/en/publications/approximating-xgboost-with-an-interpretable-decision-tree" rel="noopener nofollow" class="external-link is-unresolved" href="https://cris.bgu.ac.il/en/publications/approximating-xgboost-with-an-interpretable-decision-tree" target="_self">1</a><a data-tooltip-position="top" aria-label="https://www.x-mol.com/paper/1398390532918984704" rel="noopener nofollow" class="external-link is-unresolved" href="https://www.x-mol.com/paper/1398390532918984704" target="_self">3</a>.
Performance:
<br><a data-tooltip-position="top" aria-label="https://cris.bgu.ac.il/en/publications/approximating-xgboost-with-an-interpretable-decision-tree" rel="noopener nofollow" class="external-link is-unresolved" href="https://cris.bgu.ac.il/en/publications/approximating-xgboost-with-an-interpretable-decision-tree" target="_self">The generated tree approximates the accuracy of its source forest</a><a data-tooltip-position="top" aria-label="https://www.scinapse.io/papers/3169203486" rel="noopener nofollow" class="external-link is-unresolved" href="https://www.scinapse.io/papers/3169203486" target="_self">2</a>.
<br><a data-tooltip-position="top" aria-label="https://cris.bgu.ac.il/en/publications/approximating-xgboost-with-an-interpretable-decision-tree" rel="noopener nofollow" class="external-link is-unresolved" href="https://cris.bgu.ac.il/en/publications/approximating-xgboost-with-an-interpretable-decision-tree" target="_self">In some cases, the generated tree is able to approximate the predictive performance of an XGBoost model</a><a data-tooltip-position="top" aria-label="https://www.x-mol.com/paper/1398390532918984704" rel="noopener nofollow" class="external-link is-unresolved" href="https://www.x-mol.com/paper/1398390532918984704" target="_self">3</a>.
<br><a data-tooltip-position="top" aria-label="https://cris.bgu.ac.il/en/publications/approximating-xgboost-with-an-interpretable-decision-tree" rel="noopener nofollow" class="external-link is-unresolved" href="https://cris.bgu.ac.il/en/publications/approximating-xgboost-with-an-interpretable-decision-tree" target="_self">The generated tree outperforms CART-induced trees in terms of predictive performance</a><a data-tooltip-position="top" aria-label="https://www.scinapse.io/papers/3169203486" rel="noopener nofollow" class="external-link is-unresolved" href="https://www.scinapse.io/papers/3169203486" target="_self">2</a>.
Limitations:
<br><a data-tooltip-position="top" aria-label="https://www.scinapse.io/papers/3169203486" rel="noopener nofollow" class="external-link is-unresolved" href="https://www.scinapse.io/papers/3169203486" target="_self">The complexity of the tree can be configured by the method user</a><a data-tooltip-position="top" aria-label="https://www.scinapse.io/papers/3169203486" rel="noopener nofollow" class="external-link is-unresolved" href="https://www.scinapse.io/papers/3169203486" target="_self">2</a>. This implies that the interpretability and performance of the generated tree could vary based on the configuration, which could be a limitation in certain scenarios.
<br><a data-tooltip-position="top" aria-label="https://cris.bgu.ac.il/en/publications/approximating-xgboost-with-an-interpretable-decision-tree" rel="noopener nofollow" class="external-link is-unresolved" href="https://cris.bgu.ac.il/en/publications/approximating-xgboost-with-an-interpretable-decision-tree" target="_self">For more detailed information, I recommend reading the full paper</a><a data-tooltip-position="top" aria-label="https://cris.bgu.ac.il/en/publications/approximating-xgboost-with-an-interpretable-decision-tree" rel="noopener nofollow" class="external-link is-unresolved" href="https://cris.bgu.ac.il/en/publications/approximating-xgboost-with-an-interpretable-decision-tree" target="_self">1</a>.]]></description><link>online-vault/papers/approximating-xgboost-with-an-interpretable-decision-tree.html</link><guid isPermaLink="false">Online Vault/Papers/Approximating XGBoost with an interpretable decision tree.md</guid><pubDate>Tue, 16 Jul 2024 09:03:18 GMT</pubDate></item><item><title><![CDATA[Setting up a RAG using Gemma 7b model on Colab]]></title><description><![CDATA[<a data-href="Retrieval-Augmented Generation" href="online-vault/ml-concepts/retrieval-augmented-generation.html" class="internal-link" target="_self" rel="noopener nofollow">Retrieval-Augmented Generation</a>Before running any model from the hugging face API, we need an API Key.
To do that, generate an access token on your Hugging Face profile and copy it into the notebook Secrets (the secure API-key store in the Google Colab interface).
You can then access it from a notebook cell:
from google.colab import userdata
userdata.get('secretName')
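As a hedged sketch of how the secret can then be consumed (the secret name HF_TOKEN and the environment-variable fallback are illustrative assumptions, not part of the Colab API):

```python
import os

def get_hf_token(secret_name="HF_TOKEN"):
    """Return the token from Colab's Secrets when running in Colab,
    falling back to an environment variable elsewhere."""
    try:
        # google.colab is only importable inside a Colab runtime
        from google.colab import userdata
        return userdata.get(secret_name)
    except ImportError:
        return os.environ.get(secret_name)

token = get_hf_token()  # e.g. pass this to huggingface_hub.login(token=token)
```

The fallback also lets the same notebook run outside Colab by exporting HF_TOKEN in the shell.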
<br><img alt="hf_token_colab_secret.png" src="online-vault/images/hf_token_colab_secret.png" target="_self">]]></description><link>online-vault/exploration/setting-up-a-rag-using-gemma-7b-model-on-colab.html</link><guid isPermaLink="false">Online Vault/Exploration/Setting up a RAG using Gemma 7b model on Colab.md</guid><pubDate>Tue, 16 Jul 2024 09:01:51 GMT</pubDate><enclosure url="." length="0" type="false"/><content:encoded>&lt;figure&gt;&lt;img src=&quot;.&quot;&gt;&lt;/figure&gt;</content:encoded></item><item><title><![CDATA[Running jupyter or IDE on WSL2]]></title><description><![CDATA[First, if you want to install apps through snapd, configure proxy settings in the snap configuration :(rapids-23.10) brazma@emptre206178:~$ sudo snap set system proxy.http="http://cache.univ-st-etienne.fr:3128"
(rapids-23.10) brazma@emptre206178:~$ sudo snap set system proxy.https="http://cache.univ-st-etienne.fr:3128"
(rapids-23.10) brazma@emptre206178:~$ sudo snap install dataspell --classic You can then install DataSpell and run a Jupyter server manually following <a data-tooltip-position="top" aria-label="https://stackoverflow.com/questions/58068818/how-to-use-jupyter-notebooks-in-a-conda-environment" rel="noopener nofollow" class="external-link is-unresolved" href="https://stackoverflow.com/questions/58068818/how-to-use-jupyter-notebooks-in-a-conda-environment" target="_self">this guide</a> if it doesn't work by default :<br>Jupyter runs the user's code in a separate process called&nbsp;kernel. The kernel can be a different Python installation (in a different conda environment or virtualenv or Python 2 instead of Python 3) or even an interpreter for a different language (e.g. Julia or R). Kernels are configured by specifying the interpreter and a name and some other parameters (see&nbsp;<a data-tooltip-position="top" aria-label="https://jupyter-client.readthedocs.io/en/stable/kernels.html#kernel-specs" rel="noopener nofollow" class="external-link is-unresolved" href="https://jupyter-client.readthedocs.io/en/stable/kernels.html#kernel-specs" target="_self">Jupyter documentation</a>) and configuration can be stored system-wide, for the active environment (or virtualenv) or per user. If&nbsp;nb_conda_kernels&nbsp;is used, additional to statically configured kernels, a separate kernel for each conda environment with&nbsp;ipykernel&nbsp;installed will be available in Jupyter notebooks.In short, there are three options how to use a conda environment and Jupyter:Do something like:conda create -n my-conda-env # creates new virtual env
conda activate my-conda-env # activate environment in terminal
conda install jupyter # install jupyter + notebook
jupyter notebook # start server + kernel inside my-conda-env
Jupyter will be completely installed in the conda environment. Different versions of Jupyter can be used for different conda environments, but this option might be a bit of overkill. It is enough to include the kernel in the environment, which is the component wrapping Python which runs the code. The rest of Jupyter notebook can be considered as editor or viewer and it is not necessary to install this separately for every environment and include it in every&nbsp;env.yml&nbsp;file. Therefore one of the next two options might be preferable, but this one is the simplest one and definitely fine.Do something like:conda create -n my-conda-env # creates new virtual env
conda activate my-conda-env # activate environment in terminal
conda install ipykernel # install Python kernel in new conda env
ipython kernel install --user --name=my-conda-env-kernel # configure Jupyter to use Python kernel
Then run jupyter from the system installation or a different conda environment:conda deactivate # this step can be omitted by using a different terminal window than before
conda install jupyter # optional, might be installed already in system e.g. by 'apt install jupyter' on debian-based systems
jupyter notebook # run jupyter from system
The name of the kernel and the conda environment are independent of each other, but it might make sense to use a similar name.<br>Only the Python kernel will run inside the conda environment; Jupyter from the system or a different conda environment will be used - it is not installed in the conda environment. By calling&nbsp;ipython kernel install&nbsp;Jupyter is configured to use the conda environment as kernel, see&nbsp;<a data-tooltip-position="top" aria-label="https://jupyter-client.readthedocs.io/en/stable/kernels.html#kernel-specs" rel="noopener nofollow" class="external-link is-unresolved" href="https://jupyter-client.readthedocs.io/en/stable/kernels.html#kernel-specs" target="_self">Jupyter documentation</a>&nbsp;and&nbsp;<a data-tooltip-position="top" aria-label="https://ipython.readthedocs.io/en/stable/install/kernel_install.html#kernels-for-different-environments" rel="noopener nofollow" class="external-link is-unresolved" href="https://ipython.readthedocs.io/en/stable/install/kernel_install.html#kernels-for-different-environments" target="_self">IPython documentation</a>&nbsp;for more information. In most Linux installations this configuration is a&nbsp;*.json&nbsp;file in&nbsp;~/.local/share/jupyter/kernels/my-conda-env-kernel/kernel.json:
{
  "argv": [
    "/opt/miniconda3/envs/my-conda-env/bin/python",
    "-m",
    "ipykernel_launcher",
    "-f",
    "{connection_file}"
  ],
  "display_name": "my-conda-env-kernel",
  "language": "python"
}
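A quick way to sanity-check such a spec is to parse it as plain JSON; the snippet below inlines the spec as a string for illustration (in practice you would open the kernel.json file at the Linux default path mentioned above):

```python
import json

# Inlined copy of the kernel spec shown above; normally read from
# ~/.local/share/jupyter/kernels/my-conda-env-kernel/kernel.json
spec_text = """
{
  "argv": ["/opt/miniconda3/envs/my-conda-env/bin/python",
           "-m", "ipykernel_launcher", "-f", "{connection_file}"],
  "display_name": "my-conda-env-kernel",
  "language": "python"
}
"""
spec = json.loads(spec_text)
print(spec["display_name"])  # the name shown in Jupyter's kernel picker
```

If json.loads raises here, the file is malformed and Jupyter will silently ignore the kernel.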
<br>When the&nbsp;<a data-tooltip-position="top" aria-label="https://github.com/Anaconda-Platform/nb_conda_kernels#nb_conda_kernels" rel="noopener nofollow" class="external-link is-unresolved" href="https://github.com/Anaconda-Platform/nb_conda_kernels#nb_conda_kernels" target="_self">package&nbsp;<code></code></a>nb_conda_kernels&nbsp;is installed, a separate kernel is available automatically for each conda environment containing the conda package&nbsp;ipykernel&nbsp;or a different kernel (R, Julia, ...).conda activate my-conda-env # this is the environment for your project and code
conda install ipykernel
conda deactivate
conda activate base # could also be some other environment
conda install nb_conda_kernels
jupyter notebook
You should be able to choose the Kernel&nbsp;Python [conda env:my-conda-env]. Note that&nbsp;nb_conda_kernels&nbsp;seems to be available only via conda and not via pip or other package managers like apt. If the auto-starting of a Jupyter server in DataSpell fails with an error message along these lines:
[...] Warn(f"{klass} is not importable. Is it installed?", ImportWarning)
TypeError: warn() missing 1 required keyword-only argument: 'stacklevel'
follow these steps:
Once you have managed to run a Jupyter server in a terminal following one of the previous options, add a Jupyter server in the Jupyter Servers settings like so:
<br><img alt="Jupyter_1.png" src="online-vault/tutoriels/jupyter_1.png" target="_self">
You will see in the left panel the Jupyter server being added as if it were a project.
Run your files from here instead of the local folder (Biotope_DA here).
PS: apparently it is possible to run a Python environment from the Windows DataSpell using a WSL2 interpreter, but I didn't manage to make it work, and this is the best working solution so far.
Once correctly connected, select the kernel corresponding to the rapids-ai conda environment:<br>
<img alt="Jupyter_2.png" src="online-vault/tutoriels/jupyter_2.png" target="_self">
You can check in the bottom right corner which conda environment your code is using, which should be something like rapids-23.10 ]]></description><link>online-vault/tutoriels/running-jupyter-or-ide-on-wsl2.html</link><guid isPermaLink="false">Online Vault/Tutoriels/Running jupyter or IDE on WSL2.md</guid><pubDate>Thu, 11 Jul 2024 09:58:01 GMT</pubDate><enclosure url="." length="0" type="false"/><content:encoded>&lt;figure&gt;&lt;img src=&quot;.&quot;&gt;&lt;/figure&gt;</content:encoded></item><item><title><![CDATA[engine_params_jan]]></title><description><![CDATA[<img src="online-vault/tutoriels/engine_params_jan.png" target="_self">]]></description><link>online-vault/tutoriels/engine_params_jan.html</link><guid isPermaLink="false">Online Vault/Tutoriels/engine_params_jan.png</guid><pubDate>Thu, 11 Jul 2024 09:18:30 GMT</pubDate><enclosure url="." length="0" type="false"/><content:encoded>&lt;figure&gt;&lt;img src=&quot;.&quot;&gt;&lt;/figure&gt;</content:encoded></item><item><title><![CDATA[paramètres de modèle]]></title><description><![CDATA[<img src="online-vault/tutoriels/paramètres-de-modèle.png" target="_self">]]></description><link>online-vault/tutoriels/paramètres-de-modèle.html</link><guid isPermaLink="false">Online Vault/Tutoriels/paramètres de modèle.png</guid><pubDate>Thu, 11 Jul 2024 09:16:09 GMT</pubDate><enclosure url="." length="0" type="false"/><content:encoded>&lt;figure&gt;&lt;img src=&quot;.&quot;&gt;&lt;/figure&gt;</content:encoded></item><item><title><![CDATA[chat_jan]]></title><description><![CDATA[<img src="online-vault/tutoriels/chat_jan.png" target="_self">]]></description><link>online-vault/tutoriels/chat_jan.html</link><guid isPermaLink="false">Online Vault/Tutoriels/chat_jan.png</guid><pubDate>Thu, 11 Jul 2024 09:13:22 GMT</pubDate><enclosure url="." 
length="0" type="false"/><content:encoded>&lt;figure&gt;&lt;img src=&quot;.&quot;&gt;&lt;/figure&gt;</content:encoded></item><item><title><![CDATA[proxy_params]]></title><description><![CDATA[<img src="online-vault/tutoriels/proxy_params.png" target="_self">]]></description><link>online-vault/tutoriels/proxy_params.html</link><guid isPermaLink="false">Online Vault/Tutoriels/proxy_params.png</guid><pubDate>Thu, 11 Jul 2024 09:01:55 GMT</pubDate><enclosure url="." length="0" type="false"/><content:encoded>&lt;figure&gt;&lt;img src=&quot;.&quot;&gt;&lt;/figure&gt;</content:encoded></item><item><title><![CDATA[fiches_models]]></title><description><![CDATA[<img src="online-vault/tutoriels/fiches_models.png" target="_self">]]></description><link>online-vault/tutoriels/fiches_models.html</link><guid isPermaLink="false">Online Vault/Tutoriels/fiches_models.png</guid><pubDate>Thu, 11 Jul 2024 08:45:59 GMT</pubDate><enclosure url="." length="0" type="false"/><content:encoded>&lt;figure&gt;&lt;img src=&quot;.&quot;&gt;&lt;/figure&gt;</content:encoded></item><item><title><![CDATA[hub ja]]></title><description><![CDATA[<img src="online-vault/tutoriels/hub-ja.png" target="_self">]]></description><link>online-vault/tutoriels/hub-ja.html</link><guid isPermaLink="false">Online Vault/Tutoriels/hub ja.png</guid><pubDate>Thu, 11 Jul 2024 08:35:32 GMT</pubDate><enclosure url="." length="0" type="false"/><content:encoded>&lt;figure&gt;&lt;img src=&quot;.&quot;&gt;&lt;/figure&gt;</content:encoded></item><item><title><![CDATA[jan_download]]></title><description><![CDATA[<img src="online-vault/tutoriels/jan_download.png" target="_self">]]></description><link>online-vault/tutoriels/jan_download.html</link><guid isPermaLink="false">Online Vault/Tutoriels/jan_download.png</guid><pubDate>Tue, 09 Jul 2024 15:17:40 GMT</pubDate><enclosure url="." 
length="0" type="false"/><content:encoded>&lt;figure&gt;&lt;img src=&quot;.&quot;&gt;&lt;/figure&gt;</content:encoded></item><item><title><![CDATA[install_Jan]]></title><description><![CDATA[<img src="online-vault/tutoriels/install_jan.png" target="_self">]]></description><link>online-vault/tutoriels/install_jan.html</link><guid isPermaLink="false">Online Vault/Tutoriels/install_Jan.png</guid><pubDate>Tue, 09 Jul 2024 15:16:44 GMT</pubDate><enclosure url="." length="0" type="false"/><content:encoded>&lt;figure&gt;&lt;img src=&quot;.&quot;&gt;&lt;/figure&gt;</content:encoded></item><item><title><![CDATA[visualstudio_cuda]]></title><description><![CDATA[<img src="online-vault/tutoriels/visualstudio_cuda.png" target="_self">]]></description><link>online-vault/tutoriels/visualstudio_cuda.html</link><guid isPermaLink="false">Online Vault/Tutoriels/visualstudio_cuda.png</guid><pubDate>Tue, 09 Jul 2024 15:08:45 GMT</pubDate><enclosure url="." length="0" type="false"/><content:encoded>&lt;figure&gt;&lt;img src=&quot;.&quot;&gt;&lt;/figure&gt;</content:encoded></item><item><title><![CDATA[cuda_toolkit]]></title><description><![CDATA[<img src="online-vault/tutoriels/cuda_toolkit.png" target="_self">]]></description><link>online-vault/tutoriels/cuda_toolkit.html</link><guid isPermaLink="false">Online Vault/Tutoriels/cuda_toolkit.png</guid><pubDate>Tue, 09 Jul 2024 15:00:11 GMT</pubDate><enclosure url="." length="0" type="false"/><content:encoded>&lt;figure&gt;&lt;img src=&quot;.&quot;&gt;&lt;/figure&gt;</content:encoded></item><item><title><![CDATA[Scale-space]]></title><link>online-vault/ml-concepts/scale-space.html</link><guid isPermaLink="false">Online Vault/ML concepts/Scale-space.md</guid><pubDate>Mon, 08 Jul 2024 15:48:54 GMT</pubDate></item><item><title><![CDATA[Model metrics]]></title><description><![CDATA[For imbalanced datasets that are spatially distributed, you have several options for scoring metrics that might be more appropriate than accuracy. 
Here are some alternatives you could consider: Balanced Accuracy:
scoring='balanced_accuracy' This is the average of recall obtained on each class. F1 Score:
scoring='f1' For binary classification. For multi-class, you can use 'f1_micro', 'f1_macro', or 'f1_weighted'. F1 Score for Multi-class Classification: F1 Micro: Calculation: Compute F1 globally by counting total true positives, false negatives, and false positives across all classes.
Use Case: Preferred when you want to weight each instance equally.
Scoring parameter: 'f1_micro' F1 Macro: Calculation: Compute F1 for each class independently and then take the unweighted mean.
Use Case: Gives equal importance to each class, regardless of its frequency.
Scoring parameter: 'f1_macro' F1 Weighted: Calculation: Compute F1 for each class and take the average weighted by the number of instances in each class.
Use Case: Accounts for class imbalance while still giving all classes some importance.
Scoring parameter: 'f1_weighted' ROC AUC:
scoring='roc_auc' Area Under the Receiver Operating Characteristic curve. For multi-class, use 'roc_auc_ovr' or 'roc_auc_ovo'.
ROC AUC for Multi-class Classification: ROC AUC OvR (One-vs-Rest): Calculation: Compute the ROC AUC for each class against all others, then average the results.
Use Case: When you're interested in the model's ability to distinguish each class from the rest.
Scoring parameter: 'roc_auc_ovr' ROC AUC OvO (One-vs-One): Calculation: Compute the ROC AUC for each pair of classes, then average the results.
Use Case: When you're interested in the model's ability to distinguish between each pair of classes.
Scoring parameter: 'roc_auc_ovo' ROC AUC OvR Weighted: Similar to ROC AUC OvR, but weighted by the class frequency.
Scoring parameter: 'roc_auc_ovr_weighted' Precision-Recall AUC:
scoring='average_precision' Area under the precision-recall curve. Matthews Correlation Coefficient:
scoring='matthews_corrcoef' A balanced measure that can be used even if the classes are of very different sizes (note: the scikit-learn scorer string is 'matthews_corrcoef'; 'mcc' is not a valid scorer name). Cohen's Kappa:
scoring=make_scorer(cohen_kappa_score) Measures agreement between predicted and observed categorizations; there is no predefined scorer string for it, so it must be wrapped with make_scorer. For spatially distributed data, you might also consider: Spatial Cross-Validation:
Instead of changing the scoring metric, you could implement a spatial cross-validation strategy to ensure your model generalizes well across different spatial regions. Custom Spatial Metric:
You could create a custom scoring function that incorporates spatial information, such as weighting errors based on spatial autocorrelation. To use multiple metrics simultaneously:
from sklearn.metrics import make_scorer, f1_score, balanced_accuracy_score

scoring = {
    'f1': 'f1',
    'balanced_accuracy': 'balanced_accuracy',
    'roc_auc': 'roc_auc'
}

halving_search = HalvingRandomSearchCV(
    estimator=clf,
    param_distributions=param_dist,
    cv=10,
    resource="n_estimators",
    max_resources=1000,
    min_resources=50,
    factor=2,
    verbose=2,
    scoring=scoring,
    refit='f1',  # choose which metric to use for selecting the best model
    random_state=42
)
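To see why these alternatives matter, here is a minimal hand-rolled comparison of plain accuracy versus balanced accuracy on an imbalanced toy example (pure Python; the labels are invented, and balanced_accuracy below mirrors scikit-learn's 'balanced_accuracy' definition as the mean of per-class recalls):

```python
def accuracy(y_true, y_pred):
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def balanced_accuracy(y_true, y_pred):
    # mean of per-class recalls, matching sklearn's 'balanced_accuracy'
    recalls = []
    for c in set(y_true):
        idx = [i for i, t in enumerate(y_true) if t == c]
        recalls.append(sum(y_pred[i] == c for i in idx) / len(idx))
    return sum(recalls) / len(recalls)

# 9 negatives, 1 positive; the classifier always predicts the majority class
y_true = [0] * 9 + [1]
y_pred = [0] * 10
print(accuracy(y_true, y_pred))           # 0.9 — looks good
print(balanced_accuracy(y_true, y_pred))  # 0.5 — reveals the problem
```

A degenerate majority-class predictor scores 90% accuracy but only 50% balanced accuracy, which is exactly the failure mode these alternative metrics are meant to expose.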
When using multiple metrics, you'll need to specify which one to use for selecting the best model with the refit parameter.]]></description><link>online-vault/ml-concepts/metrics/model-metrics.html</link><guid isPermaLink="false">Online Vault/ML concepts/Metrics/Model metrics.md</guid><pubDate>Mon, 08 Jul 2024 10:08:12 GMT</pubDate></item><item><title><![CDATA[CarteMonde_FR]]></title><description><![CDATA[<img src="online-vault/mind-maps/cartemonde_fr.png" target="_self">]]></description><link>online-vault/mind-maps/cartemonde_fr.html</link><guid isPermaLink="false">Online Vault/Mind maps/CarteMonde_FR.png</guid><pubDate>Thu, 27 Jun 2024 08:29:30 GMT</pubDate><enclosure url="." length="0" type="false"/><content:encoded>&lt;figure&gt;&lt;img src=&quot;.&quot;&gt;&lt;/figure&gt;</content:encoded></item><item><title><![CDATA[Organisation territoriale]]></title><description><![CDATA[<img src="online-vault/mind-maps/organisation-territoriale.png" target="_self">]]></description><link>online-vault/mind-maps/organisation-territoriale.html</link><guid isPermaLink="false">Online Vault/Mind maps/Organisation territoriale.png</guid><pubDate>Thu, 27 Jun 2024 08:27:18 GMT</pubDate><enclosure url="." length="0" type="false"/><content:encoded>&lt;figure&gt;&lt;img src=&quot;.&quot;&gt;&lt;/figure&gt;</content:encoded></item><item><title><![CDATA[organigramme_cnrs]]></title><description><![CDATA[<img src="online-vault/mind-maps/organigramme_cnrs.png" target="_self">]]></description><link>online-vault/mind-maps/organigramme_cnrs.html</link><guid isPermaLink="false">Online Vault/Mind maps/organigramme_cnrs.png</guid><pubDate>Thu, 27 Jun 2024 08:21:13 GMT</pubDate><enclosure url="." 
length="0" type="false"/><content:encoded>&lt;figure&gt;&lt;img src=&quot;.&quot;&gt;&lt;/figure&gt;</content:encoded></item><item><title><![CDATA[CNRS_Organigramme2024]]></title><link>online-vault/mind-maps/cnrs_organigramme2024.html</link><guid isPermaLink="false">Online Vault/Mind maps/CNRS_Organigramme2024.pdf</guid><pubDate>Thu, 27 Jun 2024 08:19:20 GMT</pubDate></item><item><title><![CDATA[Pasted image 20240626175721]]></title><description><![CDATA[<img src="online-vault/mind-maps/pasted-image-20240626175721.png" target="_self">]]></description><link>online-vault/mind-maps/pasted-image-20240626175721.html</link><guid isPermaLink="false">Online Vault/Mind maps/Pasted image 20240626175721.png</guid><pubDate>Wed, 26 Jun 2024 15:57:21 GMT</pubDate><enclosure url="." length="0" type="false"/><content:encoded>&lt;figure&gt;&lt;img src=&quot;.&quot;&gt;&lt;/figure&gt;</content:encoded></item><item><title><![CDATA[Investir_avenir]]></title><description><![CDATA[<img src="online-vault/mind-maps/investir_avenir.png" target="_self">]]></description><link>online-vault/mind-maps/investir_avenir.html</link><guid isPermaLink="false">Online Vault/Mind maps/Investir_avenir.png</guid><pubDate>Wed, 26 Jun 2024 15:55:01 GMT</pubDate><enclosure url="." length="0" type="false"/><content:encoded>&lt;figure&gt;&lt;img src=&quot;.&quot;&gt;&lt;/figure&gt;</content:encoded></item><item><title><![CDATA[Composantes]]></title><description><![CDATA[<img src="online-vault/mind-maps/composantes.png" target="_self">]]></description><link>online-vault/mind-maps/composantes.html</link><guid isPermaLink="false">Online Vault/Mind maps/Composantes.png</guid><pubDate>Wed, 26 Jun 2024 15:50:00 GMT</pubDate><enclosure url="." 
length="0" type="false"/><content:encoded>&lt;figure&gt;&lt;img src=&quot;.&quot;&gt;&lt;/figure&gt;</content:encoded></item><item><title><![CDATA[Pasted image 20240626174930]]></title><description><![CDATA[<img src="online-vault/mind-maps/pasted-image-20240626174930.png" target="_self">]]></description><link>online-vault/mind-maps/pasted-image-20240626174930.html</link><guid isPermaLink="false">Online Vault/Mind maps/Pasted image 20240626174930.png</guid><pubDate>Wed, 26 Jun 2024 15:49:30 GMT</pubDate><enclosure url="." length="0" type="false"/><content:encoded>&lt;figure&gt;&lt;img src=&quot;.&quot;&gt;&lt;/figure&gt;</content:encoded></item><item><title><![CDATA[plateformes]]></title><description><![CDATA[<img src="online-vault/mind-maps/plateformes.png" target="_self">]]></description><link>online-vault/mind-maps/plateformes.html</link><guid isPermaLink="false">Online Vault/Mind maps/plateformes.png</guid><pubDate>Wed, 26 Jun 2024 15:46:50 GMT</pubDate><enclosure url="." length="0" type="false"/><content:encoded>&lt;figure&gt;&lt;img src=&quot;.&quot;&gt;&lt;/figure&gt;</content:encoded></item><item><title><![CDATA[Pasted image 20240626174546]]></title><description><![CDATA[<img src="online-vault/mind-maps/pasted-image-20240626174546.png" target="_self">]]></description><link>online-vault/mind-maps/pasted-image-20240626174546.html</link><guid isPermaLink="false">Online Vault/Mind maps/Pasted image 20240626174546.png</guid><pubDate>Wed, 26 Jun 2024 15:45:46 GMT</pubDate><enclosure url="." length="0" type="false"/><content:encoded>&lt;figure&gt;&lt;img src=&quot;.&quot;&gt;&lt;/figure&gt;</content:encoded></item><item><title><![CDATA[Conseils]]></title><description><![CDATA[<img src="online-vault/mind-maps/conseils.png" target="_self">]]></description><link>online-vault/mind-maps/conseils.html</link><guid isPermaLink="false">Online Vault/Mind maps/Conseils.png</guid><pubDate>Wed, 26 Jun 2024 15:45:27 GMT</pubDate><enclosure url="." 
length="0" type="false"/><content:encoded>&lt;figure&gt;&lt;img src=&quot;.&quot;&gt;&lt;/figure&gt;</content:encoded></item><item><title><![CDATA[labex]]></title><description><![CDATA[<img src="online-vault/mind-maps/labex.png" target="_self">]]></description><link>online-vault/mind-maps/labex.html</link><guid isPermaLink="false">Online Vault/Mind maps/labex.png</guid><pubDate>Wed, 26 Jun 2024 15:07:58 GMT</pubDate><enclosure url="." length="0" type="false"/><content:encoded>&lt;figure&gt;&lt;img src=&quot;.&quot;&gt;&lt;/figure&gt;</content:encoded></item><item><title><![CDATA[tutelles]]></title><description><![CDATA[<img src="online-vault/mind-maps/tutelles.png" target="_self">]]></description><link>online-vault/mind-maps/tutelles.html</link><guid isPermaLink="false">Online Vault/Mind maps/tutelles.png</guid><pubDate>Wed, 26 Jun 2024 14:55:33 GMT</pubDate><enclosure url="." length="0" type="false"/><content:encoded>&lt;figure&gt;&lt;img src=&quot;.&quot;&gt;&lt;/figure&gt;</content:encoded></item><item><title><![CDATA[Pasted image 20240626165512]]></title><description><![CDATA[<img src="online-vault/mind-maps/pasted-image-20240626165512.png" target="_self">]]></description><link>online-vault/mind-maps/pasted-image-20240626165512.html</link><guid isPermaLink="false">Online Vault/Mind maps/Pasted image 20240626165512.png</guid><pubDate>Wed, 26 Jun 2024 14:55:12 GMT</pubDate><enclosure url="." length="0" type="false"/><content:encoded>&lt;figure&gt;&lt;img src=&quot;.&quot;&gt;&lt;/figure&gt;</content:encoded></item><item><title><![CDATA[logo_CNRS]]></title><description><![CDATA[<img src="online-vault/mind-maps/logo_cnrs.png" target="_self">]]></description><link>online-vault/mind-maps/logo_cnrs.html</link><guid isPermaLink="false">Online Vault/Mind maps/logo_CNRS.png</guid><pubDate>Wed, 26 Jun 2024 14:51:43 GMT</pubDate><enclosure url="." 
length="0" type="false"/><content:encoded>&lt;figure&gt;&lt;img src=&quot;.&quot;&gt;&lt;/figure&gt;</content:encoded></item><item><title><![CDATA[Ateliers_EVS]]></title><description><![CDATA[<img src="online-vault/mind-maps/ateliers_evs.png" target="_self">]]></description><link>online-vault/mind-maps/ateliers_evs.html</link><guid isPermaLink="false">Online Vault/Mind maps/Ateliers_EVS.png</guid><pubDate>Wed, 26 Jun 2024 14:23:25 GMT</pubDate><enclosure url="." length="0" type="false"/><content:encoded>&lt;figure&gt;&lt;img src=&quot;.&quot;&gt;&lt;/figure&gt;</content:encoded></item><item><title><![CDATA[Logo EVS]]></title><description><![CDATA[<img src="online-vault/mind-maps/logo-evs.png" target="_self">]]></description><link>online-vault/mind-maps/logo-evs.html</link><guid isPermaLink="false">Online Vault/Mind maps/Logo EVS.png</guid><pubDate>Wed, 26 Jun 2024 14:12:41 GMT</pubDate><enclosure url="." length="0" type="false"/><content:encoded>&lt;figure&gt;&lt;img src=&quot;.&quot;&gt;&lt;/figure&gt;</content:encoded></item><item><title><![CDATA[Shapley Value]]></title><description><![CDATA[Shapley Value in a NutshellThe Shapley value, borrowed from cooperative game theory, helps us understand the fair allocation of credit (or blame) among players (features) in a collaborative game (machine learning model). It calculates the average marginal contribution of each feature to the model's prediction across all possible feature combinations.More detail : <a data-tooltip-position="top" aria-label="https://www.youtube.com/watch?v=5-1lKFvV1i0" rel="noopener nofollow" class="external-link is-unresolved" href="https://www.youtube.com/watch?v=5-1lKFvV1i0" target="_self">Shapley Values Explained</a>Key Points for ML Scientists:
Feature Contributions: The Shapley value for a feature represents its average impact on the model's prediction when added to a random subset of other features already considered.
Feature Interactions: Unlike simpler methods that might overestimate or underestimate the importance of features based on their order, Shapley values account for feature interactions by considering all possible feature combinations.
Local vs. Global: Shapley values are typically calculated for individual data points (local interpretability), providing insight into how features influence a specific prediction. However, they can also be averaged across the dataset for a more global understanding of feature importance.
How Shapley Values Work:
Features as Players: Imagine each feature in your model as a player in a game.
Coalitional Games: Consider all possible combinations (coalitions) of features, from an empty set (no features) to the full set (all features).
Marginal Contribution: For each feature and each coalition, calculate the difference in the model's prediction when adding the feature to the existing coalition compared to the prediction without it. This represents the feature's marginal contribution in that specific context.
Weighted Average: Average the marginal contributions of a feature across all possible coalitions, weighted by the size (number of features) of each coalition. This weighted average is the Shapley value for that feature.
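The weighted average above can be computed exactly when the number of features is small. A minimal sketch, where the value function v (a hypothetical stand-in for the model's prediction restricted to a feature coalition) and the feature names are illustrative:

```python
from itertools import combinations
from math import factorial

def shapley_value(feature, all_features, v):
    """Exact Shapley value of `feature` given a value function v(coalition)."""
    n = len(all_features)
    others = [f for f in all_features if f != feature]
    total = 0.0
    for size in range(n):  # coalitions of every size that exclude `feature`
        for coalition in combinations(others, size):
            s = frozenset(coalition)
            # weight = |S|! * (n - |S| - 1)! / n!
            weight = factorial(size) * factorial(n - size - 1) / factorial(n)
            # marginal contribution of `feature` when added to this coalition
            total += weight * (v(s | {feature}) - v(s))
    return total

# Hypothetical additive game: each feature contributes a fixed amount,
# so each Shapley value should recover exactly that amount.
contrib = {"a": 1.0, "b": 2.0, "c": 3.0}
v = lambda S: sum(contrib[f] for f in S)

print(shapley_value("a", list(contrib), v))  # 1.0
```

Note the efficiency property: the Shapley values of all features sum to v(full coalition), here 6.0. The exponential number of coalitions is why real libraries rely on approximations.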
Benefits for Interpretability:
Feature Importance: Shapley values provide a principled way to assess the relative importance of features in influencing the model's prediction for a specific data point.
Feature Interaction Insights: By considering all combinations of features, Shapley values offer a more nuanced understanding of how features interact to produce the final prediction.
Limitations to Consider:
Computational Cost: Calculating Shapley values can be computationally expensive for complex models with many features, especially when exact methods are used. Approximation techniques are often employed to address this.
Model-Agnostic vs. Model-Specific: Shapley values are generally model-agnostic, meaning they can be applied to various machine learning models. However, for deeper understanding, incorporating model-specific knowledge can sometimes be beneficial.
Overall, Shapley values are a valuable tool for interpretability in machine learning, providing insights into feature importance and interactions that can aid in debugging models, understanding decision-making processes, and improving fairness and trust in ML applications.]]></description><link>online-vault/ml-concepts/data-analysis/shapley-value.html</link><guid isPermaLink="false">Online Vault/ML concepts/Data analysis/Shapley Value.md</guid><pubDate>Wed, 26 Jun 2024 13:59:15 GMT</pubDate></item><item><title><![CDATA[1x1_conv]]></title><description><![CDATA[<img src="online-vault/images/1x1_conv.png" target="_self">]]></description><link>online-vault/images/1x1_conv.html</link><guid isPermaLink="false">Online Vault/Images/1x1_conv.png</guid><pubDate>Thu, 13 Jun 2024 12:52:49 GMT</pubDate><enclosure url="." length="0" type="false"/><content:encoded>&lt;figure&gt;&lt;img src=&quot;.&quot;&gt;&lt;/figure&gt;</content:encoded></item><item><title><![CDATA[Grokking]]></title><description><![CDATA[<a data-tooltip-position="top" aria-label="https://arxiv.org/pdf/2201.02177" rel="noopener nofollow" class="external-link is-unresolved" href="https://arxiv.org/pdf/2201.02177" target="_self">GROKKING: GENERALIZATION BEYOND OVERFITTING ON SMALL ALGORITHMIC DATASETS</a><br>
<a data-tooltip-position="top" aria-label="https://proceedings.neurips.cc/paper_files/paper/2022/file/dfc310e81992d2e4cedc09ac47eff13e-Paper-Conference.pdf" rel="noopener nofollow" class="external-link is-unresolved" href="https://proceedings.neurips.cc/paper_files/paper/2022/file/dfc310e81992d2e4cedc09ac47eff13e-Paper-Conference.pdf" target="_self">Towards Understanding Grokking: An Effective Theory of Representation Learning</a>"Grokking" refers to the phenomenon where a machine learning model (such as a neural network) suddenly generalizes long after it has apparently overfit: validation performance jumps from near-chance to near-perfect many training steps after training accuracy has saturated. The term was introduced in the first paper linked above (Power et al., 2022). The word "grok" itself comes from the science fiction novel "Stranger in a Strange Land" by Robert A. Heinlein, in which it means to understand something deeply and intuitively, without needing to analyze or rationalize it. In machine learning, grokking is typically observed on small algorithmic datasets when a model is trained well past the point of memorization: at some point it stops merely memorizing the training set and learns a structure that generalizes to new, unseen data. Grokking is often associated with the concept of "understanding" or "insight" in machine learning, and is seen as a desirable outcome because it indicates the model has learned generalizable patterns rather than relying on simple memorization. 
However, grokking is still a poorly understood phenomenon, and researchers are still working to understand the mechanisms that underlie it.<br><img alt="Double descent.png" src="online-vault/images/double-descent.png" target="_self"><br>The "<a data-tooltip-position="top" aria-label="https://en.wikipedia.org/wiki/Double_descent" rel="noopener nofollow" class="external-link is-unresolved" href="https://en.wikipedia.org/wiki/Double_descent" target="_self">double descent</a>" phenomenon in deep learning refers to the observation that the generalization performance of a neural network can be non-monotonic in the model's capacity, measured by the number of parameters or the depth of the network. In general, increasing the capacity of a neural network can improve generalization, as it allows the model to learn more complex patterns in the data. In the classical picture, however, beyond a certain point further increasing the capacity hurts generalization, as the model overfits and starts to memorize the training data rather than learning generalizable patterns. Double descent is the more nuanced observation that test performance goes through two distinct phases:
First descent: as the capacity of the model increases, test error first falls and then rises again once the model starts to overfit, tracing the classical underfitting-to-overfitting U-curve.
Second descent: beyond the interpolation threshold, where the model is large enough to fit the training data perfectly, test error starts to fall again despite the ever-growing capacity, so heavily over-parameterized models can generalize better than moderately sized ones.
<br><img alt="double_descent_wikipedia.png" src="online-vault/images/double_descent_wikipedia.png" target="_self">Weight decay + SGD =&gt; faster grokking]]></description><link>online-vault/ml-concepts/grokking.html</link><guid isPermaLink="false">Online Vault/ML concepts/Grokking.md</guid><pubDate>Thu, 13 Jun 2024 11:51:55 GMT</pubDate><enclosure url="." length="0" type="false"/><content:encoded>&lt;figure&gt;&lt;img src=&quot;.&quot;&gt;&lt;/figure&gt;</content:encoded></item><item><title><![CDATA[double_descent_wikipedia]]></title><description><![CDATA[<img src="online-vault/images/double_descent_wikipedia.png" target="_self">]]></description><link>online-vault/images/double_descent_wikipedia.html</link><guid isPermaLink="false">Online Vault/Images/double_descent_wikipedia.png</guid><pubDate>Thu, 13 Jun 2024 09:58:18 GMT</pubDate><enclosure url="." length="0" type="false"/><content:encoded>&lt;figure&gt;&lt;img src=&quot;.&quot;&gt;&lt;/figure&gt;</content:encoded></item><item><title><![CDATA[Double descent]]></title><description><![CDATA[<img src="online-vault/images/double-descent.png" target="_self">]]></description><link>online-vault/images/double-descent.html</link><guid isPermaLink="false">Online Vault/Images/Double descent.png</guid><pubDate>Thu, 13 Jun 2024 09:48:25 GMT</pubDate><enclosure url="." length="0" type="false"/><content:encoded>&lt;figure&gt;&lt;img src=&quot;.&quot;&gt;&lt;/figure&gt;</content:encoded></item><item><title><![CDATA[Monosemanticity in LLMs]]></title><description><![CDATA[Paper links :
<a data-tooltip-position="top" aria-label="https://transformer-circuits.pub/2023/monosemantic-features" rel="noopener nofollow" class="external-link is-unresolved" href="https://transformer-circuits.pub/2023/monosemantic-features" target="_self">Towards Monosemanticity: Decomposing Language Models With Dictionary Learning</a><br>
<a data-tooltip-position="top" aria-label="https://transformer-circuits.pub/2024/scaling-monosemanticity/index.html" rel="noopener nofollow" class="external-link is-unresolved" href="https://transformer-circuits.pub/2024/scaling-monosemanticity/index.html" target="_self">Scaling Monosemanticity: Extracting Interpretable Features from Claude 3 Sonnet</a>The goal of these papers is to explore the feature space of the learned representations of an LLM and map the internal "mind" of the network.
Under the polysemanticity hypothesis, a single neuron encodes multiple unrelated features (equivalently, a learned feature is a combination of the activations of many neurons); the authors seek the cases where a single neuron encodes a single concept or feature, i.e. monosemanticity.<br>
One potential cause of polysemanticity is&nbsp;<a data-tooltip-position="top" aria-label="https://doi.org/10.23915/distill.00024.001" rel="noopener nofollow" class="external-link is-unresolved" href="https://doi.org/10.23915/distill.00024.001" target="_self">superposition</a>, a hypothesized phenomenon where a neural network represents more independent "features" of the data than it has neurons by assigning each feature its own linear combination of neurons.To achieve that analysis, they employ multiple techniques, such as dictionary learning and sparse autoencoders.]]></description><link>online-vault/papers/monosemanticity-in-llms.html</link><guid isPermaLink="false">Online Vault/Papers/Monosemanticity in LLMs.md</guid><pubDate>Tue, 04 Jun 2024 09:54:07 GMT</pubDate></item><item><title><![CDATA[EVS-newUJM]]></title><description><![CDATA[<img src="online-vault/images/evs-newujm.svg" target="_self">]]></description><link>online-vault/images/evs-newujm.html</link><guid isPermaLink="false">Online Vault/Images/EVS-newUJM.svg</guid><pubDate>Wed, 15 May 2024 14:21:04 GMT</pubDate><enclosure url="." length="0" type="false"/><content:encoded>&lt;figure&gt;&lt;img src=&quot;.&quot;&gt;&lt;/figure&gt;</content:encoded></item><item><title><![CDATA[EVS-newUJM]]></title><description><![CDATA[<img src="online-vault/images/evs-newujm.png" target="_self">]]></description><link>online-vault/images/evs-newujm.html</link><guid isPermaLink="false">Online Vault/Images/EVS-newUJM.png</guid><pubDate>Wed, 15 May 2024 14:20:10 GMT</pubDate><enclosure url="." 
length="0" type="false"/><content:encoded>&lt;figure&gt;&lt;img src=&quot;.&quot;&gt;&lt;/figure&gt;</content:encoded></item><item><title><![CDATA[pci_express]]></title><description><![CDATA[<img src="online-vault/images/pci_express.png" target="_self">]]></description><link>online-vault/images/pci_express.html</link><guid isPermaLink="false">Online Vault/Images/pci_express.png</guid><pubDate>Tue, 30 Apr 2024 09:35:06 GMT</pubDate><enclosure url="." length="0" type="false"/><content:encoded>&lt;figure&gt;&lt;img src=&quot;.&quot;&gt;&lt;/figure&gt;</content:encoded></item><item><title><![CDATA[nvidia_quadro_rtx4000]]></title><description><![CDATA[<img src="online-vault/images/nvidia_quadro_rtx4000.png" target="_self">]]></description><link>online-vault/images/nvidia_quadro_rtx4000.html</link><guid isPermaLink="false">Online Vault/Images/nvidia_quadro_rtx4000.png</guid><pubDate>Thu, 25 Apr 2024 13:45:24 GMT</pubDate><enclosure url="." length="0" type="false"/><content:encoded>&lt;figure&gt;&lt;img src=&quot;.&quot;&gt;&lt;/figure&gt;</content:encoded></item><item><title><![CDATA[dask_client]]></title><description><![CDATA[<img src="online-vault/images/dask_client.png" target="_self">]]></description><link>online-vault/images/dask_client.html</link><guid isPermaLink="false">Online Vault/Images/dask_client.png</guid><pubDate>Thu, 25 Apr 2024 13:31:37 GMT</pubDate><enclosure url="." length="0" type="false"/><content:encoded>&lt;figure&gt;&lt;img src=&quot;.&quot;&gt;&lt;/figure&gt;</content:encoded></item><item><title><![CDATA[Nvidia_Rapids]]></title><description><![CDATA[<img src="online-vault/images/nvidia_rapids.png" target="_self">]]></description><link>online-vault/images/nvidia_rapids.html</link><guid isPermaLink="false">Online Vault/Images/Nvidia_Rapids.png</guid><pubDate>Thu, 25 Apr 2024 13:18:19 GMT</pubDate><enclosure url="." 
length="0" type="false"/><content:encoded>&lt;figure&gt;&lt;img src=&quot;.&quot;&gt;&lt;/figure&gt;</content:encoded></item><item><title><![CDATA[Xarray]]></title><description><![CDATA[<img src="online-vault/images/xarray.png" target="_self">]]></description><link>online-vault/images/xarray.html</link><guid isPermaLink="false">Online Vault/Images/Xarray.png</guid><pubDate>Thu, 25 Apr 2024 13:16:37 GMT</pubDate><enclosure url="." length="0" type="false"/><content:encoded>&lt;figure&gt;&lt;img src=&quot;.&quot;&gt;&lt;/figure&gt;</content:encoded></item><item><title><![CDATA[Multicollinearity in data]]></title><description><![CDATA[Multicollinearity:
It refers to the situation in which two or more variables in a dataset are highly correlated. In a linear regression model, this leads to unstable coefficient estimates, and interpretation of the model becomes problematic. How gradient boosting works: when constructing a tree in a gradient boosting model, the algorithm makes a series of decisions. At each node, it decides which variable and which split point lead to the largest reduction in the impurity measure (such as Gini impurity or entropy). Why gradient boosting is not affected: each split is made independently of other splits, so the algorithm does not concern itself with the relationships between different variables. Even if two variables are highly correlated, this does not create the same problems as in linear regression: the algorithm will simply choose to split on the variable that provides the most predictive power. However, note that even though gradient boosting methods are not typically affected by multicollinearity, the presence of correlated predictors can still lead to two potential problems: interpretation difficulty and overfitting.
Feature importance could be incorrectly attributed in the presence of highly correlated features. As for overfitting, although tree-based models are less prone to it, boosting methods can indeed overfit with highly correlated features if not correctly regularized and trained. Deep learning models, much like gradient boosting models, are not significantly impacted by multicollinearity. These models are primarily designed to handle complex, high-dimensional data, so the interrelationships between predictors do not pose the same issues as they might in regression analysis.
In fact, deep learning can accommodate and even capitalize on relationships between variables in order to uncover complex patterns. However, similar to other models, the presence of redundant inputs can still lead to overfitting, and interpreting such models can be challenging, especially when highly correlated inputs are present. It can also inflate the size of the model with no real benefit.]]></description><link>online-vault/ml-concepts/data-analysis/multicollinearity-in-data.html</link><guid isPermaLink="false">Online Vault/ML concepts/Data analysis/Multicollinearity in data.md</guid><pubDate>Thu, 18 Apr 2024 14:22:18 GMT</pubDate></item><item><title><![CDATA[Encoding No data]]></title><description><![CDATA[The question of encoding data using the correct data type is a tricky one: there are multiple ways to encode missing values, from +/-inf values to NaN (Not a Number) values. Data processing and analysis libraries such as Pandas or NumPy use different encodings. While pandas supports storing arrays of integer and boolean type, these types cannot store missing data; until pandas can rely on a native NA type in NumPy, it applies a set of "casting rules", so a reindexing operation that introduces missing data will cast the Series accordingly. Pandas (pd for short) and NumPy (np for short) have different implementations for null values.
See this : <a data-tooltip-position="top" aria-label="https://pandas.pydata.org/docs/user_guide/missing_data.html" rel="noopener nofollow" class="external-link is-unresolved" href="https://pandas.pydata.org/docs/user_guide/missing_data.html" target="_self">Pandas missing data</a>
Since NaN or inf cannot be encoded in an int data type, pandas will convert the column to float or object if your data contains np.nan / NaN. In other words, missing data is represented by numpy.nan for NumPy data types. The disadvantage of using NumPy data types is that the original data type will be coerced to np.float64 or object. pandas.NA is displayed as &lt;NA&gt;; there also exists None, but equality does not behave the same as with np.nan:In [14]: None == None  # noqa: E711
Out[14]: True
In [15]: np.nan == np.nan
Out[15]: False
In [16]: pd.NaT == pd.NaT
Out[16]: False
In [17]: pd.NA == pd.NA
Out[17]: &lt;NA&gt;
pd.NA is still experimental and may change in the future. It is used in the nullable dtypes for integers and floating point numbers: Int32 and Int64, Float32 and Float64.
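A short sketch of the difference between plain and nullable integer dtypes in pandas (the values are illustrative):

```python
import numpy as np
import pandas as pd

# A plain int column cannot hold NaN: pandas silently upcasts to float64.
s_float = pd.Series([1, 2, np.nan])
print(s_float.dtype)  # float64

# The nullable Int64 dtype keeps integers and represents missing values as <NA>.
s_int = pd.Series([1, 2, None], dtype="Int64")
print(s_int.dtype)            # Int64
print(s_int.isna().tolist())  # [False, False, True]
```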
<br>Starting from pandas 1.0, an experimental <a data-tooltip-position="top" aria-label="https://pandas.pydata.org/docs/reference/api/pandas.NA.html#pandas.NA" rel="noopener nofollow" class="external-link is-unresolved" title="pandas.NA" href="https://pandas.pydata.org/docs/reference/api/pandas.NA.html#pandas.NA" target="_self"><code>NA</code></a> value (singleton) is available to represent scalar missing values. The goal of NA is to provide a "missing" indicator that can be used consistently across data types (instead of np.nan, None or pd.NaT depending on the data type).<br>more info : <a data-tooltip-position="top" aria-label="https://www.datasciencebyexample.com/2022/08/24/2022-08-24-1/" rel="noopener nofollow" class="external-link is-unresolved" href="https://www.datasciencebyexample.com/2022/08/24/2022-08-24-1/" target="_self">Understand the NaN and None difference in Pandas once for all</a>
np.nan allows for vectorized operations; it's a float value, while None, by definition, forces object type, which basically disables all efficiency in numpy.<br>The main reason to use NaN (over None) is that it can be stored with numpy's float64 dtype, rather than the less efficient object dtype, see <a data-tooltip-position="top" aria-label="https://pandas.pydata.org/pandas-docs/dev/user_guide/gotchas.html#na-type-promotions" rel="noopener nofollow" class="external-link is-unresolved" href="https://pandas.pydata.org/pandas-docs/dev/user_guide/gotchas.html#na-type-promotions" target="_self">NA type promotions</a>.The distinction between None and NaN in Pandas can be summarized as:
None represents a missing entry, but its type is not numeric: a column (a pandas Series) that actually stores None values has the non-numeric object dtype (in numeric columns, pandas converts None to NaN on construction).
NaN, which stands for Not a Number, is on the other hand a numeric (float) type. This means that NaN can live in a numeric column; an int column that receives a NaN is coerced to float.
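The None/NaN distinction above can be checked directly; a small sketch with NumPy and pandas:

```python
import numpy as np
import pandas as pd

# np.nan is a float, so it fits in an efficient float64 array.
print(np.array([1.0, np.nan]).dtype)  # float64

# None is a Python object, so NumPy falls back to the slow object dtype.
print(np.array([1.0, None]).dtype)    # object

# In a numeric pandas Series, None is converted to NaN and the dtype stays numeric.
s = pd.Series([1, 2, None])
print(s.dtype)  # float64
```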
In short, always use np.nan by default ]]></description><link>online-vault/ml-concepts/data-analysis/encoding-no-data.html</link><guid isPermaLink="false">Online Vault/ML concepts/Data analysis/Encoding No data.md</guid><pubDate>Thu, 18 Apr 2024 09:29:46 GMT</pubDate></item><item><title><![CDATA[Data types for Spatial Data Science]]></title><description><![CDATA[<img alt="Data_types.png" src="online-vault/images/data_types.png" target="_self">
There are many datatypes in the wild that can be used to encode numerical values or categories.
In our case, we have 3 main types of data : Categorical data for biotope labels : "312474" as a String. With pandas, we can also use .astype('category') to convert to categorical values directly.
Integers for categorical numerical data : exposition for example goes from 0 to 9, which can be represented as Integers
Floating point values for continuous variables such as angot : we use Float. To encode measurable numbers, we separate into two main categories : Integers, unsigned (&gt;= 0) or signed (+ &amp; -), used to represent integer variables ; Floating point numbers, which represent continuous variables with floating point decimals. To encode text, a single symbol is represented as a char and a sequence of characters is represented as a String, which is technically an array of char.<br>
For one byte/octet (8 bits), we have 256 possible values ; most of the symbols we use fit within those values, the first 128 of which form the <a data-tooltip-position="top" aria-label="https://en.wikipedia.org/wiki/ASCII" rel="noopener nofollow" class="external-link is-unresolved" href="https://en.wikipedia.org/wiki/ASCII" target="_self">ASCII Table</a>.Floating-point details:
Float32 and Float64 represent numbers with decimal points but have limited precision.
<br>Their exact representable range depends on the system architecture (32-bit vs 64-bit) and the <a data-tooltip-position="top" aria-label="https://en.wikipedia.org/wiki/IEEE_754" rel="noopener nofollow" class="external-link is-unresolved" href="https://en.wikipedia.org/wiki/IEEE_754" target="_self">IEEE 754 floating-point standard</a>.
The provided ranges are a common approximation for these data types.
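Rather than memorizing the approximate ranges, they can be queried programmatically; a quick sketch with NumPy:

```python
import numpy as np

# Integer ranges are exact and fixed by the bit width and signedness.
print(np.iinfo(np.int32).min, np.iinfo(np.int32).max)  # -2147483648 2147483647
print(np.iinfo(np.uint8).min, np.iinfo(np.uint8).max)  # 0 255

# Floating-point "ranges" are about magnitude and precision (IEEE 754).
print(np.finfo(np.float32).max)  # largest float32, ~3.4e38
print(np.finfo(np.float64).eps)  # relative spacing at 1.0, ~2.2e-16
```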
<br>In practice however, the encoding data types vary by programming language, country and other constraints. Some data libraries such as <a data-tooltip-position="top" aria-label="https://pandas.pydata.org/docs/index.html#" rel="noopener nofollow" class="external-link is-unresolved" href="https://pandas.pydata.org/docs/index.html#" target="_self">Pandas</a> offer new data types, such as Int32 (as opposed to int32) that can hold missing values as well, allowing sparse DataFrames to be encoded. Pandas also offers a <a data-tooltip-position="top" aria-label="https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.astype.html" rel="noopener nofollow" class="external-link is-unresolved" href="https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.astype.html" target="_self">category datatype for categorical data</a>.<br><img alt="visual_ex_categorical.png" src="online-vault/images/visual_ex_categorical.png" target="_self">Nominal categories are those you cannot order ; they are qualitative measurements.
Sensory measurements such as color or smell can be subjective and impossible to quantify.
For example : animal type = (cat, dog, bird) is a nominal categorical variable.<br>
They are the most difficult type of data for machine learning as their encoding usually involves <a data-tooltip-position="top" aria-label="https://en.wikipedia.org/wiki/One-hot" rel="noopener nofollow" class="external-link is-unresolved" href="https://en.wikipedia.org/wiki/One-hot" target="_self">one-hot encoding</a> that takes up a lot of space and is not practical with some algorithms such as decision trees.
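As a sketch of one-hot encoding and of pandas' category dtype (the animal-type column is illustrative):

```python
import pandas as pd

df = pd.DataFrame({"animal": ["cat", "dog", "bird", "cat"]})

# One-hot encoding: one indicator column per category.
one_hot = pd.get_dummies(df["animal"], prefix="animal")
print(one_hot.columns.tolist())  # ['animal_bird', 'animal_cat', 'animal_dog']

# The category dtype stores each label once and the rows as small integer codes.
cat = df["animal"].astype("category")
print(cat.cat.categories.tolist())  # ['bird', 'cat', 'dog']
print(cat.cat.codes.tolist())       # [1, 2, 0, 1]
```

The category dtype is usually far more compact than one-hot columns, which is why tree-based libraries often prefer it.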
Other ways to encode nominal entities exist, such as label encoding, target encoding and others. Ordinal data is a qualitative variable that has a hierarchical meaning. Ranking mood for example from 1 to 5 means that person A having mood=2 is less happy than person B having mood=4, so that 2 &lt; 4. As such, it is easier to encode into an integer value, as the inherent order within the numbers reflects the variable, contrary to nominal data.]]></description><link>online-vault/spatial-data-science/data-types-for-spatial-data-science.html</link><guid isPermaLink="false">Online Vault/Spatial Data Science/Data types for Spatial Data Science.md</guid><pubDate>Wed, 17 Apr 2024 13:51:05 GMT</pubDate><enclosure url="." length="0" type="false"/><content:encoded>&lt;figure&gt;&lt;img src=&quot;.&quot;&gt;&lt;/figure&gt;</content:encoded></item><item><title><![CDATA[Spearman's Correlation]]></title><description><![CDATA[Spearman's correlation, also called Spearman's rank correlation coefficient, is a statistical measure used to assess the monotonic relationship between two variables. Unlike <a data-href="Pearson's correlation" href=".html" class="internal-link" target="_self" rel="noopener nofollow">Pearson's correlation</a>, which looks for linear relationships, Spearman's correlation is non-parametric, meaning it doesn't make assumptions about the data distribution.<br><a data-tooltip-position="top" aria-label="https://en.wikipedia.org/wiki/Spearman%27s_rank_correlation_coefficient" rel="noopener nofollow" class="external-link is-unresolved" href="https://en.wikipedia.org/wiki/Spearman%27s_rank_correlation_coefficient" target="_self">Wikipedia definition</a> :<br>
In&nbsp;<a data-tooltip-position="top" aria-label="https://en.wikipedia.org/wiki/Statistics" rel="noopener nofollow" class="external-link is-unresolved" title="Statistics" href="https://en.wikipedia.org/wiki/Statistics" target="_self">statistics</a>,&nbsp;Spearman's rank correlation coefficient&nbsp;or&nbsp;Spearman's&nbsp;ρ, named after&nbsp;Charles Spearman&nbsp;and often denoted by the Greek letter Rho.It is a&nbsp;<a data-tooltip-position="top" aria-label="https://en.wikipedia.org/wiki/Nonparametric_statistics" rel="noopener nofollow" class="external-link is-unresolved" title="Nonparametric statistics" href="https://en.wikipedia.org/wiki/Nonparametric_statistics" target="_self">nonparametric</a>&nbsp;measure of&nbsp;<a data-tooltip-position="top" aria-label="https://en.wikipedia.org/wiki/Rank_correlation" rel="noopener nofollow" class="external-link is-unresolved" title="Rank correlation" href="https://en.wikipedia.org/wiki/Rank_correlation" target="_self">rank correlation</a>&nbsp;(<a data-tooltip-position="top" aria-label="https://en.wikipedia.org/wiki/Correlation_and_dependence" rel="noopener nofollow" class="external-link is-unresolved" title="Correlation and dependence" href="https://en.wikipedia.org/wiki/Correlation_and_dependence" target="_self">statistical dependence</a>&nbsp;between the&nbsp;<a data-tooltip-position="top" aria-label="https://en.wikipedia.org/wiki/Ranking" rel="noopener nofollow" class="external-link is-unresolved" title="Ranking" href="https://en.wikipedia.org/wiki/Ranking" target="_self">rankings</a>&nbsp;of two&nbsp;<a data-tooltip-position="top" aria-label="https://en.wikipedia.org/wiki/Variable_(mathematics)#Applied_statistics" rel="noopener nofollow" class="external-link is-unresolved" title="Variable (mathematics)" href="https://en.wikipedia.org/wiki/Variable_(mathematics)#Applied_statistics" target="_self">variables</a>). 
It assesses how well the relationship between two variables can be described using a&nbsp;<a data-tooltip-position="top" aria-label="https://en.wikipedia.org/wiki/Monotonic_function" rel="noopener nofollow" class="external-link is-unresolved" title="Monotonic function" href="https://en.wikipedia.org/wiki/Monotonic_function" target="_self">monotonic function</a>.<br>The Spearman correlation between two variables is equal to the&nbsp;<a data-tooltip-position="top" aria-label="https://en.wikipedia.org/wiki/Pearson_product-moment_correlation_coefficient" rel="noopener nofollow" class="external-link is-unresolved" title="Pearson product-moment correlation coefficient" href="https://en.wikipedia.org/wiki/Pearson_product-moment_correlation_coefficient" target="_self">Pearson correlation</a>&nbsp;between the rank values of those two variables; while Pearson's correlation assesses linear relationships, Spearman's correlation assesses monotonic relationships (whether linear or not). If there are no repeated data values, a perfect Spearman correlation of +1 or −1 occurs when each of the variables is a perfect monotone function of the other.<br>Intuitively, the Spearman correlation between two variables will be high when observations have a similar (or identical for a correlation of 1)&nbsp;<a data-tooltip-position="top" aria-label="https://en.wikipedia.org/wiki/Ranking_(statistics)" rel="noopener nofollow" class="external-link is-unresolved" title="Ranking (statistics)" href="https://en.wikipedia.org/wiki/Ranking_(statistics)" target="_self">rank</a>&nbsp;(i.e. relative position label of the observations within the variable: 1st, 2nd, 3rd, etc.) 
between the two variables, and low when observations have a dissimilar (or fully opposed for a correlation of −1) rank between the two variables.<br>Spearman's coefficient is appropriate for both <a data-tooltip-position="top" aria-label="https://en.wikipedia.org/wiki/Continuous_variable" rel="noopener nofollow" class="external-link is-unresolved" title="Continuous variable" href="https://en.wikipedia.org/wiki/Continuous_variable" target="_self">continuous</a> and discrete <a data-tooltip-position="top" aria-label="https://en.wikipedia.org/wiki/Ordinal_variable" rel="noopener nofollow" class="external-link is-unresolved" title="Ordinal variable" href="https://en.wikipedia.org/wiki/Ordinal_variable" target="_self">ordinal variables</a>. Both Spearman's ρ and <a data-tooltip-position="top" aria-label="https://en.wikipedia.org/wiki/Kendall_tau_rank_correlation_coefficient" rel="noopener nofollow" class="external-link is-unresolved" title="Kendall tau rank correlation coefficient" href="https://en.wikipedia.org/wiki/Kendall_tau_rank_correlation_coefficient" target="_self">Kendall's τ</a> can be formulated as special cases of a more <a data-tooltip-position="top" aria-label="https://en.wikipedia.org/wiki/General_correlation_coefficient" rel="noopener nofollow" class="external-link is-unresolved" title="General correlation coefficient" href="https://en.wikipedia.org/wiki/General_correlation_coefficient" target="_self">general correlation coefficient</a>.<br>Here's a breakdown of what Spearman's correlation tells you:
Monotonic Relationship: It measures how well the ranking of one variable corresponds to the ranking of another. Imagine ordering two lists of values, one for each variable. A positive correlation (values closer to 1) indicates that as the values in one list increase (or decrease), the values in the other tend to follow suit (increase or decrease as well). A negative correlation (values closer to -1) means they move in opposite directions (one increases as the other decreases). A correlation close to 0 suggests no clear relationship between the rankings.
Strength of the Relationship: The coefficient itself (usually denoted by the Greek letter rho (ρ)) ranges from -1 to 1. The closer the value is to 1 or -1, the stronger the monotonic relationship, positive or negative respectively. A value around 0 indicates a weak or no relationship between the rankings.
When to Use Spearman's Correlation:
Ordinal Data: When your data is ranked or ordinal (e.g., customer satisfaction ratings, exam grades), Spearman's correlation is a good choice because it focuses on the order rather than the specific values.
Non-Linear Relationships: If you suspect a non-linear connection between your variables, Spearman's correlation can identify that connection, unlike Pearson's correlation which is limited to linear relationships.
In essence, Spearman's correlation tells you the direction and strength of the order, regardless of whether the relationship is perfectly linear or not. This makes it a versatile tool for analyzing relationships in various scenarios.In the context of Spearman's correlation for classification models, ranking of variables refers to ranking the data instances (samples) based on the model's predictions. It's not directly concerned with ranking the features (variables) themselves.Here's how it's applied to measure similarity of classification models:
Make Predictions: Let's say you have two classification models (Model A and Model B) you want to compare. You run both models on the same dataset.
Rank the Instances: For each model, instead of just considering the predicted class label (e.g., cat or dog), you look at the confidence score (or probability) associated with each prediction. This score indicates how certain the model is about its prediction. Now, rank all the data instances (samples) in the dataset based on their confidence scores for each model separately. The instance with the highest confidence score for a particular class gets ranked 1st, the second highest gets ranked 2nd, and so on.
Spearman's Correlation: Finally, you calculate the Spearman's correlation coefficient between the two ranked lists (one from Model A and one from Model B). This coefficient reflects how similar the order (ranking) of the instances is between the two models. High positive correlation (close to 1): The models tend to agree on the ranking of most instances, suggesting they make similar predictions.
Low positive correlation (closer to 0): The models have some agreement on the ranking but not a strong one. They might prioritize some instances differently.
Negative correlation (close to -1): The models often disagree on the ranking, indicating they prioritize instances quite differently.
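The comparison described above can be sketched with SciPy's spearmanr (the two models and their confidence scores are invented for illustration):

```python
import numpy as np
from scipy.stats import spearmanr

# Hypothetical confidence scores from Model A and Model B on the same samples
scores_a = np.array([0.95, 0.80, 0.40, 0.90, 0.10, 0.60])
scores_b = np.array([0.90, 0.70, 0.30, 0.85, 0.20, 0.50])

# spearmanr ranks each array internally, then computes Pearson's r on the ranks
rho, p_value = spearmanr(scores_a, scores_b)
# Here both models rank the samples identically, so rho is 1.0
```

A rho near 1 means the two models order the samples almost identically, even if their absolute scores differ.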
Why Ranking and Not Just Predictions?Classification models often output probabilities for each class. Directly comparing these probabilities can be misleading, especially if the models have different biases or decision boundaries. By focusing on the ranking based on confidence scores, Spearman's correlation captures how well the models agree on the relative order of instances, even if their absolute probability scores differ.This approach provides a more nuanced understanding of how similar the models are in their decision-making process.]]></description><link>online-vault/ml-concepts/data-analysis/spearman&apos;s-correlation.html</link><guid isPermaLink="false">Online Vault/ML concepts/Data analysis/Spearman&apos;s Correlation.md</guid><pubDate>Mon, 15 Apr 2024 09:53:20 GMT</pubDate><enclosure url="https://wikimedia.org/api/rest_v1/media/math/render/svg/38a7dcde9730ef0853809fefc18d88771f95206c" length="0" type="false"/><content:encoded>&lt;figure&gt;&lt;img src=&quot;https://wikimedia.org/api/rest_v1/media/math/render/svg/38a7dcde9730ef0853809fefc18d88771f95206c&quot;&gt;&lt;/figure&gt;</content:encoded></item><item><title><![CDATA[visual_ex_categorical]]></title><description><![CDATA[<img src="online-vault/images/visual_ex_categorical.png" target="_self">]]></description><link>online-vault/images/visual_ex_categorical.html</link><guid isPermaLink="false">Online Vault/Images/visual_ex_categorical.png</guid><pubDate>Thu, 11 Apr 2024 13:41:11 GMT</pubDate><enclosure url="." length="0" type="false"/><content:encoded>&lt;figure&gt;&lt;img src=&quot;.&quot;&gt;&lt;/figure&gt;</content:encoded></item><item><title><![CDATA[Data_types]]></title><description><![CDATA[<img src="online-vault/images/data_types.png" target="_self">]]></description><link>online-vault/images/data_types.html</link><guid isPermaLink="false">Online Vault/Images/Data_types.png</guid><pubDate>Thu, 11 Apr 2024 13:37:45 GMT</pubDate><enclosure url="." 
length="0" type="false"/><content:encoded>&lt;figure&gt;&lt;img src=&quot;.&quot;&gt;&lt;/figure&gt;</content:encoded></item><item><title><![CDATA[XGBoost]]></title><description><![CDATA[This <a data-href="decision tree" href="online-vault/ml-concepts/models/decision-tree.html" class="internal-link" target="_self" rel="noopener nofollow">decision tree</a> model is called extreme because it improves the efficiency and speed of regular <a data-href="Gradient Boosting" href="online-vault/ml-concepts/models/gradient-boosting.html" class="internal-link" target="_self" rel="noopener nofollow">Gradient Boosting</a>, on which it's based, by a large factor.The XGBoost library has a GPU-accelerated implementation; simply specify tree_method="gpu_hist" instead of "hist" (the CPU method). Note that for small datasets, CPU training may be faster since it is well optimized.
clf = xgb.XGBClassifier(tree_method="gpu_hist")
Other important parameters to tune are the following:
<br>
n_estimators: This is the number of trees you want to build before taking the maximum voting or averages of predictions. Higher number of trees gives you better performance but makes your code slower. It is often set to a large value and early stopping is used to roll back the model to the one with the best performance&nbsp;<a data-tooltip-position="top" aria-label="https://www.datasnips.com/blog/2021/7/11/XGBoost-Parameter-Tuning/" rel="noopener nofollow" class="external-link is-unresolved" href="https://www.datasnips.com/blog/2021/7/11/XGBoost-Parameter-Tuning/" target="_self">3</a>. <br>
max_depth: This is the maximum depth of a tree. The maximum depth limits the number of nodes in the tree. Tune this parameter for best performance; the best value depends on the interaction of the input variables.&nbsp;<a data-tooltip-position="top" aria-label="https://www.datasnips.com/blog/2021/7/11/XGBoost-Parameter-Tuning/" rel="noopener nofollow" class="external-link is-unresolved" href="https://www.datasnips.com/blog/2021/7/11/XGBoost-Parameter-Tuning/" target="_self">3</a>. <br>
learning_rate: It is used to prevent overfitting. After boosting, the model will be a weighted sum of weak prediction. The&nbsp;learning_rate&nbsp;shrinks the feature weights to make the boosting process more conservative. The smaller the learning rate, the more conservative the algorithm will be&nbsp;<a data-tooltip-position="top" aria-label="https://www.datasnips.com/blog/2021/7/11/XGBoost-Parameter-Tuning/" rel="noopener nofollow" class="external-link is-unresolved" href="https://www.datasnips.com/blog/2021/7/11/XGBoost-Parameter-Tuning/" target="_self">3</a>. <br>
subsample: This is the subsample ratio of the training instance. Setting it to 0.5 means that XGBoost randomly collects half of the data instances to grow trees and this will prevent overfitting&nbsp;<a data-tooltip-position="top" aria-label="https://www.datasnips.com/blog/2021/7/11/XGBoost-Parameter-Tuning/" rel="noopener nofollow" class="external-link is-unresolved" href="https://www.datasnips.com/blog/2021/7/11/XGBoost-Parameter-Tuning/" target="_self">3</a>. <br>
colsample_bytree: This is the subsample ratio of columns when constructing each tree, ie the fraction of the features to be used at each step of the tree building process. This is useful to control over-fitting. After a tree is built, this tree will not change, so we can use a smaller subset to construct the tree. The smaller the value, the more conservative the algorithm will be&nbsp;<a data-tooltip-position="top" aria-label="https://www.datasnips.com/blog/2021/7/11/XGBoost-Parameter-Tuning/" rel="noopener nofollow" class="external-link is-unresolved" href="https://www.datasnips.com/blog/2021/7/11/XGBoost-Parameter-Tuning/" target="_self">3</a>. If&nbsp;colsample_bytree&nbsp;is set to 1, then all features are used at each step. If it's set to 0.5, then half of the features are used at each step. <br>
min_child_weight: This parameter controls the minimum sum of instance weight (hessian) needed in a child. In other words, it roughly determines the minimum number of instances that must be in a node in order for the node to be split. If the sum of instance weights in a node is less than&nbsp;min_child_weight, then the node will not be split. This parameter is used to control overfitting by preventing the model from splitting nodes that have too few instances. The larger the&nbsp;min_child_weight, the more conservative the algorithm will be, meaning it will be more resistant to overfitting&nbsp;<a data-tooltip-position="top" aria-label="https://xgboost.readthedocs.io/en/stable/parameter.html" rel="noopener nofollow" class="external-link is-unresolved" href="https://xgboost.readthedocs.io/en/stable/parameter.html" target="_self">2</a>,&nbsp;<a data-tooltip-position="top" aria-label="https://stats.stackexchange.com/questions/317073/explanation-of-min-child-weight-in-xgboost-algorithm" rel="noopener nofollow" class="external-link is-unresolved" href="https://stats.stackexchange.com/questions/317073/explanation-of-min-child-weight-in-xgboost-algorithm" target="_self">1</a>. In the context of the XGBoost model, a "child" refers to a node in the decision tree. Each node in the tree is a "parent" of the nodes that it branches off to, and these "child" nodes are the ones that the&nbsp;min_child_weight&nbsp;parameter controls. 
min_child_weight thus helps to prevent the model from creating overly complex trees that may fit the training data too closely, but not the test data.It is possible to use XGBoost for spatial data, but it’s not exactly the same as how convolutional neural networks (CNNs) handle spatial data.<br><a data-tooltip-position="top" aria-label="https://www.hindawi.com/journals/jat/2021/5559562/" rel="noopener nofollow" class="external-link is-unresolved" href="https://www.hindawi.com/journals/jat/2021/5559562/" target="_self">XGBoost is a gradient boosting framework that can handle various types of structured data, including spatial data</a><a data-tooltip-position="top" aria-label="https://www.hindawi.com/journals/jat/2021/5559562/" rel="noopener nofollow" class="external-link is-unresolved" href="https://www.hindawi.com/journals/jat/2021/5559562/" target="_self">1</a>. However, unlike CNNs, which can inherently handle spatial data due to their convolutional nature, XGBoost does not directly consider the spatial relationships between features.In the context of spatial data, such as maps, XGBoost would treat each pixel or spatial unit as an independent feature. If you want to include spatial relationships or dependencies between different spatial units (like how CNNs do), you would need to engineer these features yourself. For example, you could create new features that capture the relationships between a pixel and its neighboring pixels.<br>There is research on using XGBoost with spatial data. 
<a data-tooltip-position="top" aria-label="https://www.hindawi.com/journals/jat/2021/5559562/" rel="noopener nofollow" class="external-link is-unresolved" href="https://www.hindawi.com/journals/jat/2021/5559562/" target="_self">For instance, a study on traffic flow prediction used XGBoost to predict traffic states by utilizing the origin-destination relationship of segment flow data between upstream and downstream on the highway</a><a data-tooltip-position="top" aria-label="https://www.hindawi.com/journals/jat/2021/5559562/" rel="noopener nofollow" class="external-link is-unresolved" href="https://www.hindawi.com/journals/jat/2021/5559562/" target="_self">1</a>. This is an example of how spatial relationships can be incorporated into the XGBoost model.Remember, the key to using XGBoost with spatial data effectively is feature engineering. You need to create meaningful features that capture the spatial relationships in your data. This might require domain knowledge and a good understanding of your data.To sample fixed-size 50x50-pixel training images from a GeoTIFF stack of 50 variables using Python, you can use the rasterio and numpy libraries. Here’s a basic example of how you might do this:
import rasterio
import numpy as np

# Open the GeoTIFF file
with rasterio.open('your_file.tif') as src:
    # Read the whole stack into a 3D numpy array (bands, height, width)
    img_stack = src.read()

# Define the size of the patches
patch_size = 50

# Get the dimensions of the image stack
num_bands, height, width = img_stack.shape

# Calculate the number of patches that fit in the height and width
num_patches_height = height // patch_size
num_patches_width = width // patch_size

# Initialize a list to hold the patches
patches = []

# Loop over the image to extract the patches
for i in range(num_patches_height):
    for j in range(num_patches_width):
        # Extract a patch covering all bands
        patch = img_stack[:, i*patch_size:(i+1)*patch_size, j*patch_size:(j+1)*patch_size]
        patches.append(patch)
The loop can also be written as a list comprehension:
patches = [img_stack[:, i*patch_size:(i+1)*patch_size, j*patch_size:(j+1)*patch_size] for i in range(num_patches_height) for j in range(num_patches_width)]
The list comprehension iterates over each possible patch in the image, just like the double for loop, but in a single expression.
This code will give you a list of 3D numpy arrays, each of size (50, 50, 50), representing a 50x50 pixel patch from each of the 50 bands in your GeoTIFF stack.Please replace 'your_file.tif' with the path to your GeoTIFF file. Also, this is a simple example and doesn’t handle cases where the height and width of the image are not perfectly divisible by the patch size. You might need to adjust it based on your specific needs.Remember to install the necessary libraries by running pip install rasterio numpy in your Python environment.For the specific code snippet you provided, here are a few potential optimizations:
Preallocate the list of patches: If you know the number of patches in advance, you can preallocate the list with&nbsp;None&nbsp;values and then fill in each patch. This can be faster than appending to the list.
Use a generator: If you don’t need all patches in memory at once, you can use a generator to create patches on the fly.
Use NumPy slicing: If your image stack is a NumPy array, use NumPy’s built-in slicing to extract patches, which is faster than using a loop.
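The generator idea above can be sketched as follows (the function name and the small synthetic array are illustrative; only one patch is materialized at a time beyond the source stack):

```python
import numpy as np

def iter_patches(img_stack, patch_size=50):
    """Yield (bands, patch_size, patch_size) patches one at a time."""
    num_bands, height, width = img_stack.shape
    for i in range(height // patch_size):
        for j in range(width // patch_size):
            yield img_stack[:, i*patch_size:(i+1)*patch_size,
                               j*patch_size:(j+1)*patch_size]

# Small synthetic stack: 3 bands, 100x100 pixels -> four 50x50 patches
stack = np.zeros((3, 100, 100), dtype=np.float32)
patches = list(iter_patches(stack))
```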
A batch size of 100,000 seems to be a good value for 8 GB of GPU memory:<br><img alt="gpu_predict_batchsize.png" src="online-vault/images/gpu_predict_batchsize.png" target="_self">(Bard answer)
XGBoost's classification process differs slightly when the expected output is a string compared to a number. This is because XGBoost uses a different loss function for categorical classification tasks, which involves predicting probabilities for each class instead of just a single class label.When predicting a number, XGBoost typically employs the mean squared error (MSE) loss function, which aims to minimize the squared difference between the predicted and actual values. However, for categorical classification, the algorithm utilizes the cross-entropy loss function, which measures the average difference between the predicted probability distribution and the true distribution of classes.This difference in loss functions leads to a slightly different optimization process for categorical classification. XGBoost still constructs a decision tree ensemble, but the splitting criteria and error calculations are adjusted to better handle categorical data.In addition, when dealing with string labels, XGBoost may require additional preprocessing steps to ensure the data is suitable for the classification process. This could involve converting the strings to numerical representations using techniques like one-hot encoding or label encoding.Overall, XGBoost can effectively handle both numerical and categorical classification tasks, but it employs slightly different strategies for each type of output. The cross-entropy loss function and additional preprocessing steps are crucial for accurate classification when dealing with strings.On using softmax instead :Yes, you can manually specify that XGBoost should use the softmax loss function instead of the cross-entropy loss function for categorical classification tasks. However, this is not recommended unless you have a specific reason for doing so. The cross-entropy loss function is the standard choice for categorical classification, and it is well-optimized by XGBoost. 
Using the softmax loss function may lead to slightly different results, but it is unlikely to be significantly better than the cross-entropy loss function.In fact, the softmax loss function is not even the default choice for multi-class classification in XGBoost. The default objective function for multi-class classification is "multi:softprob", which also uses cross-entropy loss but outputs probability estimates for each class rather than just a single class label. This can be useful for tasks like predicting class probabilities or performing multi-class classification with ensemble models.If you do decide to use the softmax loss function for categorical classification, you should be aware that it is not as well-supported by XGBoost as the cross-entropy loss function. For example, XGBoost does not provide any built-in evaluation metrics for the softmax loss function. You will need to calculate these metrics yourself or use a custom evaluation metric.Overall, it is generally better to stick with the cross-entropy loss function for categorical classification in XGBoost unless you have a specific reason to use the softmax loss function. The cross-entropy loss function is well-optimized by XGBoost and is more widely supported.Here are some reasons why softmax is not recommended for XGBoost for categorical classification tasks:
Cross-entropy loss function is well-optimized: XGBoost is specifically designed to minimize cross-entropy loss, making it more efficient and accurate for categorical classification. The algorithm utilizes various techniques to optimize the loss function, including the gradient descent algorithm and regularization methods.
Standard choice for categorical classification: Cross-entropy loss is the standard choice for categorical classification tasks, and it is widely accepted in the machine learning community. This consistency ensures that XGBoost models are easily comparable to other models trained with the same loss function.
Output probability estimates: The default objective function for multi-class classification in XGBoost, "multi:softprob", also uses cross-entropy loss but outputs probability estimates for each class. This is a more informative output compared to just the predicted class label, making it useful for tasks like predicting class probabilities or performing multi-class classification with ensemble models.
Limited compatibility: Softmax loss is not as well-supported by XGBoost as cross-entropy loss. For instance, there are no built-in evaluation metrics for the softmax loss function, requiring additional calculations or custom metrics. This can be inconvenient and hinder the evaluation process.
Potential for inconsistencies: While softmax can be used with XGBoost, it may lead to inconsistencies in predictions compared to the cross-entropy loss function. This is because the optimization process for the two loss functions differs slightly.
In summary, while softmax is not strictly prohibited in XGBoost, the cross-entropy loss function is generally preferred due to its superior optimization, widespread acceptance, and compatibility with the default objective function. Using softmax may introduce additional complexities and inconsistencies, making it less recommended for practical applications.XGBoost can handle missing values (NaN) internally during the tree building process. It doesn't require explicit imputation of missing values before training. Here's a breakdown of how XGBoost deals with NaNs:1. Missing Value Detection:XGBoost automatically identifies missing values during training. It recognizes features with missing values based on a pre-defined missing value indicator (typically NaN).2. Splitting on Missing Values:When building a decision tree, XGBoost considers the presence or absence of missing values in a feature as a potential split point. It evaluates the information gain of splitting the data based on whether a value is missing or not.3. Best Split Determination:XGBoost chooses the split point that leads to the best separation of data points based on the objective function (e.g., minimizing classification error). This might involve sending data points with missing values to one branch of the tree and those with valid values to another.4. Surrogate Splits:In some cases, XGBoost might create "surrogate splits" for missing values. These are splits based on another feature that can act as a proxy for the feature with missing values. This helps improve the model's ability to handle missing data.5. Internal Handling:The specific details of how XGBoost handles missing values during splitting and tree building are part of its internal algorithm. However, it doesn't require users to explicitly impute or encode missing values before training.Advantages of XGBoost's Missing Value Handling:
Automatic Detection: No need for manual identification of missing values.
Flexibility: XGBoost can learn from the missingness itself, potentially capturing patterns in how missing values relate to other features.
Less Data Preprocessing: Saves time and effort compared to manual imputation or encoding.
However, it's important to note:
Random Splits: Missing value splits can sometimes introduce randomness into the tree building process.
Performance Impact: Depending on the amount and distribution of missing data, XGBoost's performance might be affected.
Alternatives for Handling Missing Values:While XGBoost can handle missing values internally, you might still consider alternative approaches in specific scenarios:
High Proportion of Missing Values: If a feature has a very high percentage of missing values, it might be better to remove that feature altogether.
Domain Knowledge: If you have domain knowledge about missing values, you could use specific imputation techniques (e.g., mean/median imputation).
In conclusion, XGBoost offers a convenient way to handle missing values during training. However, it's valuable to understand its behavior and consider alternative approaches if necessary.]]></description><link>online-vault/ml-concepts/models/xgboost.html</link><guid isPermaLink="false">Online Vault/ML concepts/Models/XGBoost.md</guid><pubDate>Tue, 09 Apr 2024 09:38:23 GMT</pubDate><enclosure url="." length="0" type="false"/><content:encoded>&lt;figure&gt;&lt;img src=&quot;.&quot;&gt;&lt;/figure&gt;</content:encoded></item><item><title><![CDATA[Spatial Data Analysis]]></title><description><![CDATA[While classical <a data-href="Data Analysis" href="online-vault/ml-concepts/data-analysis/data-analysis.html" class="internal-link" target="_self" rel="noopener nofollow">Data Analysis</a> assumes that the samples in the data are independent and identically distributed or iid, Spatial Data Analysis takes into account the <a data-href="spatial autocorrelation" href="online-vault/spatial-data-science/spatial-autocorrelation.html" class="internal-link" target="_self" rel="noopener nofollow">spatial autocorrelation</a> present in the data that states : nearby points are more similar than distant points in most cases.<br>Nonspatial sampling results in over-optimistic predictive models (caused by the spatial correlations) that can predict the input training data accurately but have marginal performance in terms of extrapolation (i.e., predicting patterns that have not been seen during the training process) <a data-tooltip-position="top" aria-label="https://d-nb.info/1243415991/34" rel="noopener nofollow" class="external-link is-unresolved" href="https://d-nb.info/1243415991/34" target="_self">A Truly Spatial Random Forests Algorithm for Geoscience Data Analysis and Modelling</a> Regular machine learning algorithms treat each location like an isolated point, only considering the features themselves to make predictions. 
This is like looking at individual pixels in an image without considering the bigger picture.In spatial data analysis, particularly when dealing with images or maps, we often use terms related to both the location and the spectral characteristics of the data points. Here's a breakdown :
Pixel-wise Spectral Information: Pixel-wise:&nbsp;This refers to analyzing each individual pixel (the smallest unit) in the data. Imagine a map as an image; each tiny square on the map is a pixel.
Spectral Information:&nbsp;This refers to the specific characteristics of the data measured at each pixel. These characteristics can be related to wavelengths of light, chemical composition, or other properties that can be captured by sensors. For example, in a satellite image, pixel-wise spectral information would tell you the specific color value (red, green, blue, etc.) for each pixel.
Local Spatial-Spectral Information: Local:&nbsp;This refers to analyzing a small neighborhood around a specific pixel, not just the pixel itself. Imagine looking at a few pixels surrounding the one you're interested in, like a small box on the map.
Spatial-Spectral:&nbsp;This combines both spatial information (location) and spectral information (characteristics). It considers how the spectral properties of a pixel are related to the spectral properties of its neighbors. This field deals with analyzing data that has a spatial component, meaning it's associated with locations. Spatial statistics look for patterns, trends, and relationships between data points based on their location.There are two kinds of statistics, parametric or non-parametric :
Parametric methods&nbsp;assume the data follows a specific probability distribution (like a normal distribution). They rely on estimating the parameters of that distribution to understand the data.
Nonparametric methods&nbsp;make fewer assumptions about the underlying distribution of the data. They focus on directly analyzing the patterns in the data itself, without needing to fit a specific model.
The order of a statistic here refers to the number of points considered together for analysis:
First-order =&gt; 1 point aka pixel-wise
Second order =&gt; 2 points, pairs of data
Higher-order =&gt; 3+ points<br>By using nonparametric higher-order statistics, researchers can capture more complex spatial patterns that might not be evident with simpler methods. This can be particularly useful in fields like ecology, geology, and urban planning where understanding the spatial relationships between features is crucial.]]></description><link>online-vault/spatial-data-science/spatial-data-analysis.html</link><guid isPermaLink="false">Online Vault/Spatial Data Science/Spatial Data Analysis.md</guid><pubDate>Wed, 03 Apr 2024 14:39:06 GMT</pubDate></item><item><title><![CDATA[Kriging]]></title><description><![CDATA[ Kriging is a powerful geostatistical interpolation method for <a data-href="Spatial Data Analysis" href="online-vault/spatial-data-science/spatial-data-analysis.html" class="internal-link" target="_self" rel="noopener nofollow">Spatial Data Analysis</a>. This elegant algorithm tackles the crucial task of spatial interpolation, predicting values at unsampled locations based on known data points. Unlike simpler methods like nearest-neighbor interpolation, Kriging incorporates the spatial relationships between your data points, leading to more accurate and nuanced predictions.Here's the essence of Kriging:
Leveraging Spatial Autocorrelation: Kriging builds a semivariogram, a statistical tool that captures how similar your data points are based on their distance. This helps quantify the spatial trends within your data.
Weighted Averaging: At an unknown location, Kriging predicts the value by taking a weighted average of the values at nearby known locations. These weights are determined based on the semivariogram, ensuring closer points have a stronger influence.
Uncertainty Estimation: Unlike a simple guess, Kriging provides an uncertainty measure for each prediction. This tells you how confident you can be in the predicted value, accounting for the inherent variability in your data.
Different Kriging variants:&nbsp;Various types of Kriging exist,&nbsp;each suited for different scenarios based on assumptions about the data (stationarity,&nbsp;known mean,&nbsp;trend presence).&nbsp;Common variants include: Ordinary Kriging (OK):&nbsp;Assumes a constant mean and known semivariogram.
Universal Kriging (UK):&nbsp;Accounts for a linear trend in the data while considering spatial autocorrelation.
Kriging with Trends (KT):&nbsp;Similar to UK but allows more flexible trend models. Now, let's explore some exciting use cases:
Environmental Science: Predicting pollutant concentrations across a city, estimating soil quality variations in a field, or mapping groundwater levels.
Geology &amp; Mining: Modeling mineral deposits, interpolating ore grades, or predicting subsurface geological features.
Precision Agriculture: Optimizing fertilizer application based on soil nutrient maps, predicting crop yields across a field, or managing water resources efficiently.
Public Health: Studying the spread of infectious diseases, modeling air quality patterns, or predicting heatwave intensities in different urban areas.
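The weighted-averaging idea can be sketched directly in NumPy. This minimal Ordinary Kriging assumes a fixed exponential semivariogram (the sill and range values are made up, not fitted from data, and the sample points are illustrative):

```python
import numpy as np

def gamma(h, sill=1.0, rng=10.0):
    # Exponential semivariogram model (assumed, not fitted from the data)
    return sill * (1.0 - np.exp(-h / rng))

def ordinary_kriging(coords, values, target):
    """Predict the value at `target` as a weighted average of known points."""
    n = len(values)
    d = np.linalg.norm(coords[:, None, :] - coords[None, :, :], axis=-1)
    A = np.ones((n + 1, n + 1))
    A[:n, :n] = gamma(d)          # semivariances between known points
    A[n, n] = 0.0                 # Lagrange-multiplier row/column for sum(w) = 1
    b = np.ones(n + 1)
    b[:n] = gamma(np.linalg.norm(coords - target, axis=1))
    w = np.linalg.solve(A, b)     # kriging weights plus Lagrange multiplier
    return float(w[:n] @ values)

coords = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0], [1.0, 1.0]])
values = np.array([1.0, 2.0, 3.0, 4.0])
# By symmetry, predicting at the centre weights all four points equally
est = ordinary_kriging(coords, values, np.array([0.5, 0.5]))
```

Real workflows first fit the semivariogram to the data (typically with a geostatistics library) instead of assuming its parameters.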
Key advantages of Kriging:
Accuracy: Incorporates spatial trends, leading to more accurate predictions compared to simpler methods.
Uncertainty quantification: Provides confidence intervals for predictions, crucial for decision-making.
Flexibility: Different variations of Kriging exist, each suited to specific data types and assumptions.
]]></description><link>online-vault/spatial-data-science/kriging.html</link><guid isPermaLink="false">Online Vault/Spatial Data Science/Kriging.md</guid><pubDate>Wed, 03 Apr 2024 14:29:23 GMT</pubDate></item><item><title><![CDATA[hf_token_colab_secret]]></title><description><![CDATA[<img src="online-vault/images/hf_token_colab_secret.png" target="_self">]]></description><link>online-vault/images/hf_token_colab_secret.html</link><guid isPermaLink="false">Online Vault/Images/hf_token_colab_secret.png</guid><pubDate>Thu, 21 Mar 2024 16:06:10 GMT</pubDate><enclosure url="." length="0" type="false"/><content:encoded>&lt;figure&gt;&lt;img src=&quot;.&quot;&gt;&lt;/figure&gt;</content:encoded></item><item><title><![CDATA[random_sampling_strats]]></title><description><![CDATA[<img src="online-vault/images/random_sampling_strats.png" target="_self">]]></description><link>online-vault/images/random_sampling_strats.html</link><guid isPermaLink="false">Online Vault/Images/random_sampling_strats.png</guid><pubDate>Wed, 20 Mar 2024 16:30:21 GMT</pubDate><enclosure url="." length="0" type="false"/><content:encoded>&lt;figure&gt;&lt;img src=&quot;.&quot;&gt;&lt;/figure&gt;</content:encoded></item><item><title><![CDATA[spatial_kfold]]></title><description><![CDATA[<img src="online-vault/images/spatial_kfold.png" target="_self">]]></description><link>online-vault/images/spatial_kfold.html</link><guid isPermaLink="false">Online Vault/Images/spatial_kfold.png</guid><pubDate>Wed, 20 Mar 2024 14:19:38 GMT</pubDate><enclosure url="." length="0" type="false"/><content:encoded>&lt;figure&gt;&lt;img src=&quot;.&quot;&gt;&lt;/figure&gt;</content:encoded></item><item><title><![CDATA[Correlogram]]></title><description><![CDATA[A correlogram in spatial data science is a visual tool used to explore spatial autocorrelation. It's a graph that depicts the relationship between the similarity of observations and the distance separating them. 
In simpler terms, it shows how values of a variable at different locations are correlated with each other. Here's the key concept: spatial autocorrelation means nearby observations tend to have similar values (high or low) compared to distant ones. The correlogram helps you identify the distance range where this spatial dependence exists. Key features of a correlogram:
X-axis:&nbsp;Represents the distance (or lag) between observations.
Y-axis:&nbsp;Represents a measure of spatial autocorrelation, like <a data-href="Moran's I" href="online-vault/spatial-data-science/moran's-i.html" class="internal-link" target="_self" rel="noopener nofollow">Moran's I</a> or <a data-href="Geary's C" href="online-vault/spatial-data-science/geary's-c.html" class="internal-link" target="_self" rel="noopener nofollow">Geary's C</a>.
The plot:&nbsp;Shows how the autocorrelation statistic changes with increasing distance.
By looking at the correlogram, you can see if there's:
Positive spatial autocorrelation:&nbsp;Nearby observations have similar values (clustering).
Negative spatial autocorrelation:&nbsp;Nearby observations have dissimilar values (dispersion).
No spatial autocorrelation:&nbsp;Values are randomly distributed across space.
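As a concrete illustration, here is a minimal NumPy sketch of a correlogram: Moran's I computed within successive distance bands. The synthetic coordinates, values, and bin edges are assumptions for the example, not part of any particular library.

```python
import numpy as np

def morans_i(values, weights):
    """Moran's I for a binary spatial weights matrix."""
    z = values - values.mean()
    num = (weights * np.outer(z, z)).sum() / weights.sum()
    return num / (z @ z / len(values))

def correlogram(coords, values, bins):
    """Moran's I per distance band (the correlogram's y-values)."""
    d = np.linalg.norm(coords[:, None] - coords[None, :], axis=-1)
    stats = []
    for lo, hi in zip(bins[:-1], bins[1:]):
        w = ((d > lo) & (d <= hi)).astype(float)  # neighbours in this band
        stats.append(morans_i(values, w) if w.sum() else np.nan)
    return stats

rng = np.random.default_rng(0)
coords = rng.uniform(0, 10, size=(100, 2))
values = coords[:, 0] + rng.normal(0, 0.5, 100)  # west-east trend => clustering
I = correlogram(coords, values, bins=np.array([0, 1, 2, 4, 8, 15]))
# Plotting I against the band midpoints gives the correlogram:
# positive at short distances, decaying (or turning negative) at long ones.
```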
A regular correlation matrix, commonly used in statistics, deals with relationships between two variables across all observations, regardless of their location. It tells you how changes in one variable are associated with changes in another. In contrast, a correlogram focuses on a single variable within a spatial context. It explores how the value of that variable at one location is related to the values at other locations, considering the spatial separation (distance) between them. Here's an analogy:
Correlation matrix:&nbsp;Like comparing income levels of people from different professions (doctor vs teacher), regardless of where they live.
Correlogram:&nbsp;Like comparing income levels within a neighborhood, to see if there's a tendency for houses closer together to have similar or dissimilar income values.
In essence, a correlogram is a specialized tool for spatial data analysis, providing insights into the spatial structure of a single variable.]]></description><link>online-vault/spatial-data-science/correlogram.html</link><guid isPermaLink="false">Online Vault/Spatial Data Science/Correlogram.md</guid><pubDate>Tue, 19 Mar 2024 15:25:46 GMT</pubDate></item><item><title><![CDATA[spatial autocorrelation]]></title><description><![CDATA[Spatial autocorrelation is a statistical phenomenon that describes the tendency for similar values of a variable to occur in close proximity to each other. It is measured by comparing the values of a variable at different locations to assess whether they are more similar to their neighbors than they are to random locations.What is "auto" in spatial autocorrelation?The prefix "auto" means "self" or "same". In spatial autocorrelation, it refers to the fact that the correlation is between values of the same variable at different locations. This is in contrast to cross-correlation, which is the correlation between values of different variables at the same location.Here are some examples of spatial autocorrelation:
Temperature: Temperatures tend to be more similar in close proximity than they are far apart.
Rainfall: Rainfall tends to be more clustered in space than it is random.
Population density: Population density tends to be higher in urban areas and lower in rural areas.
Spatial autocorrelation can be measured using a variety of statistical methods, such as <a data-href="Moran's I" href="online-vault/spatial-data-science/moran's-i.html" class="internal-link" target="_self" rel="noopener nofollow">Moran's I</a> and <a data-href="Geary's C" href="online-vault/spatial-data-science/geary's-c.html" class="internal-link" target="_self" rel="noopener nofollow">Geary's C</a>. These methods are typically used to assess the strength and direction of spatial autocorrelation. Applications of spatial autocorrelation: it is used in many different fields, including:
Geology: Spatial autocorrelation can be used to identify patterns in geological features, such as faults and fractures.
Ecology: Spatial autocorrelation can be used to study the distribution of plants and animals.
Urban planning: Spatial autocorrelation can be used to identify areas of high and low crime rates.
Public health: Spatial autocorrelation can be used to study the spread of disease.
Spatial autocorrelation is a powerful tool for understanding the spatial patterns of data. By understanding spatial autocorrelation, we can better understand the processes that create these patterns.<br>To avoid training <a data-href="Machine Learning" href="online-vault/ml-concepts/machine-learning.html" class="internal-link" target="_self" rel="noopener nofollow">Machine Learning</a> models on spatially autocorrelated samples, there are multiple ways to separate the data in feature space as well as in geographic space. One of them is <a data-href="Spatial Cross-validation" href="online-vault/spatial-data-science/spatial-cross-validation.html" class="internal-link" target="_self" rel="noopener nofollow">Spatial Cross-validation</a>, which separates the cross-validation folds into spatially independent entities to avoid such spatial correlation.<br>
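A minimal sketch of spatial cross-validation using scikit-learn's GroupKFold, where hypothetical grid-cell blocks stand in for spatially independent entities (the coordinates and block size are assumptions for the example):

```python
import numpy as np
from sklearn.model_selection import GroupKFold

rng = np.random.default_rng(0)
coords = rng.uniform(0, 100, size=(200, 2))

# Assign each sample to a spatial block (here: a 25x25 grid cell).
blocks = (coords[:, 0] // 25).astype(int) * 4 + (coords[:, 1] // 25).astype(int)

# GroupKFold keeps whole blocks together, so train and test folds
# never share samples from the same spatial neighbourhood.
cv = GroupKFold(n_splits=4)
for train_idx, test_idx in cv.split(coords, groups=blocks):
    assert set(blocks[train_idx]).isdisjoint(blocks[test_idx])
```

Replacing the grid cells with clusters from a spatial clustering step gives the same guarantee for irregularly shaped regions.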
More details can be found here: <a data-tooltip-position="top" aria-label="https://www.sciencedirect.com/science/article/pii/S2667393222000072#bib30" rel="noopener nofollow" class="external-link is-unresolved" href="https://www.sciencedirect.com/science/article/pii/S2667393222000072#bib30" target="_self">Spatially autocorrelated training and validation samples inflate performance assessment of convolutional neural networks</a>, for the particular case of CNNs.<br>There are many ways to measure spatial autocorrelation, one of them being a <a data-href="Correlogram" href="online-vault/spatial-data-science/correlogram.html" class="internal-link" target="_self" rel="noopener nofollow">Correlogram</a>.
A correlogram in spatial data science is a visual tool used to explore spatial autocorrelation. It's a graph that depicts the relationship between the similarity of observations and the distance separating them. In simpler terms, it shows how values of a variable at different locations are correlated with each other.]]></description><link>online-vault/spatial-data-science/spatial-autocorrelation.html</link><guid isPermaLink="false">Online Vault/Spatial Data Science/spatial autocorrelation.md</guid><pubDate>Tue, 19 Mar 2024 10:01:52 GMT</pubDate></item><item><title><![CDATA[Data preprocessing]]></title><description><![CDATA[Data preprocessing is a crucial step that is done on data before doing <a data-href="Data Analysis" href="online-vault/ml-concepts/data-analysis/data-analysis.html" class="internal-link" target="_self" rel="noopener nofollow">Data Analysis</a> or before feeding it to models in <a data-href="Machine Learning" href="online-vault/ml-concepts/machine-learning.html" class="internal-link" target="_self" rel="noopener nofollow">Machine Learning</a> or <a data-href="deep learning" href="online-vault/ml-concepts/deep-learning.html" class="internal-link" target="_self" rel="noopener nofollow">deep learning</a>. It usually involves steps such as normalization or standardization, dealing with outliers, missing data or reducing data dimensionality or cardinality through dimensionality reduction techniques such as <a data-href="Principal Component Analysis" href="online-vault/ml-concepts/dimensionality-reduction/principal-component-analysis.html" class="internal-link" target="_self" rel="noopener nofollow">Principal Component Analysis</a>, <a data-href="UMAP" href="online-vault/ml-concepts/dimensionality-reduction/umap.html" class="internal-link" target="_self" rel="noopener nofollow">UMAP</a> or <a data-href="T-SNE" href="online-vault/ml-concepts/dimensionality-reduction/t-sne.html" class="internal-link" target="_self" rel="noopener nofollow">T-SNE</a> . 
PCA illustration:<br><img alt="PCA.png" src="online-vault/images/pca.png" target="_self">UMAP illustration:<br><img alt="Umap.png" src="online-vault/images/umap.png" target="_self">
T-SNE illustration:<br><img alt="tsne_ex.png" src="online-vault/images/tsne_ex.png" target="_self">These techniques make certain assumptions about the underlying data:
<br>PCA is sensitive to the variances of the initial variables, so if the variables are on different scales, PCA might not work as expected. Therefore, standardization is often preferred before applying PCA because it ensures that each variable contributes equally to the principal components, making the PCA results more interpretable and reliable. In practice, if you're dealing with data that has a normal or <a data-href="Gaussian distribution" href=".html" class="internal-link" target="_self" rel="noopener nofollow">Gaussian distribution</a>, standardization is a good choice. If your data is not normally distributed, you might need to consider other preprocessing steps or transformations to make it suitable for PCA.
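The standardize-then-PCA advice can be sketched as a scikit-learn pipeline; the synthetic two-feature data with mismatched scales is an assumption for illustration:

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
# Two strongly correlated features on wildly different scales
# (think metres vs kilometres).
x = rng.normal(size=300)
X = np.column_stack([
    x * 1000 + rng.normal(size=300),
    x + rng.normal(scale=0.1, size=300),
])

# Standardising first lets both features contribute equally, so the
# first principal component captures their shared (correlated) direction.
pca = make_pipeline(StandardScaler(), PCA(n_components=2)).fit(X)
explained = pca.named_steps["pca"].explained_variance_ratio_
```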
Uniform Manifold Approximation and Projection (UMAP) is a dimension reduction technique that can be used for visualisation similarly to t-SNE, but also for general non-linear dimension reduction. The algorithm is founded on three assumptions about the data:<br>The data is uniformly distributed on a <a data-href="Riemannian manifold" href="online-vault/ml-concepts/topology/riemannian-manifold.html" class="internal-link" target="_self" rel="noopener nofollow">Riemannian manifold</a>;
<br>The <a data-href="Riemannian metric" href=".html" class="internal-link" target="_self" rel="noopener nofollow">Riemannian metric</a> is locally constant (or can be approximated as such);
<br>The <a data-href="Manifold" href="online-vault/ml-concepts/topology/manifold.html" class="internal-link" target="_self" rel="noopener nofollow">Manifold</a> is locally connected. t-SNE is a non-parametric algorithm, which means that it does not make many assumptions about the data or the way that features are related. If you are working with a dataset that has many features that are not linearly related, you may be better off using t-SNE than another algorithm that makes stronger assumptions about the structure of the input data.
Normalization and standardization are two different preprocessing techniques used in data analysis and machine learning to bring data onto a common scale. The primary difference between the two lies in the method of scaling and their objectives. Normalization&nbsp;scales the data to a fixed range, usually [0, 1]. It is also known as Min-Max scaling. This method is useful when you want to bring all variables onto the same scale, but it does not take into account the distribution of the data. This means that normalization might not be suitable for data with skewed distributions, as it does not change the shape of the distribution.<br>Standardization, on the other hand, scales the data based on the mean and standard deviation of the data, resulting in a distribution with a mean of 0 and a standard deviation of 1. This method is useful when you want to ensure that all variables contribute equally to the model, regardless of their original scale. It is particularly useful for algorithms that are sensitive to the scale of the input features, such as <a data-href="Support Vector Machines" href=".html" class="internal-link" target="_self" rel="noopener nofollow">Support Vector Machines</a> (SVMs), <a data-href="K-Nearest Neighbors" href=".html" class="internal-link" target="_self" rel="noopener nofollow">K-Nearest Neighbors</a> (KNN), and <a data-href="Principal Component Analysis" href="online-vault/ml-concepts/dimensionality-reduction/principal-component-analysis.html" class="internal-link" target="_self" rel="noopener nofollow">Principal Component Analysis</a> (PCA). In other words, standardization (or Z-score normalization) rescales the features so that they have&nbsp;μ=0&nbsp;and&nbsp;σ=1. Drawbacks: normalization is sensitive to outliers, so if the data set contains extreme values it is a poor choice. 
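The outlier sensitivity is easy to demonstrate with scikit-learn's MinMaxScaler and StandardScaler; the toy single-feature data with one extreme value is an assumption for the example:

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# One extreme outlier (1000) in an otherwise small-valued feature.
X = np.array([[1.0], [2.0], [3.0], [4.0], [1000.0]])

mm = MinMaxScaler().fit_transform(X)
std = StandardScaler().fit_transform(X)

# Min-max squashes the four inliers into a tiny sliver near 0
# (roughly [0, 0.001, 0.002, 0.003]) while the outlier pins the value 1 ...
print(mm.ravel())
# ... whereas standardisation keeps the inliers distinguishable around the mean.
print(std.ravel())
```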
Standardization produces values that are not bounded to a fixed range (unlike normalization). In short, normalization is sensitive to outliers and should be avoided if the dataset contains extreme values, while standardization is more robust to outliers and can create a more representative distribution. When dealing with non-normally distributed data with linear correlations between features, two main dimensionality reduction techniques are suitable: 1. Principal Component Analysis (PCA):
Assumptions: While PCA works best with normally distributed data, it can still be effectively used for dimensionality reduction even if the data doesn't follow a normal distribution, especially if the primary goal is to capture linear relationships between features. This is because PCA focuses on maximizing variance, which often aligns with linear correlations.
Process: PCA identifies the principal components (PCs) that capture the most variance in the data. These PCs represent uncorrelated directions of maximum variance. By selecting a subset of the top PCs, you can achieve dimensionality reduction while preserving the information about linear correlations.
Limitations: PCA is sensitive to scaling. Features with larger scales will have a greater impact on PCs, regardless of their underlying relationships. Therefore, standardization or normalization is crucial before applying PCA to ensure each feature contributes equally.
2. Partial Least Squares (PLS) Regression:
Assumptions: PLS is specifically designed to handle situations where the data might not be normally distributed and there are linear relationships between features and a target variable (regression setting). It focuses on finding latent variables (LVs) that are maximally correlated with the target variable while explaining variance in the features.
Process: PLS identifies LVs that explain both the relationship between features and the target variable. By selecting a subset of the top LVs, you can reduce dimensionality while preserving the information relevant to the target variable and linear relationships between features.
Advantages: PLS addresses the issue of scaling sensitivity and is robust to non-normality, making it a good choice when dealing with non-normal data and linear correlations in a regression context.
Choosing the right technique:
If you don't have a target variable and solely focus on capturing linear relationships between features: PCA is a suitable choice, even with non-normal data, as long as you standardize or normalize your features.
If you have a target variable and want to reduce dimensionality while preserving information relevant to the target variable and linear relationships between features: PLS regression is the preferred option due to its robustness to non-normality and ability to handle the target variable.
Remember to evaluate the effectiveness of your chosen technique based on the specific characteristics of your data and the intended use of the reduced dimensionality representation.]]></description><link>online-vault/ml-concepts/data-analysis/data-preprocessing.html</link><guid isPermaLink="false">Online Vault/ML concepts/Data analysis/Data preprocessing.md</guid><pubDate>Thu, 29 Feb 2024 14:28:04 GMT</pubDate><enclosure url="." length="0" type="false"/><content:encoded>&lt;figure&gt;&lt;img src=&quot;.&quot;&gt;&lt;/figure&gt;</content:encoded></item><item><title><![CDATA[T-SNE]]></title><description><![CDATA[Guide to t-SNE: [https://distill.pub/2016/misread-tsne/]
openTSNE library: [https://opentsne.readthedocs.io/en/stable/examples/04_large_data_sets/04_large_data_sets.html]
<img alt="tsne_ex.png" src="online-vault/images/tsne_ex.png" target="_self">PCA preserves global structure (large groups) better than t-SNE, whereas t-SNE preserves smaller, local groups. This is because t-SNE's k-NN-based neighborhood graph limits what each point samples to its nearest neighbors.<br><img alt="Pasted image 20231019104315.png" src="online-vault/images/pasted-image-20231019104315.png" target="_self">On why we don't use it for clustering: "There are reasons why t-sne is not used as a clustering algorithm. First, as you point out yourself, that t-sne does not generate any cluster assignments. Instead, it performs dimensionality reduction, embedding the data into a low dimensional space that is&nbsp;easy to visualize. You could, of course, use a standard clustering algorithm such as k-means on this embedding to get clusters. However, if the clusters exist in the data, you should not need to map it to 2D first.<br>Secondly, and this is quite crucial, t-sne may create embedding containing clusters that don't really exist in the real data. Also, it may disregard clusters that do exist in the real data. Depending on the randomness in the algorithm and chosen hyperparameters, you may get very different results. I recommend reading the article&nbsp;<a data-tooltip-position="top" aria-label="https://distill.pub/2016/misread-tsne/" rel="noopener nofollow" class="external-link is-unresolved" href="https://distill.pub/2016/misread-tsne/" target="_self">How to Use t-SNE Effectively</a>&nbsp;discussing some of these unexpected scenarios. Another, related issue is reproducibility of t-sne results. Finally, if there are real clusters in your data, you should be able to find them using standard, well understood clustering algorithms. Using methods that people understand gives way more credibility to your results and makes it easier to interpret them." 
[https://stats.stackexchange.com/questions/447236/can-t-sne-be-directly-used-as-a-clustering-algorithm] What we can do is use it to visualize the features captured by our model and visually inspect the quality of the decision boundaries for image classification.
To do that, we can extract intermediate NN activations: n-dimensional vectors that are projected to 2D using t-SNE. For example, the last activations before the classification layers can give us insight into the learned representation of the network.<br><img alt="Pasted image 20231019140742.png" src="online-vault/images/pasted-image-20231019140742.png" target="_self"><br>To get the most out of our machines, it is usually faster to run these models on GPUs. To do this, you can use the Rapids AI library running on a <a data-href="WSL2 Ubuntu 22.04+Windows 10" href="online-vault/tutoriels/wsl2-ubuntu-22.04+windows-10.html" class="internal-link" target="_self" rel="noopener nofollow">WSL2 Ubuntu 22.04+Windows 10</a> subsystem and following the guide written in <a data-href="Running jupyter or IDE on WSL2" href="online-vault/tutoriels/running-jupyter-or-ide-on-wsl2.html" class="internal-link" target="_self" rel="noopener nofollow">Running jupyter or IDE on WSL2</a>.
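The projection step can be sketched with scikit-learn's TSNE; here synthetic blobs stand in for extracted penultimate-layer activations (the dimensions and cluster count are assumptions for the example):

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.manifold import TSNE

# Stand-in for penultimate-layer activations: 3 well-separated classes in 64-D.
feats, labels = make_blobs(n_samples=300, n_features=64, centers=3,
                           random_state=0)

# Project to 2-D for visual inspection of the learned representation.
emb = TSNE(n_components=2, perplexity=30, init="pca",
           random_state=0).fit_transform(feats)
# `emb` can now be scatter-plotted, coloured by `labels`, to eyeball
# how cleanly the representation separates the classes.
```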
The library contains cuML, which mimics the sklearn API, reproducing most algorithms but implementing them in a CUDA-compatible manner. This includes t-SNE and UMAP as well as PCA for dimensionality reduction. Random Forests and XGBoost can also be run on the GPU.]]></description><link>online-vault/ml-concepts/dimensionality-reduction/t-sne.html</link><guid isPermaLink="false">Online Vault/ML concepts/Dimensionality Reduction/T-SNE.md</guid><pubDate>Tue, 27 Feb 2024 14:28:42 GMT</pubDate><enclosure url="." length="0" type="false"/><content:encoded>&lt;figure&gt;&lt;img src=&quot;.&quot;&gt;&lt;/figure&gt;</content:encoded></item>
Graph Neural Networks (GNNs) are a type of <a data-href="deep learning" href="online-vault/ml-concepts/deep-learning.html" class="internal-link" target="_self" rel="noopener nofollow">deep learning</a> model specifically designed to work with data structured as graphs. Unlike traditional neural networks that operate on grids or sequences, GNNs exploit the connections and relationships between nodes (points) in a <a data-tooltip-position="top" aria-label="Graph Theory" data-href="Graph Theory" href="online-vault/ml-concepts/graph-theory.html" class="internal-link" target="_self" rel="noopener nofollow">graph</a> to make predictions or classifications. Here's a simplified breakdown of how they work:1. Graph Input: Imagine a social network as a graph, where nodes are people and edges represent their connections. Each node can have features like name, interests, etc. GNNs take this graph as input, including node and edge features.2. Message Passing: The core idea of GNNs is "message passing," where information is exchanged between connected nodes. In each layer, each node takes its own features and combines them with features received from its neighbors through a message passing function. This function aggregates information from the neighborhood, considering edge information as well.3. Updating Node Features: The aggregated information in each layer is then used to update the node's own features. This updated feature becomes the "new understanding" of the node, incorporating its surroundings.4. Multiple Layers: This message passing and feature update process happens in multiple layers, allowing the network to learn complex relationships across the graph. Deeper layers capture more global information.5. Output: Depending on the task, the final node features can be used for various purposes. For example, in social network analysis, node features might be used to predict someone's interests or community membership.Key points to remember:
GNNs are permutation invariant, meaning their output doesn't change if the order of nodes changes (rearranging a social network doesn't affect predictions).
Different GNN architectures exist, using different ways to aggregate information and update features.
GNNs are powerful tools for analyzing and understanding complex relationships in various domains like social networks, molecules, recommendation systems, and more.
Graph Neural Networks (GNNs):
Pros: Can effectively capture&nbsp;spatial relationships&nbsp;between vegetation patches by constructing a graph where nodes represent locations and edges represent connections based on proximity or interaction.
Able to utilize additional&nbsp;geospatial features&nbsp;beyond the image data (e.g.,&nbsp;elevation,&nbsp;soil type) through node and edge attributes. Cons: May not be ideal if spatial relationships are weak or irrelevant for classification.
Might require careful graph construction and feature engineering.
Can be computationally expensive for large datasets and complex graph structures. <br><img alt="GCN_vs_CNN.png" src="online-vault/images/gcn_vs_cnn.png" target="_self"><br>
<a data-tooltip-position="top" aria-label="https://arxiv.org/pdf/1901.00596.pdf" rel="noopener nofollow" class="external-link is-unresolved" href="https://arxiv.org/pdf/1901.00596.pdf" target="_self">Source</a>The majority of GNNs are Graph Convolutional Networks, and it is important to learn about them before jumping into a node classification tutorial.&nbsp;&nbsp;The&nbsp;convolution&nbsp;in GCN is the same as a convolution in convolutional neural networks. It multiplies neurons with weights (filters) to learn from data features.&nbsp;<br>It acts as sliding windows on whole images to learn features from neighboring cells. The filter uses weight sharing to learn various facial features in image recognition systems -&nbsp;<a data-tooltip-position="top" aria-label="https://towardsdatascience.com/understanding-graph-convolutional-networks-for-node-classification-a2bfdb7aba7b" rel="noopener nofollow" class="external-link is-unresolved" href="https://towardsdatascience.com/understanding-graph-convolutional-networks-for-node-classification-a2bfdb7aba7b" target="_self">Towards Data Science</a>.&nbsp;Now transfer the same functionality to Graph Convolutional networks where a model learns the features from neighboring nodes. The major difference between GCN and CNN is that it is developed to work on non-euclidean data structures where the order of nodes and edges can vary.<br>Now, GCNs take inspiration from <a data-href="Convolutional Neural Networks" href="convolutional-neural-networks.html" class="internal-link" target="_self" rel="noopener nofollow">Convolutional Neural Networks</a> (CNNs), which excel at processing grid-structured data like images. 
They introduce the concept of graph convolution, adapting the idea of filtering operations from CNNs to the graph domain. Here's the GCN breakdown:<br>Feature Embeddings:&nbsp;Each node in the graph starts with a feature vector representing its attributes.<br>Neighborhood Aggregation:&nbsp;For each node, the GCN considers its neighbors and aggregates their feature vectors using a specific convolution operation. This operation considers both node features and edge information (e.g., weights, types).<br>Non-linearity:&nbsp;The aggregated information passes through a non-linear activation function, introducing complexity and expressiveness.<br>Weight Updates:&nbsp;Learnable weights are used within the convolution operation and updated during training to optimize the model's ability to capture relevant information from the graph.<br>Multiple Layers:&nbsp;Similar to CNNs, GCNs often use multiple convolutional layers stacked together, allowing them to extract features at different levels of abstraction.<br>Key Differences:
Regular GNNs:&nbsp;Can use various message-passing schemes and aggregation functions,&nbsp;offering flexibility but potentially requiring more careful design.
GCNs:&nbsp;Leverage the simplicity and efficiency of convolutional operations,&nbsp;but might be less flexible for tasks requiring highly customized message passing.
Advantages of GCNs:
Efficient:&nbsp;Utilize efficient convolutional operations,&nbsp;making them faster than some regular GNNs.
Easy to Implement:&nbsp;Building on the familiar concept of convolutions from CNNs simplifies implementation and understanding.
Powerful Feature Learning:&nbsp;Capture both node features and structural information effectively.
Disadvantages of GCNs:
Limited Expressivity:&nbsp;May not be as flexible as some regular GNNs for complex message-passing tasks.
<br>Assumptions:&nbsp;Implicitly assume <a data-href="smoothness" href="online-vault/ml-concepts/topology/smoothness.html" class="internal-link" target="_self" rel="noopener nofollow">smoothness</a> in the graph data,&nbsp;which might not hold for all types of graphs. Here, smoothness means that nearby nodes share similar information and characteristics: message passing expects the features of neighboring nodes to be more similar than those of nodes further away.
Choosing the Right Tool:The best choice between a GCN and a regular GNN depends on your specific data and problem. Consider factors like:
Graph structure:&nbsp;If your graph has strong spatial smoothness,&nbsp;GCNs might be a good choice.
Task complexity:&nbsp;If your task requires highly customized message passing,&nbsp;a regular GNN might offer more flexibility.
Computational resources:&nbsp;If efficiency is crucial,&nbsp;GCNs might be more suitable.
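The graph-convolution operation discussed above can be sketched in NumPy with the commonly used symmetrically normalized propagation rule H' = ReLU(D^-1/2 (A+I) D^-1/2 H W); the toy path graph and random weights are assumptions for the example:

```python
import numpy as np

def gcn_layer(A, H, W):
    """One graph-convolution layer: H' = ReLU(D^-1/2 (A+I) D^-1/2 H W)."""
    A_hat = A + np.eye(len(A))                 # add self-loops
    d_inv_sqrt = 1.0 / np.sqrt(A_hat.sum(1))   # degree normalisation
    A_norm = A_hat * d_inv_sqrt[:, None] * d_inv_sqrt[None, :]
    return np.maximum(A_norm @ H @ W, 0.0)     # aggregate, project, ReLU

# Toy graph: 4 nodes in a path 0-1-2-3, 2 input features, 3 hidden units.
A = np.array([[0, 1, 0, 0],
              [1, 0, 1, 0],
              [0, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)
H = np.array([[1, 0], [0, 1], [1, 1], [0, 0]], dtype=float)
rng = np.random.default_rng(0)
W = rng.normal(size=(2, 3))
H1 = gcn_layer(A, H, W)  # each row now mixes a node's features with its neighbours'
```

Stacking this layer (with a fresh `W` per layer) lets information propagate one hop further each time, which is what the "multiple layers" step above refers to.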
Here's a shortened version of the training process for a classification GNN:1. Preprocessing:
Build your spatial data graph with relevant features for nodes and edges.
Label a portion of nodes for training and split the data into training, validation, and test sets.
2. Model and Training:
Choose a GNN architecture (e.g., GCN, GraphSAGE) and add a classification layer.
Select an optimizer, loss function (e.g., cross-entropy), and train the model with the training data.
Tune hyperparameters based on validation set performance.
3. Evaluation:
Evaluate the model on unseen data from the test set using relevant metrics.
Visualize results to understand predictions and identify potential issues.
Splitting graph data for training and testing while preserving the graph structure is indeed a challenge. Here are some strategies you can adopt, depending on your specific situation:1. Node-level Splitting:
Random Split: Randomly select nodes for training and testing, while ensuring both sets cover diverse parts of the graph. This may work if there's no strong spatial dependence or community structure.
Stratified Split: Divide nodes into groups based on features or community detection algorithms. Then, randomly sample from each group to avoid under-representation of specific areas in the training or testing set.
Node Importance Sampling: Prioritize nodes based on their importance or centrality in the graph, ensuring influential nodes are present in both sets.
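The stratified node split can be sketched with scikit-learn's train_test_split; the hypothetical community labels (e.g. produced by a community detection algorithm) are an assumption for the example:

```python
import numpy as np
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
node_ids = np.arange(400)
# Hypothetical community assignment for each node (e.g. from Louvain).
communities = rng.integers(0, 4, size=400)

# Stratified split: every community is represented proportionally
# in both the training and the testing node sets.
train_nodes, test_nodes = train_test_split(
    node_ids, test_size=0.25, stratify=communities, random_state=0)
```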
2. Subgraph Splitting:
Connected Components: If your graph consists of distinct, unconnected islands, treat each island as a separate subgraph and randomly split them for training and testing.
Community Detection: Identify communities within the graph and split them, preserving local connectivity. This requires reliable community detection techniques and might not be suitable for all graph structures.
Random Walk Sampling: Simulate random walks on the graph and collect subgraphs around starting points from different areas. This can capture diverse network regions while maintaining local connections.
3. Edge-level Splitting:
Random Edge Split: Randomly remove edges to create disconnected subgraphs for training and testing. This method might disrupt information flow and is generally less common.
Ego-Network Splitting: For each node, create a subgraph with its immediate neighbors and randomly split these ego-networks. This preserves local structure but might not capture larger network patterns.
Additional Considerations:
Task: The splitting strategy should align with your classification task. Consider whether preserving local connections or global network structure is more crucial for accurate predictions.
Data characteristics: Analyze your graph's properties, like community structure, node importance, and connectivity, to choose an appropriate splitting approach.
Evaluation: Assess how different splitting methods affect your model's performance on unseen data and choose the one that leads to the most generalizable results.
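The connected-components strategy above can be sketched in plain Python (a BFS-based stand-in for what graph libraries such as networkx provide); the toy three-island edge list is an assumption for the example:

```python
from collections import defaultdict, deque

edges = [(0, 1), (1, 2),                 # island A
         (10, 11), (11, 12),             # island B
         (20, 21), (21, 22), (22, 23)]   # island C

adj = defaultdict(set)
for u, v in edges:
    adj[u].add(v)
    adj[v].add(u)

def connected_components(adj):
    seen, comps = set(), []
    for start in adj:
        if start in seen:
            continue
        comp, queue = set(), deque([start])
        while queue:                      # BFS flood-fill of one island
            n = queue.popleft()
            if n in comp:
                continue
            comp.add(n)
            seen.add(n)
            queue.extend(adj[n] - comp)
        comps.append(comp)
    return comps

comps = sorted(connected_components(adj), key=len, reverse=True)
train_nodes = comps[0]                    # largest island for training
test_nodes = set().union(*comps[1:])      # remaining islands for testing
# No edge crosses the split, so each island's structure stays intact.
```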
Remember, there's no one-size-fits-all solution. Experimenting with different splitting strategies and evaluating their impact on your specific data and task will help you identify the most effective approach for training and testing your GNN model while maintaining the integrity of the graph structure. Training a GNN or GCN on disconnected graph islands and generalizing to unseen parts raises some considerations. Here are some approaches you can explore: 1. Graph Augmentation:
Connect graph islands: If spatial proximity or similarity information is available, you can create artificial edges between islands based on these criteria. This allows some information flow and learning across initially disconnected parts.
Random Walk with Restart: Simulate random walks on the entire graph, allowing occasional jumps between islands to encourage exploring different disconnected parts and capturing some global context.
Node Feature Augmentation: Augment node features with additional information, like estimated distances to unseen areas or summary statistics of neighboring islands, enriching the representation without physical connections.
2. Semi-supervised Learning:
Leverage additional labelled data from connected graphs in a similar domain. Train the model on these known graphs and then fine-tune it on your partially labelled disconnected islands. This leverages existing knowledge for generalization.
Utilize techniques like label propagation or self-training to assign pseudo-labels to unlabeled nodes within your islands, expanding the training data and promoting learning from unlabeled information.
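The label-propagation idea can be sketched as follows; the unweighted adjacency dict, binary labels, and iteration count are illustrative assumptions:

```python
def propagate_labels(adj, seeds, n_iter=20):
    """seeds: {node: 0.0 or 1.0}. Unlabeled nodes start at 0.5 and are
    repeatedly replaced by the mean of their neighbors; seeds stay clamped."""
    scores = {n: seeds.get(n, 0.5) for n in adj}
    for _ in range(n_iter):
        new = {}
        for n in adj:
            if n in seeds:
                new[n] = seeds[n]  # clamp known labels
            else:
                nbrs = adj[n]
                new[n] = sum(scores[v] for v in nbrs) / len(nbrs) if nbrs else scores[n]
        scores = new
    # threshold the propagated scores into pseudo-labels
    return {n: int(s >= 0.5) for n, s in scores.items()}

# Path graph 0-1-2-3-4 with known labels only at the two ends
adj = {0: {1}, 1: {0, 2}, 2: {1, 3}, 3: {2, 4}, 4: {3}}
pseudo = propagate_labels(adj, seeds={0: 0.0, 4: 1.0})
```

The pseudo-labels can then be merged into the training set, as described above, to learn from the unlabeled nodes.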
3. Hierarchical GNNs:
Employ hierarchical GNN architectures that learn at multiple levels of granularity. On the lower level, each island is processed individually. Then, information is aggregated at a higher level, capturing more global features and potentially bridging the gap between disconnected parts.
4. Metric Learning:
Train a separate metric learning model to estimate distances between nodes in different islands. This can help guide information flow within the GNN by using these estimated distances when aggregating neighborhood information.
5. Attention Mechanisms:
Incorporate attention mechanisms within the GNN architecture. These mechanisms allow the model to focus on relevant neighboring nodes, even if they are in different islands, potentially capturing long-range dependencies and generalizing better.
Additional Tips:
Explore different GNN architectures beyond GCNs, as they might offer more flexibility and expressiveness for your specific data.
Experiment with hyperparameters and carefully evaluate model performance on unseen hold-out data from different islands.
Consider visualization techniques to understand how the model is utilizing connections within and across islands.
Below, we’ve outlined some of the types of GNN tasks with examples:
Graph Classification: classifies entire graphs into categories; applications include social network analysis and text classification.&nbsp;
Node Classification: this task uses neighboring node labels to predict missing node labels in a graph.&nbsp;
Link Prediction: predicts the link between a pair of nodes in a graph with an incomplete adjacency matrix. It is commonly used for social networks.&nbsp;
Community Detection: divides nodes into clusters based on edge structure, learning from edge weights and distances much as clustering does.&nbsp;
Graph Embedding: maps graphs into vectors, preserving the relevant information on nodes, edges, and structure.
Graph Generation: learns from a sample graph distribution to generate a new but similar graph structure.
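One of these tasks, link prediction, can be illustrated with a toy scoring function over node embeddings. The embeddings below are random stand-ins for what a trained GNN would produce, and the sigmoid-of-dot-product score is just one common choice:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical learned node embeddings for a 5-node graph (random stand-ins).
Z = rng.standard_normal((5, 8))

def link_score(u, v):
    """Score a candidate edge (u, v) as a sigmoid of the embedding dot product."""
    return 1.0 / (1.0 + np.exp(-(Z[u] @ Z[v])))

# Score every candidate pair; high scores suggest likely missing edges.
scores = {(u, v): link_score(u, v) for u in range(5) for v in range(u + 1, 5)}
```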
<br><img alt="Graph_problems.png" src="online-vault/images/graph_problems.png" target="_self"> (<a rel="noopener nofollow" class="external-link is-unresolved" href="https://www.datacamp.com/tutorial/comprehensive-introduction-graph-neural-networks-gnns-tutorial" target="_self">https://www.datacamp.com/tutorial/comprehensive-introduction-graph-neural-networks-gnns-tutorial</a>)<br><a data-href="Application of a Novel Multiscale Global Graph Convolutional Neural Network to Improve the Accuracy of Forest Type Classification Using Aerial Photographs" href="online-vault/papers/application-of-a-novel-multiscale-global-graph-convolutional-neural-network-to-improve-the-accuracy-of-forest-type-classification-using-aerial-photographs.html" class="internal-link" target="_self" rel="noopener nofollow">Application of a Novel Multiscale Global Graph Convolutional Neural Network to Improve the Accuracy of Forest Type Classification Using Aerial Photographs</a>]]></description><link>online-vault/ml-concepts/models/graph-neural-networks.html</link><guid isPermaLink="false">Online Vault/ML concepts/Models/Graph Neural Networks.md</guid><pubDate>Mon, 26 Feb 2024 15:47:15 GMT</pubDate><enclosure url="." length="0" type="false"/><content:encoded>&lt;figure&gt;&lt;img src=&quot;.&quot;&gt;&lt;/figure&gt;</content:encoded></item><item><title><![CDATA[GCN_vs_CNN]]></title><description><![CDATA[<img src="online-vault/images/gcn_vs_cnn.png" target="_self">]]></description><link>online-vault/images/gcn_vs_cnn.html</link><guid isPermaLink="false">Online Vault/Images/GCN_vs_CNN.png</guid><pubDate>Mon, 26 Feb 2024 15:43:00 GMT</pubDate><enclosure url="." 
length="0" type="false"/><content:encoded>&lt;figure&gt;&lt;img src=&quot;.&quot;&gt;&lt;/figure&gt;</content:encoded></item><item><title><![CDATA[Graph_problems]]></title><description><![CDATA[<img src="online-vault/images/graph_problems.png" target="_self">]]></description><link>online-vault/images/graph_problems.html</link><guid isPermaLink="false">Online Vault/Images/Graph_problems.png</guid><pubDate>Mon, 26 Feb 2024 15:40:28 GMT</pubDate><enclosure url="." length="0" type="false"/><content:encoded>&lt;figure&gt;&lt;img src=&quot;.&quot;&gt;&lt;/figure&gt;</content:encoded></item><item><title><![CDATA[Metaheuristic Optimization]]></title><description><![CDATA[Metaheuristic optimization is an approach that computes efficient approximations of solutions instead of exact solutions to an optimization problem. When faced with real-world problems, finding an optimal solution is often costly in time and compute, so methods were developed to approximate good-enough solutions faster and more efficiently. Among these methods, we will focus especially on nature-inspired metaheuristics that mimic strategies observed in plants, animals, or fungi.<br>When you're faced with a difficult problem, metaheuristic optimization is like a smart helper guiding you towards the best solution: a way of exploring different options without getting stuck in dead ends, even when the problem is tricky and has many possible answers.<br>From <a data-tooltip-position="top" aria-label="https://link.springer.com/content/pdf/10.1007/s10462-016-9486-6.pdf" rel="noopener nofollow" class="external-link is-unresolved" href="https://link.springer.com/content/pdf/10.1007/s10462-016-9486-6.pdf" target="_self">Plant intelligence based metaheuristic optimization algorithms</a>
The reasons why metaheuristic algorithms are needed are as follows:
The optimization problem can have a structure for which no procedure for finding the exact solution can be defined.
Metaheuristic algorithms can be much simpler from the decision maker's point of view, in terms of comprehensibility.
Metaheuristic algorithms can be used as part of the process of finding the exact solution, and for learning purposes.
<br>Generally, the most difficult parts of real-world problems (which objectives and which restrictions must be used, which alternatives must be tested, how the problem's data must be collected) are neglected in definitions made with mathematical formulas. Faulty data used in the process of determining model parameters can cause much larger errors than the sub-optimal solution produced by a metaheuristic approach (Karaboğa&nbsp;<a data-tooltip-position="top" aria-label="https://link.springer.com/article/10.1007/s10462-016-9486-6#ref-CR33" rel="noopener nofollow" class="external-link is-unresolved" title="Karaboğa D (2011) Yapay Zeka Optimizasyon Algoritmaları. Nobel Yayın Dağıtım" href="https://link.springer.com/article/10.1007/s10462-016-9486-6#ref-CR33" target="_self">2011</a>).
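As a concrete illustration of the approximate-rather-than-exact idea, here is a minimal simulated-annealing sketch (a physics-based metaheuristic). The objective function, step size, and cooling schedule are arbitrary demonstration choices, not from the cited paper:

```python
import math
import random

def simulated_annealing(f, x0, step=0.5, t0=1.0, cooling=0.95, n_iter=500, seed=0):
    """Minimize f over the reals: accept worse moves with probability
    exp(-delta / T), so the search can escape local minima while T is high."""
    rng = random.Random(seed)
    x, fx, temp = x0, f(x0), t0
    best, fbest = x, fx
    for _ in range(n_iter):
        cand = x + rng.uniform(-step, step)
        fc = f(cand)
        # Always accept improvements; accept worse moves with Boltzmann probability.
        if fc < fx or rng.random() < math.exp((fx - fc) / temp):
            x, fx = cand, fc
            if fx < fbest:
                best, fbest = x, fx
        temp *= cooling  # cool down: exploration -> exploitation
    return best, fbest

# Rugged 1-D objective with several local minima.
f = lambda x: x * x + 2 * math.sin(5 * x)
best, fbest = simulated_annealing(f, x0=4.0)
```

An exact solver would need calculus on this specific function; the metaheuristic only ever evaluates `f`, which is what makes it reusable on problems with no tractable exact procedure.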
In a nutshell, metaheuristic optimization is a powerful tool for solving tough optimization problems. It's crucial because it can find good solutions in complex situations where traditional methods struggle. General-purpose metaheuristic methods fall into eight groups: biology-based, physics-based, swarm-based, social-based, music-based, chemistry-based, sport-based, and math-based. There are also hybrid methods that combine these.<br><img alt="metaheuristic methods.png" src="online-vault/images/metaheuristic-methods.png" target="_self">In particular, plant intelligence metaheuristic optimization algorithms:<br><img alt="Plant intelligence metaheuristic optimization algorithms.png" src="online-vault/images/plant-intelligence-metaheuristic-optimization-algorithms.png" target="_self">Related topics:<br>
<a data-href="Fungal Kingdom Expansion Algorithm" href="online-vault/exploration/fungal-kingdom-expansion-algorithm.html" class="internal-link" target="_self" rel="noopener nofollow">Fungal Kingdom Expansion Algorithm</a>]]></description><link>online-vault/exploration/metaheuristic-optimization.html</link><guid isPermaLink="false">Online Vault/Exploration/Metaheuristic Optimization.md</guid><pubDate>Mon, 26 Feb 2024 15:28:49 GMT</pubDate><enclosure url="." length="0" type="false"/><content:encoded>&lt;figure&gt;&lt;img src=&quot;.&quot;&gt;&lt;/figure&gt;</content:encoded></item><item><title><![CDATA[Multimodal Deep learning]]></title><description><![CDATA[Multimodal deep learning is the discipline of machine learning where the input consists of different modalities, i.e., data of different interpretive natures, such as sound and images. Multi-modal learning with deep learning is a subfield of machine learning that focuses on training models to process and learn from multiple data sources (modalities). These modalities can be diverse and include:
Images:&nbsp;Capturing visual information
Text:&nbsp;Providing textual descriptions, labels, or captions
Audio:&nbsp;Containing sounds or speech
Sensor data:&nbsp;Offering measurements like temperature, pressure, or acceleration
LiDAR data:&nbsp;Providing 3D point cloud information
The key objective is to leverage the complementary information present in these diverse data sources to create a richer understanding and improve performance on various tasks, such as:
Image classification:&nbsp;Combining image data with textual descriptions to improve the accuracy of identifying objects in images.
Machine translation:&nbsp;Utilizing both audio recordings of spoken language and corresponding text transcripts to enhance translation quality.
In a late fusion approach with multi-modal learning, separate sub-models are pre-trained on individual data modalities before being combined for the final task. This pre-training offers several benefits:
Leverage existing knowledge:&nbsp;Each sub-model can leverage existing knowledge from pre-trained models on similar data, improving its learning efficiency and performance. For example, a sub-model for image data may be pre-trained on a large image dataset like ImageNet, while a sub-model for text data might be pre-trained on a massive text corpus like Wikipedia.
Reduce training complexity:&nbsp;Pre-training each sub-model on specific data types simplifies the overall training process and reduces the computational burden of training a single model from scratch on all modalities combined.
Improve feature representation:&nbsp;By pre-training, each sub-model can learn effective feature representations specific to its corresponding data modality. These learned features can then be effectively combined during the late fusion stage.
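The late-fusion setup described above can be sketched with NumPy. The fixed random projections below are stand-ins for pre-trained, frozen sub-models, and all dimensions are assumptions for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-ins for pre-trained, frozen sub-models: fixed random projections
# mapping each modality into its own embedding space.
W_img = rng.standard_normal((512, 64))   # image features -> 64-d embedding
W_txt = rng.standard_normal((300, 64))   # text features  -> 64-d embedding

def encode(x, W):
    """A one-layer 'encoder': linear projection followed by ReLU."""
    return np.maximum(x @ W, 0.0)

def late_fusion(x_img, x_txt):
    """Encode each modality separately, then concatenate for a shared head."""
    return np.concatenate([encode(x_img, W_img), encode(x_txt, W_txt)], axis=-1)

# A batch of 8 samples with both modalities present.
x_img = rng.standard_normal((8, 512))
x_txt = rng.standard_normal((8, 300))
fused = late_fusion(x_img, x_txt)        # shape (8, 128)

# A simple linear head on the fused representation (untrained, illustrative).
W_head = rng.standard_normal((128, 2))
logits = fused @ W_head
```

In a real system the two encoders would be pre-trained networks (e.g., an image backbone and a text encoder) and only the fusion head, or the head plus fine-tuned encoders, would be trained on the joint task.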
There are various techniques for pre-training sub-models, depending on the specific data modalities and desired task:
Transfer Learning:&nbsp;This involves utilizing pre-trained models on similar tasks and data but different modalities. For example, a pre-trained image classification model (e.g., VGG-16) can be utilized as a feature extractor for image data in a multi-modal classification task.
<a data-href="Self-supervised Learning" href=".html" class="internal-link" target="_self" rel="noopener nofollow">Self-supervised Learning</a>:&nbsp;This approach utilizes the data itself to create pseudo-labels or tasks for pre-training. For example, in image data, pre-training can involve predicting image rotations, identifying missing image patches, or coloring grayscale images.
<br><a data-href="Multi-task Learning" href="online-vault/ml-concepts/architecture-design/multi-task-learning.html" class="internal-link" target="_self" rel="noopener nofollow">Multi-task Learning</a>:&nbsp;This method involves training a single model on multiple related tasks simultaneously. While not strictly pre-training, this technique can improve the model's ability to learn transferable features across different modalities.
Overall, pre-training sub-models in a late fusion approach can significantly benefit multi-modal learning with deep learning by leveraging existing knowledge, reducing training complexity, and improving feature representation for the final task.]]></description><link>online-vault/ml-concepts/models/multimodal-deep-learning.html</link><guid isPermaLink="false">Online Vault/ML concepts/Models/Multimodal Deep learning.md</guid><pubDate>Mon, 26 Feb 2024 15:26:31 GMT</pubDate></item><item><title><![CDATA[Application of a Novel Multiscale Global Graph Convolutional Neural Network to Improve the Accuracy of Forest Type Classification Using Aerial Photographs]]></title><description><![CDATA[Link : <a data-tooltip-position="top" aria-label="https://www.mdpi.com/2072-4292/15/4/1001" rel="noopener nofollow" class="external-link is-unresolved" href="https://www.mdpi.com/2072-4292/15/4/1001" target="_self">Application of a Novel Multiscale Global Graph Convolutional Neural Network to Improve the Accuracy of Forest Type Classification Using Aerial Photographs</a>Here's a breakdown of the paper "Application of a Novel Multiscale Global Graph Convolutional Neural Network to Improve the Accuracy of Forest Type Classification Using Aerial Photographs" published in Remote Sensing:Problem the Paper Addresses
Accurately classifying different forest types from aerial photos is important for forest management and ecological studies.
Traditional image classification methods often struggle with the complexity and subtle variations within forest types.
The Proposed Solution
The authors introduce a new technique called Multiscale Global Graph Convolutional Neural Network (MSG-GCN). This method works by:
Thinking of images like graphs: The image is divided into small sections (patches), and each patch is treated as a node in a graph. Connections between patches are established based on how similar they are.
Multiscale analysis: The graph is analyzed at different scales, capturing both the fine details within patches and the broader relationships between them.
Graph Convolutional Networks (GCNs): A type of neural network designed for graphs is used. This helps extract complex features from the image data, considering the relationships and context.
Key Findings
The MSG-GCN method significantly outperforms several traditional and deep learning-based image classification techniques in forest type identification.
This method's ability to analyze image data at multiple scales and its use of relationships between image sections contribute to its improved accuracy.
Limitations and Future Work
The authors acknowledge that the computational cost of the MSG-GCN method can be relatively high.
They suggest future research on designing more computationally efficient graph-based models.
Let's Simplify Further
Imagine you have a picture of a forest. Instead of looking at it as a whole, MSG-GCN breaks it into puzzle pieces. It doesn't just look at each piece individually; it figures out how the pieces relate to each other. Then, it uses a super-smart tool (a graph convolutional network) to understand the patterns and connections both within pieces and across the whole puzzle. This is what allows it to figure out the different types of trees and areas in the forest much better.<br><img alt="MSGCN.png" src="online-vault/images/msgcn.png" target="_self">Let's break down the model architecture illustrated in the image. Here's a description of the components and their functions:
Core Structure:
Encoder-Decoder: The model follows a general encoder-decoder architecture.
Encoder: Responsible for taking the input image and compressing it into a more abstract representation (think data compression, but for understanding the image).
Decoder: This part takes the compressed representation from the encoder and expands it, aiming to reconstruct the important details and generate an output.
Overall Goal: This model aims to segment remote sensing images: it takes an image and attempts to identify and label different regions of interest within it (e.g., different types of vegetation, land cover, etc.).
Key Components:
Encoder (E1, E2, E3): This part of the model is responsible for extracting features from the input image (IM).
It uses a series of convolutions (Conv1x1xω, Conv3x3x2ω) to analyze the image at different levels of detail.
<br>Local <a data-href="Attention" href="online-vault/ml-concepts/attention.html" class="internal-link" target="_self" rel="noopener nofollow">Attention</a> module (LA): Improves the model's focus on relevant areas within the image. Decoder (D1, D2, D3): This part takes the features from the encoder and 'upsamples' them, gradually reconstructing the image at its original resolution.
Upsampling (UP1, UP2, UP3): Increases resolution using bilinear interpolation.
Convolutions are used here as well, likely to refine and combine the features.
Concatenate: Operation to combine information from different parts of the model. This helps preserve features from different levels of processing.
SoftMax: This is the final classification layer. It takes the output of the decoder and produces a probability for each pixel belonging to a particular class (what the model thinks that pixel represents).
How It Works (Simplified)
Input:&nbsp;A remote sensing image (IM) is fed into the model.
Encoding:&nbsp;The encoder breaks the image down, analyzing it at different scales using convolutions. Local attention modules help focus on the most important features.
Decoding:&nbsp;The decoder takes these features and gradually reconstructs the image, increasing resolution.
Classification:&nbsp;The SoftMax layer analyzes the processed image. It assigns each pixel a probability of belonging to a specific class (e.g., forest, water, urban area).
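The steps above can be sketched as a toy NumPy pipeline: patches as graph nodes, similarity-based edges, one normalized GCN-style propagation step, and a softmax over classes. This is a loose illustration of the idea, not the paper's actual MSG-GCN implementation; the image size, patch grid, threshold, and class count are all made-up:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical setup: an 8x8 single-channel "image" cut into four 4x4 patches,
# each patch flattened into a node feature vector.
image = rng.standard_normal((8, 8))
patches = image.reshape(2, 4, 2, 4).transpose(0, 2, 1, 3).reshape(4, 16)

# Build a similarity graph between patches (cosine similarity, thresholded).
unit = patches / np.linalg.norm(patches, axis=1, keepdims=True)
A = (unit @ unit.T > 0.0).astype(float)
np.fill_diagonal(A, 1.0)                      # self-loops

# One GCN-style propagation step: D^-1/2 A D^-1/2 X W.
D_inv_sqrt = np.diag(1.0 / np.sqrt(A.sum(axis=1)))
W = rng.standard_normal((16, 3))              # 3 hypothetical forest classes
H = D_inv_sqrt @ A @ D_inv_sqrt @ patches @ W

# Per-patch class probabilities via softmax.
expH = np.exp(H - H.max(axis=1, keepdims=True))
probs = expH / expH.sum(axis=1, keepdims=True)
```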
The model seems to combine traditional convolutional layers with more advanced techniques like the Local Attention module and potentially custom information propagation mechanisms. Its primary focus is on:
Multiscale Analysis: Examining the image at different levels of detail
Contextual Relationships: Capturing how different parts of the image relate to each other.
Important Notes
Some notations might be specific to this paper. A deeper understanding likely requires reading the full text for detailed explanations of custom components.
This is a high-level interpretation. The exact implementation details can be further understood by looking at the source code (if available) or a more in-depth explanation within the paper itself.
]]></description><link>online-vault/papers/application-of-a-novel-multiscale-global-graph-convolutional-neural-network-to-improve-the-accuracy-of-forest-type-classification-using-aerial-photographs.html</link><guid isPermaLink="false">Online Vault/Papers/Application of a Novel Multiscale Global Graph Convolutional Neural Network to Improve the Accuracy of Forest Type Classification Using Aerial Photographs.md</guid><pubDate>Mon, 26 Feb 2024 15:00:20 GMT</pubDate><enclosure url="." length="0" type="false"/><content:encoded>&lt;figure&gt;&lt;img src=&quot;.&quot;&gt;&lt;/figure&gt;</content:encoded></item><item><title><![CDATA[MSGCN]]></title><description><![CDATA[<img src="online-vault/images/msgcn.png" target="_self">]]></description><link>online-vault/images/msgcn.html</link><guid isPermaLink="false">Online Vault/Images/MSGCN.png</guid><pubDate>Mon, 26 Feb 2024 14:49:11 GMT</pubDate><enclosure url="." length="0" type="false"/><content:encoded>&lt;figure&gt;&lt;img src=&quot;.&quot;&gt;&lt;/figure&gt;</content:encoded></item><item><title><![CDATA[smoothness]]></title><description><![CDATA[In a general mathematical setting, smoothness refers to the absence of abrupt changes or sharp corners in a function, curve, or surface. It quantifies how "well-behaved" the object is when we move along it. Here's how it applies to different contexts:1. Functions:
A function is considered smooth if its derivatives exist and are continuous at every point of its domain (in the strictest sense, derivatives of all orders). Intuitively, a smooth function has no "jumps" or "gaps" and allows for a smooth transition in its values.
2. Curves:
A curve is considered smooth if it can be locally represented by a smooth function. This means for any small portion of the curve, we can find a function that accurately describes its behavior around that point.
3. Surfaces:
Similar to curves, a surface is smooth if it can be locally represented by a smooth function that maps points from a parameter space (like a plane) to the surface itself.
Graph data, representing relationships between entities, doesn't directly translate to functions or continuous surfaces. However, the concept of smoothness can be adapted in different ways depending on the specific context and the information we want to capture:
1. Smoothness in Signal Propagation:
Here, smoothness refers to the ease of information flow across a network represented by the graph. A smooth graph allows information to propagate quickly and efficiently from one node to another, with minimal "bumps" or delays. This can be measured using metrics like clustering coefficient or average path length.
2. Smoothness in Node Features:
When nodes in a graph have associated features (like values or attributes), smoothness can represent the similarity between neighboring nodes based on their features. Smoothness in this context suggests that nearby nodes have similar features, forming "clusters" in the feature space. This can be quantified with the graph Laplacian quadratic form, or explored with methods like spectral clustering.
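The Laplacian view of feature smoothness can be made concrete: for an unweighted graph with Laplacian L = D - A, the quadratic form x^T L x equals the sum of squared feature differences across edges, so smaller values mean a smoother signal. A minimal NumPy sketch (the graph and signals are illustrative):

```python
import numpy as np

def laplacian_smoothness(A, x):
    """x^T L x with L = D - A; equals the sum of (x_u - x_v)^2 over edges,
    so smaller values mean a smoother signal on the graph."""
    L = np.diag(A.sum(axis=1)) - A
    return float(x @ L @ x)

# Path graph 0-1-2-3 (symmetric, unweighted adjacency matrix).
A = np.array([[0, 1, 0, 0],
              [1, 0, 1, 0],
              [0, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)

smooth = laplacian_smoothness(A, np.array([1.0, 1.0, 1.0, 1.0]))    # constant signal
rough  = laplacian_smoothness(A, np.array([1.0, -1.0, 1.0, -1.0]))  # alternating signal
```

The constant signal scores 0 (perfectly smooth), while the alternating signal scores 12 (each of the 3 edges contributes a squared difference of 4).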
3. Smoothness in Graph Embeddings:
In tasks like network analysis, graphs are often embedded into lower-dimensional spaces (e.g., from a complex network structure to a 2D plane). Smoothness in this context ensures that similar nodes in the original graph are also close together in the embedding space, preserving the underlying relationships.
Overall, while the specific definitions of smoothness differ based on the context, the underlying idea remains the same: it captures the absence of abrupt changes or discontinuities, helping us understand the structure and behavior of data represented by graphs.]]></description><link>online-vault/ml-concepts/topology/smoothness.html</link><guid isPermaLink="false">Online Vault/ML concepts/Topology/smoothness.md</guid><pubDate>Mon, 26 Feb 2024 10:07:54 GMT</pubDate></item><item><title><![CDATA[GNN]]></title><description><![CDATA[<img src="online-vault/images/gnn.png" target="_self">]]></description><link>online-vault/images/gnn.html</link><guid isPermaLink="false">Online Vault/Images/GNN.png</guid><pubDate>Wed, 21 Feb 2024 13:18:16 GMT</pubDate><enclosure url="." length="0" type="false"/><content:encoded>&lt;figure&gt;&lt;img src=&quot;.&quot;&gt;&lt;/figure&gt;</content:encoded></item><item><title><![CDATA[Manifold]]></title><description><![CDATA[<img alt="Manifolds.png" src="online-vault/images/manifolds.png" target="_self"><br>
<a data-tooltip-position="top" aria-label="https://mathworld.wolfram.com/Manifold.html" rel="noopener nofollow" class="external-link is-unresolved" href="https://mathworld.wolfram.com/Manifold.html" target="_self">Manifold from Wolfram</a><br>Manifold:
<br>
Basic definition:&nbsp;A manifold is a <a data-href="topological space" href="online-vault/ml-concepts/topology/topological-space.html" class="internal-link" target="_self" rel="noopener nofollow">topological space</a> that locally resembles Euclidean space in small enough regions. Imagine stretching a rubber sheet – although it bends and curves, each tiny patch around a point looks flat like a normal plane.
Key features:
Locally Euclidean: Around each point, you can find a coordinate system that makes it look like flat Euclidean space.
Smoothness: This "looks-like-Euclidean" property applies smoothly as you move from one point to another.
Examples: Sphere, torus, Möbius strip, complex number plane (minus the origin). <br>A manifold is a&nbsp;<a data-tooltip-position="top" aria-label="https://mathworld.wolfram.com/TopologicalSpace.html" rel="noopener nofollow" class="external-link is-unresolved" href="https://mathworld.wolfram.com/TopologicalSpace.html" target="_self">topological space</a>&nbsp;that is&nbsp;<a data-tooltip-position="top" aria-label="https://mathworld.wolfram.com/Local.html" rel="noopener nofollow" class="external-link is-unresolved" href="https://mathworld.wolfram.com/Local.html" target="_self">locally</a>&nbsp;<a data-tooltip-position="top" aria-label="https://mathworld.wolfram.com/EuclideanSpace.html" rel="noopener nofollow" class="external-link is-unresolved" href="https://mathworld.wolfram.com/EuclideanSpace.html" target="_self">Euclidean</a>&nbsp;(i.e., around every point, there is a&nbsp;<a data-tooltip-position="top" aria-label="https://mathworld.wolfram.com/Neighborhood.html" rel="noopener nofollow" class="external-link is-unresolved" href="https://mathworld.wolfram.com/Neighborhood.html" target="_self">neighborhood</a>&nbsp;that is topologically the same as the&nbsp;<a data-tooltip-position="top" aria-label="https://mathworld.wolfram.com/OpenSet.html" rel="noopener nofollow" class="external-link is-unresolved" href="https://mathworld.wolfram.com/OpenSet.html" target="_self">open</a>&nbsp;<a data-tooltip-position="top" aria-label="https://mathworld.wolfram.com/UnitBall.html" rel="noopener nofollow" class="external-link is-unresolved" href="https://mathworld.wolfram.com/UnitBall.html" target="_self">unit ball</a>&nbsp;in&nbsp;R^n). To illustrate this idea, consider the ancient belief that the Earth was flat as contrasted with the modern evidence that it is round. The discrepancy arises essentially from the fact that on the small scales that we see, the Earth does indeed look flat. 
In general, any object that is nearly "flat" on small scales is a manifold, and so manifolds constitute a generalization of objects we could live on in which we would encounter the round/flat Earth problem, as first codified by Poincaré.More concisely, any object that can be "mapped" is a manifold.<br>
<img alt="Mug_and_Torus_morph.gif" src="online-vault/images/mug_and_torus_morph.gif" target="_self"><br>
One of the goals of topology is to find ways of distinguishing manifolds. For instance, a circle is topologically the same as any closed loop, no matter how different these two manifolds may appear. Similarly, the surface of a coffee mug with a handle is topologically the same as the surface of the donut, and this type of surface is called a (one-handled)&nbsp;<a data-tooltip-position="top" aria-label="https://mathworld.wolfram.com/Torus.html" rel="noopener nofollow" class="external-link is-unresolved" href="https://mathworld.wolfram.com/Torus.html" target="_self">torus</a>.<img alt="ManifoldMug" src="https://mathworld.wolfram.com/images/eps-svg/ManifoldMug_1000.svg" referrerpolicy="no-referrer" target="_self" class="is-unresolved"><br>As a&nbsp;<a data-tooltip-position="top" aria-label="https://mathworld.wolfram.com/TopologicalSpace.html" rel="noopener nofollow" class="external-link is-unresolved" href="https://mathworld.wolfram.com/TopologicalSpace.html" target="_self">topological space</a>, a manifold can be&nbsp;<a data-tooltip-position="top" aria-label="https://mathworld.wolfram.com/CompactSpace.html" rel="noopener nofollow" class="external-link is-unresolved" href="https://mathworld.wolfram.com/CompactSpace.html" target="_self">compact</a>&nbsp;or noncompact, and&nbsp;<a data-tooltip-position="top" aria-label="https://mathworld.wolfram.com/ConnectedSet.html" rel="noopener nofollow" class="external-link is-unresolved" href="https://mathworld.wolfram.com/ConnectedSet.html" target="_self">connected</a>&nbsp;or disconnected. Commonly, the unqualified term "manifold" is used to mean "<a data-tooltip-position="top" aria-label="https://mathworld.wolfram.com/ManifoldwithBoundary.html" rel="noopener nofollow" class="external-link is-unresolved" href="https://mathworld.wolfram.com/ManifoldwithBoundary.html" target="_self">manifold with boundary</a>." This is the usage followed in this work. 
However, an author will sometimes be more precise and use the term&nbsp;<a data-tooltip-position="top" aria-label="https://mathworld.wolfram.com/OpenManifold.html" rel="noopener nofollow" class="external-link is-unresolved" href="https://mathworld.wolfram.com/OpenManifold.html" target="_self">open manifold</a>&nbsp;for a noncompact manifold without boundary or&nbsp;<a data-tooltip-position="top" aria-label="https://mathworld.wolfram.com/ClosedManifold.html" rel="noopener nofollow" class="external-link is-unresolved" href="https://mathworld.wolfram.com/ClosedManifold.html" target="_self">closed manifold</a>&nbsp;for a&nbsp;<a data-tooltip-position="top" aria-label="https://mathworld.wolfram.com/CompactManifold.html" rel="noopener nofollow" class="external-link is-unresolved" href="https://mathworld.wolfram.com/CompactManifold.html" target="_self">compact manifold</a>&nbsp;with boundary.<br>If a manifold contains its own boundary, it is called, not surprisingly, a "<a data-tooltip-position="top" aria-label="https://mathworld.wolfram.com/ManifoldwithBoundary.html" rel="noopener nofollow" class="external-link is-unresolved" href="https://mathworld.wolfram.com/ManifoldwithBoundary.html" target="_self">manifold with boundary</a>." The closed unit ball in&nbsp;R^n&nbsp;is a manifold with boundary, and its boundary is the unit sphere. The concept can be generalized to manifolds with corners. 
By definition, every point on a manifold has a neighborhood together with a&nbsp;<a data-tooltip-position="top" aria-label="https://mathworld.wolfram.com/Homeomorphism.html" rel="noopener nofollow" class="external-link is-unresolved" href="https://mathworld.wolfram.com/Homeomorphism.html" target="_self">homeomorphism</a>&nbsp;of that neighborhood with an&nbsp;<a data-tooltip-position="top" aria-label="https://mathworld.wolfram.com/OpenBall.html" rel="noopener nofollow" class="external-link is-unresolved" href="https://mathworld.wolfram.com/OpenBall.html" target="_self">open ball</a>&nbsp;in&nbsp;R^n. In addition, a manifold must have a&nbsp;<a data-tooltip-position="top" aria-label="https://mathworld.wolfram.com/SecondCountableTopology.html" rel="noopener nofollow" class="external-link is-unresolved" href="https://mathworld.wolfram.com/SecondCountableTopology.html" target="_self">second countable topology</a>. Unless otherwise indicated, a manifold is assumed to have finite&nbsp;<a data-tooltip-position="top" aria-label="https://mathworld.wolfram.com/Dimension.html" rel="noopener nofollow" class="external-link is-unresolved" href="https://mathworld.wolfram.com/Dimension.html" target="_self">dimension</a>&nbsp;N, for&nbsp;N&nbsp;a positive integer.<br><a data-tooltip-position="top" aria-label="https://mathworld.wolfram.com/SmoothManifold.html" rel="noopener nofollow" class="external-link is-unresolved" href="https://mathworld.wolfram.com/SmoothManifold.html" target="_self">Smooth manifolds</a>&nbsp;(also called differentiable manifolds) are manifolds for which overlapping charts "relate smoothly" to each other, meaning that the inverse of one followed by the other is an infinitely differentiable map from&nbsp;<a data-tooltip-position="top" aria-label="https://mathworld.wolfram.com/EuclideanSpace.html" rel="noopener nofollow" class="external-link is-unresolved" href="https://mathworld.wolfram.com/EuclideanSpace.html" target="_self">Euclidean space</a>&nbsp;to itself. 
Manifolds arise naturally in a variety of mathematical and physical applications as "global objects." For example, in order to precisely describe all the configurations of a robot arm or all the possible positions and momenta of a rocket, an object is needed to store all of these parameters. The objects that crop up are manifolds. From the geometric perspective, manifolds represent the profound idea having to do with global versus local properties.<br>The basic example of a manifold is&nbsp;<a data-tooltip-position="top" aria-label="https://mathworld.wolfram.com/EuclideanSpace.html" rel="noopener nofollow" class="external-link is-unresolved" href="https://mathworld.wolfram.com/EuclideanSpace.html" target="_self">Euclidean space</a>, and many of its properties carry over to manifolds. In addition, any smooth boundary of a subset of Euclidean space, like the circle or the sphere, is a manifold. Manifolds are therefore of interest in the study of&nbsp;<a data-tooltip-position="top" aria-label="https://mathworld.wolfram.com/Geometry.html" rel="noopener nofollow" class="external-link is-unresolved" href="https://mathworld.wolfram.com/Geometry.html" target="_self">geometry</a>,&nbsp;<a data-tooltip-position="top" aria-label="https://mathworld.wolfram.com/Topology.html" rel="noopener nofollow" class="external-link is-unresolved" href="https://mathworld.wolfram.com/Topology.html" target="_self">topology</a>, and&nbsp;<a data-tooltip-position="top" aria-label="https://mathworld.wolfram.com/Analysis.html" rel="noopener nofollow" class="external-link is-unresolved" href="https://mathworld.wolfram.com/Analysis.html" target="_self">analysis</a>.<br>A&nbsp;<a data-tooltip-position="top" aria-label="https://mathworld.wolfram.com/Submanifold.html" rel="noopener nofollow" class="external-link is-unresolved" href="https://mathworld.wolfram.com/Submanifold.html" target="_self">submanifold</a>&nbsp;is a subset of a manifold that is itself a manifold, but has smaller dimension. 
For example, the equator of a sphere is a submanifold. Many common examples of manifolds are submanifolds of Euclidean space. In fact, Whitney showed in the 1930s that any manifold can be&nbsp;<a data-tooltip-position="top" aria-label="https://mathworld.wolfram.com/Embedding.html" rel="noopener nofollow" class="external-link is-unresolved" href="https://mathworld.wolfram.com/Embedding.html" target="_self">embedded</a>&nbsp;in&nbsp;R^(2n+1), where&nbsp;n&nbsp;is the dimension of the manifold.<br>A manifold may be endowed with more structure than a locally Euclidean topology. For example, it could be&nbsp;<a data-tooltip-position="top" aria-label="https://mathworld.wolfram.com/SmoothManifold.html" rel="noopener nofollow" class="external-link is-unresolved" href="https://mathworld.wolfram.com/SmoothManifold.html" target="_self">smooth</a>,&nbsp;<a data-tooltip-position="top" aria-label="https://mathworld.wolfram.com/ComplexManifold.html" rel="noopener nofollow" class="external-link is-unresolved" href="https://mathworld.wolfram.com/ComplexManifold.html" target="_self">complex</a>, or even&nbsp;<a data-tooltip-position="top" aria-label="https://mathworld.wolfram.com/AlgebraicManifold.html" rel="noopener nofollow" class="external-link is-unresolved" href="https://mathworld.wolfram.com/AlgebraicManifold.html" target="_self">algebraic</a>&nbsp;(in order of specificity). 
A smooth manifold with a&nbsp;<a data-tooltip-position="top" aria-label="https://mathworld.wolfram.com/RiemannianMetric.html" rel="noopener nofollow" class="external-link is-unresolved" href="https://mathworld.wolfram.com/RiemannianMetric.html" target="_self">metric</a>&nbsp;is called a&nbsp;<a data-tooltip-position="top" aria-label="https://mathworld.wolfram.com/RiemannianManifold.html" rel="noopener nofollow" class="external-link is-unresolved" href="https://mathworld.wolfram.com/RiemannianManifold.html" target="_self">Riemannian manifold</a>, and one with a&nbsp;<a data-tooltip-position="top" aria-label="https://mathworld.wolfram.com/SymplecticStructure.html" rel="noopener nofollow" class="external-link is-unresolved" href="https://mathworld.wolfram.com/SymplecticStructure.html" target="_self">symplectic structure</a>&nbsp;is called a&nbsp;<a data-tooltip-position="top" aria-label="https://mathworld.wolfram.com/SymplecticManifold.html" rel="noopener nofollow" class="external-link is-unresolved" href="https://mathworld.wolfram.com/SymplecticManifold.html" target="_self">symplectic manifold</a>. 
Finally, a&nbsp;<a data-tooltip-position="top" aria-label="https://mathworld.wolfram.com/ComplexManifold.html" rel="noopener nofollow" class="external-link is-unresolved" href="https://mathworld.wolfram.com/ComplexManifold.html" target="_self">complex manifold</a>&nbsp;with a&nbsp;<a data-tooltip-position="top" aria-label="https://mathworld.wolfram.com/KaehlerStructure.html" rel="noopener nofollow" class="external-link is-unresolved" href="https://mathworld.wolfram.com/KaehlerStructure.html" target="_self">Kähler structure</a>&nbsp;is called a&nbsp;<a data-tooltip-position="top" aria-label="https://mathworld.wolfram.com/KaehlerManifold.html" rel="noopener nofollow" class="external-link is-unresolved" href="https://mathworld.wolfram.com/KaehlerManifold.html" target="_self">Kähler manifold</a>.]]></description><link>online-vault/ml-concepts/topology/manifold.html</link><guid isPermaLink="false">Online Vault/ML concepts/Topology/Manifold.md</guid><pubDate>Tue, 20 Feb 2024 10:42:58 GMT</pubDate><enclosure url="." length="0" type="false"/><content:encoded>&lt;figure&gt;&lt;img src=&quot;.&quot;&gt;&lt;/figure&gt;</content:encoded></item><item><title><![CDATA[Mug_and_Torus_morph]]></title><description><![CDATA[<img src="online-vault/images/mug_and_torus_morph.gif" target="_self">]]></description><link>online-vault/images/mug_and_torus_morph.html</link><guid isPermaLink="false">Online Vault/Images/Mug_and_Torus_morph.gif</guid><pubDate>Tue, 20 Feb 2024 10:37:48 GMT</pubDate><enclosure url="." length="0" type="false"/><content:encoded>&lt;figure&gt;&lt;img src=&quot;.&quot;&gt;&lt;/figure&gt;</content:encoded></item><item><title><![CDATA[Manifolds]]></title><description><![CDATA[<img src="online-vault/images/manifolds.png" target="_self">]]></description><link>online-vault/images/manifolds.html</link><guid isPermaLink="false">Online Vault/Images/Manifolds.png</guid><pubDate>Tue, 20 Feb 2024 10:25:23 GMT</pubDate><enclosure url="." 
length="0" type="false"/><content:encoded>&lt;figure&gt;&lt;img src=&quot;.&quot;&gt;&lt;/figure&gt;</content:encoded></item><item><title><![CDATA[topological space]]></title><description><![CDATA[Topological spaces are a fundamental concept in mathematics, but they can be intimidating at first. Let's explore them in three levels, from basic intuition to technical details:Easy:Imagine a coffee mug and a doughnut. Both are objects in 3D space, but they have different "shapes" in a way that matters for topology. You can continuously deform the mug into a doughnut without tearing or gluing anything (imagine stretching the mug handle). However, you can't deform a coffee mug into a sphere without cutting or joining parts – their "topological types" are different.Topology is like studying the "rubber sheet" properties of shapes – how they can be stretched, bent, and twisted without changing their fundamental structure. Think of it as the "shape without size or distance."Medium:Now, imagine a map of your city. Each point on the map represents a location, but the distances between points are not perfectly accurate. What matters topologically is how streets connect and neighborhoods are arranged. You can imagine stretching the map or even cutting and gluing parts, as long as the overall connections remain the same.This is the idea of a topological space – it's a collection of points (like locations on a map) where "nearness" is defined by how you can continuously move between points, not by strict distance. This allows for studying shapes and connections in a more abstract way.Hard:Technically, a topological space is defined by a set of points and a collection of subsets called "open sets" that satisfy certain axioms. 
These axioms capture the intuitive idea of "nearness" and allow for formal analysis of continuous deformations and connections. Further, different types of topological spaces exist, like <a data-tooltip-position="top" aria-label="Manifold" data-href="Manifold" href="online-vault/ml-concepts/topology/manifold.html" class="internal-link" target="_self" rel="noopener nofollow">manifolds</a> (smooth, locally Euclidean spaces like spheres or tori), <a data-href="CW complexes" href=".html" class="internal-link" target="_self" rel="noopener nofollow">CW complexes</a> (built by gluing together cells, i.e. disks of various dimensions), and <a data-href="simplicial complexes" href=".html" class="internal-link" target="_self" rel="noopener nofollow">simplicial complexes</a> (built from simplices: vertices, edges, triangles, and their higher-dimensional analogues). Each type has additional properties that allow for deeper analysis and applications in various mathematical fields. Remember, this is a simplified overview. Topology is a vast and fascinating field with many nuances and applications. But hopefully, this gives you a taste of its core idea: studying the fundamental shapes and connections of objects, independent of size and distance.]]></description><link>online-vault/ml-concepts/topology/topological-space.html</link><guid isPermaLink="false">Online Vault/ML concepts/Topology/topological space.md</guid><pubDate>Tue, 20 Feb 2024 10:21:53 GMT</pubDate></item><item><title><![CDATA[Graph Theory]]></title><description><![CDATA[<img alt="graph_1.png" src="online-vault/images/graph_1.png" target="_self">Graph Theory:Graph theory is a branch of mathematics that studies graphs, which are mathematical structures used to model pairwise relationships between objects. It focuses on the properties and behavior of these structures, exploring how objects connect and relate to each other. Think of it as the study of networks and connections.Graphs:A graph consists of two main components:
Vertices: Also called nodes or points, these represent the objects or entities being studied. Think of them as individual actors in a network.
Edges: These represent the connections or relationships between vertices. Imagine them as lines or links connecting the actors.
Graphs can be directed (edges have a direction) or undirected (edges have no direction). Additionally, edges can be weighted (assigned a numerical value) to represent the strength or cost of the connection.Here are some important features of graphs:
Order: The number of vertices in the graph.
Size: The number of edges in the graph.
Degree of a vertex: The number of edges connected to that vertex.
Path: A sequence of connected edges leading from one vertex to another.
Cycle: A path that starts and ends at the same vertex.
Connectivity: Whether all vertices are connected by paths.
Subgraph: A subset of vertices and edges forming a smaller graph within the original.
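The features above can be computed directly. A minimal sketch, assuming a toy undirected graph stored as a Python adjacency dict (the graph data is made up for illustration):

```python
# Toy undirected graph as an adjacency dict (hypothetical example data).
graph = {
    "A": {"B", "C"},
    "B": {"A", "C"},
    "C": {"A", "B", "D"},
    "D": {"C"},
}

order = len(graph)                                     # number of vertices
size = sum(len(n) for n in graph.values()) // 2        # each undirected edge is counted twice
degree = {v: len(nbrs) for v, nbrs in graph.items()}   # edges touching each vertex

print(order)        # 4
print(size)         # 4
print(degree["C"])  # 3
```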
Examples of graphs in real life:
Social networks: Vertices represent people, edges represent friendships.
Transportation networks: Vertices represent cities, edges represent roads or flights.
Internet: Vertices represent websites, edges represent links between them.
Molecules: Vertices represent atoms, edges represent chemical bonds.
<br>There are fascinating connections between graph theory,<a data-tooltip-position="top" aria-label="Manifold" data-href="Manifold" href="online-vault/ml-concepts/topology/manifold.html" class="internal-link" target="_self" rel="noopener nofollow">manifolds</a>, and <a data-tooltip-position="top" aria-label="topological space" data-href="topological space" href="online-vault/ml-concepts/topology/topological-space.html" class="internal-link" target="_self" rel="noopener nofollow">topological spaces</a>, offering different perspectives on understanding structure and relationships:Graphs and Topological Spaces:
<br>Direct Link: Every graph can be viewed as a topological space by constructing an appropriate topology on its vertices and edges. This allows applying tools from topology to analyze graphs, like studying their <a data-href="connectedness" href=".html" class="internal-link" target="_self" rel="noopener nofollow">connectedness</a>, path-finding problems, and even their "shape" in a topological sense.
Graphs and Manifolds:
Embedding Graphs on Manifolds: A key question is whether a given graph can be "drawn" on a specific manifold (like a plane, sphere, torus) without edges crossing. This connects to graph drawing algorithms and has applications in areas like network visualization and mapmaking. Graph Homology: This advanced technique uses tools from algebraic topology to analyze graphs by looking at "cycles" and "boundaries" formed by their edges. It reveals features like the number of connected components and holes in the graph, providing a deeper understanding of its structure. Manifolds and Topological Spaces:
Manifolds as Special Topological Spaces: Manifolds are a specific type of topological space with additional smoothness properties. They offer a more geometric perspective on shape and allow for richer analysis using calculus and differential geometry. Topological Invariants: Studying manifolds as topological spaces allows calculating "invariants" like their Euler characteristic, which captures their overall topological type and is independent of specific metric details. Overall:These connections provide valuable tools for analyzing complex structures:
Graphs: Leverage topological and manifold perspectives to understand connectivity, layout, and underlying structure.
Manifolds: Use topological tools to extract global properties and invariants based on their "shape."
Topological Spaces: Offer a foundation for both graphs and manifolds, providing abstract yet powerful tools for studying relationships and connectivity.
There are several ways to encode graphs numerically, depending on the type of graph, the information you want to capture, and the intended application. Here are some common approaches:<br><img alt="adj_matrix.png" src="online-vault/images/adj_matrix.png" target="_self">
1. Adjacency Matrix:
This is the most basic and widely used method. It's a square matrix where rows and columns represent vertices, and each element (i,j) indicates the presence (value 1) or absence (value 0) of an edge between the corresponding vertices.
Weighted edges can be represented by assigning their weight to the corresponding matrix element.
This approach is simple and efficient for basic graph operations like checking whether two vertices are adjacent (a single lookup), and direction can be captured by making the matrix asymmetric. Its main drawback is memory: the matrix always takes space proportional to the square of the number of vertices, which is wasteful for sparse graphs.
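A minimal sketch of the adjacency-matrix encoding, using a made-up weighted undirected graph:

```python
# Build an adjacency matrix for a small weighted, undirected graph
# (hypothetical example data); entry (i, j) holds the edge weight, 0 = no edge.
import numpy as np

vertices = ["A", "B", "C", "D"]
idx = {v: i for i, v in enumerate(vertices)}
edges = [("A", "B", 1.0), ("A", "C", 2.5), ("C", "D", 0.7)]

adj = np.zeros((len(vertices), len(vertices)))
for u, v, w in edges:
    adj[idx[u], idx[v]] = w
    adj[idx[v], idx[u]] = w  # symmetric because the graph is undirected

# Checking adjacency is a single lookup:
connected = adj[idx["A"], idx["C"]] != 0
```

For a directed graph one would simply drop the symmetric assignment.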
<br><img alt="edge_list.png" src="online-vault/images/edge_list.png" target="_self">
2. Edge list:
This method lists all edges in the graph, typically as pairs of vertices representing the connected nodes.
It can be easily extended to include edge weights by adding a third element to each pair.
This approach is more space-efficient for sparse graphs (fewer edges than vertices) and preserves edge direction, but it requires iterating through the entire list for some operations.
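The same toy graph as an edge list, illustrating both the space saving and the need to scan for queries (example data is hypothetical):

```python
# Weighted edge list: each entry is (source, target, weight).
# Direction is preserved by the order of the pair.
edge_list = [("A", "B", 1.0), ("A", "C", 2.5), ("C", "D", 0.7)]

# Space grows with the number of edges only, but lookups require a scan:
def has_edge(edges, u, v):
    return any(s == u and t == v for s, t, _ in edges)

print(has_edge(edge_list, "A", "C"))  # True
print(has_edge(edge_list, "C", "A"))  # False under the directed interpretation
```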
3. Incidence matrix:
This matrix has rows for vertices and columns for edges. A value of 1 at (i,j) indicates that vertex i is incident to edge j.
Similar to the adjacency matrix, it can represent edge weights and direction.
This representation is useful for analyzing specific characteristics like vertex degrees and edge connectivity, but it can be larger and less efficient for some tasks.
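A sketch of the incidence matrix for the same small undirected graph (hypothetical data); note how row sums recover the vertex degrees mentioned above:

```python
# Incidence matrix: rows are vertices, columns are edges; a 1 at (i, j)
# means vertex i touches edge j. Toy undirected graph for illustration.
import numpy as np

vertices = ["A", "B", "C", "D"]
edges = [("A", "B"), ("A", "C"), ("C", "D")]
idx = {v: i for i, v in enumerate(vertices)}

inc = np.zeros((len(vertices), len(edges)), dtype=int)
for j, (u, v) in enumerate(edges):
    inc[idx[u], j] = 1
    inc[idx[v], j] = 1

# Row sums give vertex degrees:
degrees = inc.sum(axis=1)  # A:2, B:1, C:2, D:1
```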
4. Node and Edge Feature Vectors:
In addition to structural information, graphs can contain additional data attached to nodes and edges (e.g., node attributes, edge labels).
These features can be encoded numerically using one-hot encoding, embedding techniques, or other methods depending on the data type.
This allows for incorporating richer information into the analysis, but it requires more complex processing and storage.
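One-hot encoding, mentioned above, is the simplest of these feature encodings. A minimal sketch with a made-up categorical node attribute:

```python
# One-hot encode a categorical node attribute (hypothetical labels),
# turning each node's category into a numeric feature vector.
import numpy as np

node_labels = {"A": "person", "B": "person", "C": "company", "D": "city"}
categories = sorted(set(node_labels.values()))  # ['city', 'company', 'person']

def one_hot(label):
    vec = np.zeros(len(categories))
    vec[categories.index(label)] = 1.0
    return vec

features = {node: one_hot(lbl) for node, lbl in node_labels.items()}
```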
5. Graph Embedding Techniques:
These advanced techniques represent entire graphs as numerical vectors capturing their structural and sometimes attribute-based information.
Popular methods include DeepWalk, Node2Vec, and Graph Convolutional Networks (GCNs).
This allows for using machine learning algorithms on graph data, but these methods often require specific training data and can be computationally expensive.
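The first stage of a DeepWalk-style method can be sketched simply: generate truncated random walks over the graph, which are then fed to a word2vec-style model (that training step is omitted here; the toy graph is made up):

```python
# Generate truncated random walks from each vertex of a toy graph --
# the corpus-building step of DeepWalk-style graph embeddings.
import random

graph = {"A": ["B", "C"], "B": ["A"], "C": ["A", "D"], "D": ["C"]}

def random_walk(start, length, rng):
    walk = [start]
    for _ in range(length - 1):
        walk.append(rng.choice(graph[walk[-1]]))  # hop to a uniformly random neighbor
    return walk

rng = random.Random(0)  # seeded for reproducibility
walks = [random_walk(v, 5, rng) for v in graph for _ in range(10)]
```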
The choice of encoding method depends on your specific needs and the type of analysis you want to perform. Consider factors like data size, desired information, computational efficiency, and downstream applications when making your decision.]]></description><link>online-vault/ml-concepts/graph-theory.html</link><guid isPermaLink="false">Online Vault/ML concepts/Graph Theory.md</guid><pubDate>Tue, 20 Feb 2024 10:09:10 GMT</pubDate><enclosure url="." length="0" type="false"/><content:encoded>&lt;figure&gt;&lt;img src=&quot;.&quot;&gt;&lt;/figure&gt;</content:encoded></item><item><title><![CDATA[edge_list]]></title><description><![CDATA[<img src="online-vault/images/edge_list.png" target="_self">]]></description><link>online-vault/images/edge_list.html</link><guid isPermaLink="false">Online Vault/Images/edge_list.png</guid><pubDate>Tue, 20 Feb 2024 10:07:35 GMT</pubDate><enclosure url="." length="0" type="false"/><content:encoded>&lt;figure&gt;&lt;img src=&quot;.&quot;&gt;&lt;/figure&gt;</content:encoded></item><item><title><![CDATA[adj_matrix]]></title><description><![CDATA[<img src="online-vault/images/adj_matrix.png" target="_self">]]></description><link>online-vault/images/adj_matrix.html</link><guid isPermaLink="false">Online Vault/Images/adj_matrix.png</guid><pubDate>Tue, 20 Feb 2024 10:06:02 GMT</pubDate><enclosure url="." length="0" type="false"/><content:encoded>&lt;figure&gt;&lt;img src=&quot;.&quot;&gt;&lt;/figure&gt;</content:encoded></item><item><title><![CDATA[The Manifold Hypothesis]]></title><description><![CDATA[The <a data-href="Manifold" href="online-vault/ml-concepts/topology/manifold.html" class="internal-link" target="_self" rel="noopener nofollow">Manifold</a> hypothesis is a fascinating idea in mathematics and machine learning that says even complex data in high dimensions might actually live on "simpler" surfaces.Let's explore it in three levels:Easy:Imagine a bunch of coins scattered on a table. 
Even though the table is 3D (length, width, height), all the coins lie flat on its surface, which is essentially a 2D space. This illustrates the core idea: high-dimensional data (coins) may actually reside on a lower-dimensional "manifold" (tabletop) within the bigger space.Medium:Now, imagine taking pictures of different 3D objects like balls, apples, and chairs. Even though each object can be described by many points (its 3D coordinates), you could argue that all the possible pictures of these objects actually lie on a lower-dimensional manifold. Why? Because there are inherent constraints on how these objects can be shaped and photographed. This manifold captures the essential variations in the pictures without needing all the 3D details.Hard:The technical side involves advanced math concepts like manifolds, which are more general than surfaces but share similar properties. The hypothesis states that complex real-world data, like images, speech, or natural languages, lies on low-dimensional manifolds within their high-dimensional representation. This has profound implications for machine learning:
Dimensionality reduction: If data really lives on a lower-dimensional manifold, we can compress it without losing significant information, making learning and analysis more efficient.
Understanding data structure: Analyzing the manifold's geometry can reveal hidden relationships and patterns in the data.
Developing better algorithms: Machine learning algorithms inspired by the manifold hypothesis can be more powerful and interpretable.
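The dimensionality-reduction point can be illustrated with plain PCA on synthetic data: points generated on a 1-D subspace embedded in 3-D are recovered with a single component. (This is the simplest, linear case; curved manifolds need nonlinear methods such as Isomap or autoencoders.)

```python
# Synthetic data lying on a 1-D "manifold" (a straight line) inside 3-D
# space; PCA via SVD shows one component captures essentially all variance.
import numpy as np

rng = np.random.default_rng(0)
t = rng.normal(size=(200, 1))             # 1-D latent coordinate
direction = np.array([[1.0, 2.0, -1.0]])  # embedding direction (arbitrary choice)
X = t @ direction                          # 200 points in 3-D, all on one line

Xc = X - X.mean(axis=0)                    # center before PCA
U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
explained = S**2 / np.sum(S**2)            # variance captured per component

print(round(explained[0], 3))  # 1.0
```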
Remember, this is a simplified explanation, and the technical details go much deeper. But hopefully, this gives you a taste of the intriguing idea that complex data might have a surprisingly simple underlying structure!In machine learning, besides the manifold hypothesis, several other data hypotheses play crucial roles in shaping how we understand and analyze data. Here are a few key examples:1. The Curse of Dimensionality: This hypothesis proposes that as the number of features in your data increases (dimensionality), the amount of data needed to make reliable predictions also increases exponentially. This poses challenges for data collection, storage, and model training in high-dimensional settings.2. The Bias-Variance Tradeoff: This principle states that there's a fundamental tradeoff between how well a model fits the training data (bias) and how well it generalizes to unseen data (variance). Low bias models can overfit, while high bias models underfit, making finding the optimal balance crucial.3. The Low-Rank Hypothesis: This hypothesis suggests that much of the information in real-world data lies in a low-dimensional subspace, even if the data itself is high-dimensional. This aligns with the manifold hypothesis but focuses on capturing the essential structure with fewer dimensions.4. The Locality Hypothesis: This hypothesis states that data points close together in the feature space are likely to have similar labels or outputs. This is often exploited in algorithms like k-Nearest Neighbors (kNN) where predictions are based on the labels of nearby data points.5. The Invariance Hypothesis: This principle proposes that relevant information in the data should be invariant to certain transformations (e.g., rotation, scaling) that don't change the underlying meaning. This motivates designing models that are robust to such transformations.6. 
The Sparsity Hypothesis: This hypothesis suggests that many real-world datasets are sparse, meaning most data points have only a few relevant features. This motivates using techniques like L1 regularization that encourage model weights to be zero, leading to sparse models.These are just a few examples, and the specific hypotheses relevant to your work will depend on the type of data and the problem you're trying to solve. Understanding these and other data hypotheses helps you approach machine learning problems with a critical eye and select the most appropriate methods for analysis and prediction.]]></description><link>online-vault/ml-concepts/topology/the-manifold-hypothesis.html</link><guid isPermaLink="false">Online Vault/ML concepts/Topology/The Manifold Hypothesis.md</guid><pubDate>Tue, 20 Feb 2024 09:55:01 GMT</pubDate></item><item><title><![CDATA[Riemannian manifold]]></title><description><![CDATA[
Definition:&nbsp;A Riemannian manifold is a&nbsp;specific type of <a data-href="Manifold" href="online-vault/ml-concepts/topology/manifold.html" class="internal-link" target="_self" rel="noopener nofollow">Manifold</a>&nbsp;equipped with an additional structure called a Riemannian metric.&nbsp;This metric allows you to define: Distances between points.
Angles between curves.
Curvature of the manifold itself. Key features: All the properties of a regular manifold.
Has a Riemannian metric that defines distances,&nbsp;angles,&nbsp;and curvature.
Enables geometric analysis:&nbsp;studying angles,&nbsp;areas,&nbsp;geodesics (shortest paths),&nbsp;and curvature.
Examples:&nbsp;Earth's surface (treated as a sphere),&nbsp;curved spacetime in Einstein's general relativity. Here's the analogy:
Think of a manifold as a map.&nbsp;It captures the overall shape and connections between places,&nbsp;but doesn't tell you about distances or angles.
A Riemannian manifold is like a map with marked distances and directions.&nbsp;It provides more information about the "geometry" of the space.
In summary:
Every Riemannian manifold is a manifold,&nbsp;but not all manifolds are Riemannian.
Riemannian manifolds offer richer geometric information compared to general manifolds.
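The extra information the metric provides can be seen on the most familiar example, the unit sphere: the geodesic (great-circle) distance between two points differs from the straight-line distance through the surrounding 3-D space. A minimal sketch:

```python
# Geodesic vs. ambient distance on the unit sphere (standard round metric).
import numpy as np

def great_circle_distance(p, q):
    """Geodesic distance between unit vectors p and q on the unit sphere."""
    return float(np.arccos(np.clip(np.dot(p, q), -1.0, 1.0)))

north = np.array([0.0, 0.0, 1.0])    # north pole
equator = np.array([1.0, 0.0, 0.0])  # a point on the equator

geodesic = great_circle_distance(north, equator)  # pi/2, a quarter circle
chord = float(np.linalg.norm(north - equator))    # sqrt(2), the straight chord
```

The geodesic is longer than the chord: the metric measures distance *along* the surface, which a bare manifold cannot do.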
]]></description><link>online-vault/ml-concepts/topology/riemannian-manifold.html</link><guid isPermaLink="false">Online Vault/ML concepts/Topology/Riemannian manifold.md</guid><pubDate>Tue, 20 Feb 2024 09:55:01 GMT</pubDate></item><item><title><![CDATA[graph_1]]></title><description><![CDATA[<img src="online-vault/images/graph_1.png" target="_self">]]></description><link>online-vault/images/graph_1.html</link><guid isPermaLink="false">Online Vault/Images/graph_1.png</guid><pubDate>Tue, 20 Feb 2024 09:49:34 GMT</pubDate><enclosure url="." length="0" type="false"/><content:encoded>&lt;figure&gt;&lt;img src=&quot;.&quot;&gt;&lt;/figure&gt;</content:encoded></item><item><title><![CDATA[Geostatistics for Large Datasets on Riemannian Manifolds, A Matrix-Free Approach]]></title><description><![CDATA[Journal of Data Science 20 (4), 512–532
DOI: 10.6339/22-JDS1075
October 2022
Statistical Data Science<br>Large or very large spatial (and spatio-temporal) datasets have become commonplace in many environmental and climate studies. These data are often collected in non-Euclidean spaces (such as the planet Earth) and they often present nonstationary anisotropies. This paper proposes a generic approach to model Gaussian Random Fields (GRFs) on compact Riemannian manifolds that bridges the gap between existing works on nonstationary GRFs and random fields on manifolds. This approach can be applied to any smooth compact manifold, and in particular to any compact surface. By defining a Riemannian metric that accounts for the preferential directions of correlation, our approach yields an interpretation of the nonstationary geometric anisotropies as resulting from local deformations of the domain. We provide scalable algorithms for the estimation of the parameters and for optimal prediction by kriging and simulation, able to tackle very large grids. Stationary and nonstationary illustrations are provided.<br>Simplified explanation:<br>Scientists are collecting a lot of data about the Earth and other places, and this data often shows patterns that change from place to place. This paper proposes a new way to understand these patterns using GRFs on special types of surfaces. This new way can be used for any kind of surface, and it can take into account how the patterns change from place to place. The authors also provide ways to use this method to make predictions about the data.<br>Nonstationary anisotropies<br>Anisotropies refer to the property of being directionally dependent. In the context of data analysis, it means that the relationships between data points vary depending on their relative orientation. For example, the temperature distribution on Earth is anisotropic because it is influenced by the Earth's rotation and its geography. Nonstationary anisotropies occur when the strength and direction of the anisotropy vary from place to place. 
For instance, the distribution of rainfall patterns across a continent may exhibit nonstationary anisotropies due to factors like mountains, deserts, and ocean currents.Riemannian manifoldA <a data-href="Riemannian manifold" href="online-vault/ml-concepts/topology/riemannian-manifold.html" class="internal-link" target="_self" rel="noopener nofollow">Riemannian manifold</a> is a generalization of a smooth, n-dimensional manifold to incorporate a notion of curvature. It is equipped with a Riemannian metric. The Riemannian metric defines a notion of distance and angle on the manifold, and it also allows us to define derivatives of functions on the manifold.
In simpler terms, it's a surface with a mathematical structure that allows us to define distances and angles between points on the surface.Imagine a globe as a simplified representation of Earth. The globe's surface is a two-dimensional Riemannian manifold, where distances and angles between points are defined using the standard spherical geometry.A compact Riemannian manifold is a Riemannian manifold that is compact. This means that the manifold can be enclosed in a bounded region. Compact Riemannian manifolds are often used to model physical objects that are finite in size, such as planets, stars, or galaxies.Riemannian manifolds are used to model various physical objects, including curved surfaces, volumes, and even spacetime in general relativity.Understanding Nonstationary Anisotropies on Riemannian ManifoldsThe concept of nonstationary anisotropies becomes particularly relevant when dealing with data collected on non-Euclidean spaces, such as the Earth's surface or other curved manifolds. In these cases, the traditional methods for analyzing stationary isotropic data may not be suitable.The Earth's surface is non-Euclidean because it is curved. Euclidean geometry is based on the assumption that space is flat, with parallel lines extending in the same direction forever without intersecting. However, the Earth's surface is a sphere, and on a sphere, parallel lines eventually converge. This means that Euclidean geometry cannot accurately describe the geometry of the Earth's surface.The paper you mentioned proposes a new approach to modeling nonstationary anisotropies on Riemannian manifolds using Gaussian Random Fields (GRFs). GRFs are a type of stochastic process that can be used to represent random fields with spatially varying properties.By incorporating a Riemannian metric into the GRF framework, the proposed approach allows for the modeling of nonstationary anisotropic patterns on curved manifolds. 
This approach has applications in various fields, including environmental science, geology, and spatial statistics.Random FieldsImagine you have a map of the world with temperatures at different locations. A random field is a way of describing how the temperature changes from place to place. It's like a mathematical model of the temperature data.One important property of random fields is whether they are stationary or not. A stationary random field means that the temperature changes in a similar way all over the world. For example, it might get colder as you move from the equator to the poles. A non-stationary random field means that the temperature changes in a different way in different parts of the world. For example, it might get colder in the winter and warmer in the summer.Another important property of random fields is whether they are isotropic or anisotropic. An isotropic random field means that the temperature changes in the same way in all directions. For example, it might get colder as you move up a mountain, no matter which way you are facing. An anisotropic random field means that the temperature changes in different ways in different directions. For example, it might get colder as you move up a mountain, but it might also get colder as you move north.Here are some of the key properties of random fields:
Stationarity:&nbsp;A random field is stationary if its statistical properties (mean, variance, autocorrelation function) are constant over the entire domain.
Isotropy:&nbsp;A random field is isotropic if its statistical properties are invariant under rotation.
Nonstationarity:&nbsp;A random field is nonstationary if its statistical properties vary over the domain.
Anisotropy:&nbsp;A random field is anisotropic if its statistical properties are not invariant under rotation.
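The stationary, isotropic case above can be sketched numerically: on a 1-D grid, correlation depends only on the distance between points. The squared-exponential covariance below is an illustrative choice, not the model used in the paper:

```python
# Sample one realisation of a stationary, isotropic Gaussian Random Field
# on a 1-D grid, using a squared-exponential covariance function.
import numpy as np

rng = np.random.default_rng(1)
x = np.linspace(0.0, 10.0, 100)

# Covariance: C[i, j] = exp(-(x_i - x_j)^2 / (2 * length_scale^2)),
# a function of distance only (stationarity + isotropy).
length_scale = 1.0
diff = x[:, None] - x[None, :]
C = np.exp(-diff**2 / (2 * length_scale**2))

# Mean-zero draw from the field; round-off can make C look slightly
# non-positive-semidefinite, so the validity check is relaxed.
field = rng.multivariate_normal(mean=np.zeros(len(x)), cov=C, check_valid="ignore")
```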
Gaussian Random Fields<br>A Gaussian Random Field (GRF) is a random process that maps points in some space to random values. The value of the GRF at any point depends on the values at nearby points, and this dependence is typically described by a correlation function. GRFs are often used to model spatial or temporal data, such as temperature, rainfall, or population density.<br>Smooth Compact Surface<br>A smooth compact surface is a smooth two-dimensional manifold that is compact (it can be enclosed in a bounded region and contains all of its limit points); the sphere and the torus are the basic examples. Not every smooth compact surface is topologically equivalent to a sphere: a torus, for instance, cannot be continuously deformed into a sphere without tearing or gluing. Smooth compact surfaces are used to model objects such as the Earth, the moon, and other planets.<br>Here is a table that summarizes the key concepts:]]></description><link>online-vault/papers/geostatistics-for-large-datasets-on-riemannian-manifolds,-a-matrix-free-approach.html</link><guid isPermaLink="false">Online Vault/Papers/Geostatistics for Large Datasets on Riemannian Manifolds, A Matrix-Free Approach.md</guid><pubDate>Tue, 20 Feb 2024 09:18:16 GMT</pubDate></item><item><title><![CDATA[Diffusion Models]]></title><description><![CDATA[Full free course on Deep Learning : <a data-tooltip-position="top" aria-label="https://learn.deeplearning.ai/diffusion-models/lesson/4/neural-network" rel="noopener nofollow" class="external-link is-unresolved" href="https://learn.deeplearning.ai/diffusion-models/lesson/4/neural-network" target="_self">DeepLearning.AI</a>Diffusion Models:Diffusion models are a type of generative model that works by gradually adding noise to an image until it becomes pure Gaussian noise, and then learning to reverse this process to recover the original image. This two-step process, known as denoising diffusion, enables diffusion models to capture the underlying structure of data while also learning to generate realistic details.Advantages of Diffusion Models:Diffusion models offer several advantages over other generative models, including:
High quality: Diffusion models can generate high-quality images with sharper details and less blurring than other models. Simpler training objective: Diffusion models optimize a plain denoising loss, which tends to be easier to train than the adversarial objective of generative adversarial networks (GANs). Improved stability: Diffusion models are less prone to mode collapse, a common issue with GANs. Why UNet is Popular: UNet (U-Net), a convolutional neural network architecture, is a popular choice for diffusion models due to its ability to effectively handle image upsampling and downsampling. UNet's U-shaped architecture enables it to capture high-resolution details while preserving global context. It also preserves the original size of the image, which is necessary because the denoising network must output an image of the same dimensions as its input. Embedding Context in Upsampling: To embed context in the upsampling step, diffusion models can incorporate features from lower-resolution layers into the upsampling process (skip connections). This allows the model to maintain a sense of global context as it reconstructs fine-grained details, resulting in more realistic and coherent images. There are several ways to embed context, such as metadata, into a diffusion model. Conditional diffusion directly incorporates external information into the diffusion process using image metadata, while multimodal diffusion allows for multiple image variations based on different external inputs. Semi-supervised diffusion leverages both labeled and unlabeled data, and iterative diffusion involves refining the generation process through feedback. Finally, reinforcement learning utilizes reward signals derived from external information to train models for specific criteria.<br><img alt="unet_diffusion.png" src="online-vault/images/unet_diffusion.png" target="_self"><br>
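The gradual noising that diffusion models learn to reverse can be sketched numerically. The following is a minimal illustration of the forward (noise-adding) process only, using a linear variance schedule; the schedule constants are illustrative assumptions, not values from any particular model:

```python
import numpy as np

# Forward diffusion: x_t = sqrt(alpha_bar_t) * x_0 + sqrt(1 - alpha_bar_t) * noise
# Linear beta schedule with illustrative values.
T = 1000
betas = np.linspace(1e-4, 0.02, T)
alpha_bars = np.cumprod(1.0 - betas)

def add_noise(x0, t, rng):
    """Sample a noised version x_t of a clean image x0 at timestep t."""
    noise = rng.standard_normal(x0.shape)
    return np.sqrt(alpha_bars[t]) * x0 + np.sqrt(1.0 - alpha_bars[t]) * noise

rng = np.random.default_rng(0)
x0 = rng.standard_normal((8, 8))     # stand-in for an image
x_early = add_noise(x0, 10, rng)     # early step: still close to x0
x_late = add_noise(x0, T - 1, rng)   # final step: almost pure Gaussian noise
```

Training then amounts to teaching a network (typically a UNet) to predict the added noise from `x_t` and `t`, so the process can be run in reverse at sampling time.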
<img alt="Context_embedding.png" src="online-vault/images/context_embedding.png" target="_self"> The size of local diffusion models trained on a single GPU can vary depending on the specific model architecture, the complexity of the training data, and the desired quality of the generated images. However, as a general rule of thumb, you can expect to need a GPU with at least 12GB of VRAM to train a Stable Diffusion-class model with a decent level of quality. For more complex models or larger training datasets, you may need a GPU with 24GB or more of VRAM. Here is a table that summarizes the typical GPU requirements for training different types of local diffusion models: Keep in mind that these are just estimates, and the actual amount of VRAM you need may vary depending on your specific circumstances. If you are not sure whether you have enough VRAM, you can always try training the model with a smaller batch size or resolution. You can also use a tool like nvidia-smi to monitor your GPU's VRAM usage and make sure you are not running out. When selecting an 8GB VRAM GPU for a diffusion model, you should consider the following criteria:
Model architecture: The amount of VRAM required will vary depending on the specific model architecture. For example, training Stable Diffusion from scratch requires more VRAM than fine-tuning it with LoRA, which updates far fewer parameters. Training data complexity: More complex training data will require more VRAM than simpler data. For example, training a model on high-resolution images will require more VRAM than training a model on low-resolution images. Desired image quality: The higher the desired quality of the generated images, the more VRAM you will need. This is because the model will need to store more information in memory to generate higher-quality images. Additional optimizations: Some software optimizations can help reduce VRAM usage, such as Xformers, a library of optimized Transformer building blocks that reduces the amount of VRAM required. What are Xformers?<br><a data-href="Xformers" href=".html" class="internal-link" target="_self" rel="noopener nofollow">Xformers</a> is a library of optimized building blocks for the Transformer, a neural network architecture that was originally developed for natural language processing (NLP) and has also been shown to be effective for image generation tasks. Transformers work by using self-attention mechanisms to learn long-range dependencies in the data. This allows them to capture more information from the input data, which can lead to better image generation results. How can Xformers reduce VRAM load? Xformers can reduce VRAM load by several mechanisms:
They use memory-efficient attention: instead of materializing the full attention matrix, attention is computed in blocks, which greatly reduces the memory required for intermediate results. They fuse operations: fused GPU kernels avoid storing intermediate tensors between consecutive operations, further reducing memory use during calculations. They are optimized for hardware: the kernels are tuned for GPUs, which can further reduce their memory footprint. Overall, Xformers is a useful tool for reducing VRAM load and improving image generation performance. An Nvidia Quadro RTX 4000 would suffice for small images, like 256x256 pixels, but it may not be ideal for training larger models or higher-resolution images. The Quadro RTX 4000 has 8GB of VRAM, which is enough for training some diffusion models on small images. However, if you are planning to train larger models or use higher-resolution images, you may want to consider a GPU with more VRAM, such as the Nvidia Quadro RTX 5000 or 6000. Here is a table that summarizes the approximate VRAM requirements for training different types of local diffusion models on different image resolutions: As you can see, the amount of VRAM required increases with the image resolution. This is because the model needs to store more information in order to generate higher-resolution images. Here is a table that summarizes the VRAM requirements for training different types of diffusion models with different amounts of training data: As you can see, the amount of VRAM required increases with the size of the training data.
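The way resolution drives VRAM can be made concrete with a back-of-envelope estimate. This sketch only counts parameters, gradients, optimizer state, and a crude activation term under stated assumptions (fp32, an Adam-style optimizer, fixed channel count and layer count); real frameworks differ, so treat it as a rough floor, not a prediction:

```python
# Rough VRAM estimate for training (fp32, Adam-style optimizer).
# Assumptions: 4 bytes per float; optimizer keeps 2 extra states per
# parameter (so weights + grads + 2 moments = 4 copies); activations
# approximated as `channels` feature maps at full resolution per layer.

def estimate_vram_gb(params_m, batch, height, width, channels=128, layers=20):
    bytes_per_float = 4
    # weights + gradients + two optimizer moment buffers
    param_bytes = params_m * 1e6 * bytes_per_float * 4
    # activations stored for the backward pass
    act_bytes = batch * channels * height * width * layers * bytes_per_float
    return (param_bytes + act_bytes) / 1e9

small = estimate_vram_gb(params_m=100, batch=4, height=256, width=256)
large = estimate_vram_gb(params_m=100, batch=4, height=512, width=512)
# Doubling the resolution roughly quadruples the activation memory,
# while the parameter-related memory stays constant.
```

This is why dropping the batch size or resolution, as suggested above, is the first lever to pull when VRAM runs out: only the activation term shrinks, but at high resolution that term dominates.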
This is because the model needs to learn more patterns from the data in order to generate high-quality images.]]></description><link>online-vault/ml-concepts/models/diffusion-models.html</link><guid isPermaLink="false">Online Vault/ML concepts/Models/Diffusion Models.md</guid><pubDate>Thu, 01 Feb 2024 08:30:45 GMT</pubDate><enclosure url="." length="0" type="false"/><content:encoded>&lt;figure&gt;&lt;img src=&quot;.&quot;&gt;&lt;/figure&gt;</content:encoded></item><item><title><![CDATA[Context_embedding]]></title><description><![CDATA[<img src="online-vault/images/context_embedding.png" target="_self">]]></description><link>online-vault/images/context_embedding.html</link><guid isPermaLink="false">Online Vault/Images/Context_embedding.png</guid><pubDate>Tue, 30 Jan 2024 14:23:18 GMT</pubDate><enclosure url="." length="0" type="false"/><content:encoded>&lt;figure&gt;&lt;img src=&quot;.&quot;&gt;&lt;/figure&gt;</content:encoded></item><item><title><![CDATA[unet_diffusion]]></title><description><![CDATA[<img src="online-vault/images/unet_diffusion.png" target="_self">]]></description><link>online-vault/images/unet_diffusion.html</link><guid isPermaLink="false">Online Vault/Images/unet_diffusion.png</guid><pubDate>Tue, 30 Jan 2024 13:50:00 GMT</pubDate><enclosure url="." length="0" type="false"/><content:encoded>&lt;figure&gt;&lt;img src=&quot;.&quot;&gt;&lt;/figure&gt;</content:encoded></item><item><title><![CDATA[A Brief History of Machine Learning]]></title><description><![CDATA[Machine learning (ML) has emerged as a transformative technology, revolutionizing various industries and shaping our daily lives. It enables machines to learn from data without explicit programming, enabling them to make predictions, decisions, and even creative outputs. 
The history of ML is a fascinating journey filled with groundbreaking ideas, challenges, and remarkable progress.Early Concepts and PioneersThe concept of machines that could learn and adapt dates back to ancient civilizations. Greek philosophers proposed ideas about self-learning machines, and Leonardo da Vinci envisioned a device that could recognize objects. However, the modern era of ML began in the 1940s and 1950s, with the advent of computers and the birth of artificial intelligence (AI).The Dawn of Machine LearningIn 1943, Warren McCulloch and Walter Pitts developed a theoretical model of neural networks, inspired by the structure of the human brain. This marked a significant step towards understanding how machines could learn from data. In 1959, Arthur Samuel coined the term "machine learning," describing a system that could improve its performance through experience.The Rise and Fall of the First AI BoomThe 1960s and early 1970s saw a surge of interest in AI, with researchers developing various machine learning algorithms. However, this initial wave of enthusiasm faced challenges, such as the complexity of these algorithms and the limitations of available datasets. As a result, the field of AI experienced a period of disillusionment, known as the "AI winter."The Renaissance of Machine LearningIn the 1980s and 1990s, machine learning experienced a resurgence, fueled by advancements in computer hardware, algorithms, and data availability. Researchers developed new methods, such as support vector machines and <a data-href="decision trees" href=".html" class="internal-link" target="_self" rel="noopener nofollow">decision trees</a>, which showed improved performance.The late 1990s and early 2000s witnessed the emergence of deep learning, a sub-field of machine learning inspired by the structure and function of the human brain. 
Deep learning algorithms are characterized by multiple layers of interconnected artificial neurons. Unlike traditional machine learning algorithms that rely on hand-crafted features, deep learning algorithms can learn these features automatically from the data itself. This ability to extract complex features from raw data has enabled deep learning algorithms to achieve remarkable success in a wide range of tasks, including image recognition, natural language processing, and speech recognition. The Role of Graphics Processing Units (GPUs): The development of deep learning has been heavily reliant on the advancement of graphics processing units (GPUs). GPUs, originally designed for processing graphics in video games, are uniquely suited for deep learning tasks due to their massive parallel processing capabilities. Modern GPUs can perform trillions of floating-point operations per second (teraFLOPS), making them far more efficient for training deep learning models than traditional CPUs. The First Algorithms that Popularized Deep Learning: Several key algorithms played a significant role in popularizing deep learning and driving its current widespread adoption. These algorithms include:
<br>Convolutional Neural Networks (CNNs): <a data-href="CNNs" href=".html" class="internal-link" target="_self" rel="noopener nofollow">CNNs</a> are a type of artificial neural network that is particularly well-suited for image recognition tasks. They are inspired by the structure of the mammalian visual cortex (first characterized in cats and humans) and excel at extracting features from images, such as edges, shapes, and textures.
<br>Recurrent Neural Networks (RNNs): <a data-href="RNNs" href=".html" class="internal-link" target="_self" rel="noopener nofollow">RNNs</a> are another type of artificial neural network that is designed to process sequential data. They are particularly well-suited for tasks such as natural language processing, where the order of words significantly impacts the meaning of a sentence.
<br>Autoencoders: <a data-href="Autoencoders" href="online-vault/ml-concepts/models/autoencoders.html" class="internal-link" target="_self" rel="noopener nofollow">Autoencoders</a> are a type of neural network that aims to learn an efficient representation of data. They are useful for <a data-href="dimensionality reduction" href="online-vault/ml-concepts/dimensionality-reduction.html" class="internal-link" target="_self" rel="noopener nofollow">dimensionality reduction</a>, anomaly detection, and image compression.
The Current Era of Machine LearningThe success of deep learning propelled machine learning into the mainstream, leading to a surge of applications in various industries, including healthcare, finance, transportation, and retail. Machine learning is now widely used for tasks such as fraud detection, customer segmentation, product recommendations, and medical diagnosis.Generative models have experienced a surge in popularity in recent years, due to a combination of factors including:
Advances in deep learning: Deep learning has revolutionized artificial intelligence, enabling the development of more powerful and sophisticated generative models. The ability of deep neural networks to learn complex patterns from data has allowed generative models to produce increasingly realistic and diverse outputs.
Increased availability of data: The amount of data available to train generative models has exploded in recent years. The vast amounts of data generated by the internet, social media, and other sources have provided generative models with the training data they need to learn complex patterns and generate realistic outputs.
Advancing computational power: The availability of powerful computing resources has been essential for the development and training of generative models. The computational demands of training these models have increased significantly, and the availability of powerful GPUs and specialized hardware has made it possible to train more complex and sophisticated models.
Demonstration of real-world applications: Generative models have demonstrated their potential to solve a wide range of real-world problems, such as generating realistic images, creating new music, and developing new drugs. These successes have helped to increase interest and investment in generative modeling research.
Public awareness and interest: The rise of large language models (LLMs) like GPT-3 and LaMDA has brought generative models into the public consciousness. These LLMs can generate realistic and coherent text, and their capabilities have captured the imagination of the public.
The origins of generative models can be traced back to the early days of artificial intelligence. In the 1950s, researchers began to explore the idea of using computers to generate creative content, such as text, music, and images. Early efforts in this area were limited by the computational resources available at the time, but they laid the groundwork for the development of more sophisticated generative models in the decades that followed.<br>In the 1980s and 1990s, researchers developed a number of new techniques for generating creative content, including genetic algorithms, simulated annealing, and <a data-href="Markov chains" href=".html" class="internal-link" target="_self" rel="noopener nofollow">Markov chains</a>. These techniques were able to produce more realistic and varied outputs than earlier methods, but they were still limited in their ability to capture complex patterns and generate truly creative content.The breakthrough that led to the rise of modern generative models came in the early 2000s with the development of deep learning. Deep neural networks, which are inspired by the structure of the human brain, are able to learn complex patterns from data with unprecedented accuracy. This has enabled the development of generative models that can generate increasingly realistic and diverse outputs, across a wide range of domains.As generative models continue to improve, they are poised to revolutionize a wide range of industries, including:
Art and design: Generative models can be used to create new art and design forms, such as photorealistic images, realistic 3D models, and new musical compositions. Drug discovery: Generative models can be used to design new drugs and therapies, by generating molecular structures with the desired properties. Financial modeling: Generative models can be used to model financial markets and predict future trends, by analyzing historical data and generating new scenarios. Education: Generative models can be used to personalize learning experiences, by creating adaptive teaching materials and providing intelligent feedback to students. The future of generative models is bright, and they are likely to play an increasingly important role in our lives in the years to come. As these models continue to evolve, we can expect to see even more groundbreaking applications that transform how we work, interact, and create.Several types of generative models can be used to generate tabular artificial data, each with its own strengths and weaknesses. Here are a few of the most common:<br>1. <a data-href="Generative Adversarial Networks" href=".html" class="internal-link" target="_self" rel="noopener nofollow">Generative Adversarial Networks</a> (GANs)GANs are a powerful type of generative model that involves two neural networks: a generator and a discriminator. The generator is responsible for creating new data samples, while the discriminator is responsible for distinguishing between real data samples and generated data samples. The two networks are trained in an adversarial manner, with the generator trying to fool the discriminator into thinking that its generated data is real.GANs are a versatile method for generating tabular data, and they have been shown to be effective at producing realistic data that resembles the training data. However, GANs can be difficult to train, and they are prone to producing artifacts or mode collapse.<br>2. 
<a data-href="Variational Autoencoders" href=".html" class="internal-link" target="_self" rel="noopener nofollow">Variational Autoencoders</a> (VAEs)VAEs are another popular type of generative model that learns a latent representation of the data. The latent representation is a compressed summary of the data that captures the underlying patterns and relationships between the data points. VAEs can then be used to generate new data samples by sampling from the latent representation and then decoding the sampled latent vectors into data points.VAEs are a more stable and interpretable method than GANs, and they are less prone to producing artifacts. However, VAEs can sometimes struggle to capture the full diversity of the data.<br>3. <a data-href="Diffusion Models" href="online-vault/ml-concepts/models/diffusion-models.html" class="internal-link" target="_self" rel="noopener nofollow">Diffusion Models</a>Diffusion models are a newer type of generative model that work by inverting the diffusion process that created the data. The diffusion process is a process of gradually degrading high-quality data into noise. Diffusion models can be used to generate new data samples by starting with high-quality noise and then gradually refining it until it resembles the training data.Diffusion models are a promising new method for generating tabular data, and they have shown to be effective at producing realistic data that is similar to the training data. However, diffusion models can be computationally expensive to train.In addition to these three methods, there are a number of other generative models that can be used for generating tabular data, such as autoregressive models and recurrent neural networks. The best method for a particular task will depend on the specific characteristics of the data and the desired properties of the generated data.
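The VAE step of "sampling from the latent representation and decoding" can be sketched with the reparameterization trick. Everything here is a stand-in: the mean/log-variance vectors and the linear "decoder" are illustrative assumptions replacing trained networks, so this shows only the shape of the sampling step:

```python
import numpy as np

rng = np.random.default_rng(42)

# Stand-in "encoder" outputs: mean and log-variance of the latent
# distribution (in a real VAE these come from a trained network).
mu = np.array([0.5, -1.0])
log_var = np.array([0.0, -2.0])

# Reparameterization trick: z = mu + sigma * eps, with eps ~ N(0, I).
# Writing the sample this way keeps it differentiable w.r.t. mu and log_var.
eps = rng.standard_normal((10000, 2))
z = mu + np.exp(0.5 * log_var) * eps

# New data points are produced by decoding z; here a stand-in linear map
# plays the role of the decoder network.
W = np.array([[1.0, 0.0], [0.5, 2.0]])
samples = z @ W
```

The same sampling pattern works for tabular data: each decoded vector is one synthetic row.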
Here is a table summarizing the strengths and weaknesses of each of the methods discussed above:Generating both an image and synthetic metadata simultaneously is a highly sought-after capability in the field of artificial intelligence, with applications across various domains. Diffusion models and variational autoencoders (VAEs) emerge as promising techniques to achieve this goal.Diffusion models excel at image generation by denoising noisy images, gradually refining them until they resemble the desired output. This process effectively encodes the underlying structure and patterns of the data, enabling the generation of both images and their corresponding metadata.VAEs, on the other hand, employ a different approach by encoding images into latent representations. These latent representations capture the essence of the image, including its features, style, and context. By manipulating these latent vectors, VAEs can generate new images with associated metadata, effectively bridging the gap between image generation and metadata synthesis.Machine learning has come a long way since its early conceptualizations, transforming from a niche field to a ubiquitous technology. Today, machine learning is revolutionizing industries, shaping our daily lives, and opening up new frontiers in science and technology. As this exciting field continues to evolve, we can expect even more groundbreaking advancements and far-reaching impacts in the years to come.]]></description><link>online-vault/ml-concepts/a-brief-history-of-machine-learning.html</link><guid isPermaLink="false">Online Vault/ML concepts/A Brief History of Machine Learning.md</guid><pubDate>Tue, 30 Jan 2024 13:48:20 GMT</pubDate></item><item><title><![CDATA[Human-in-the-Loop Machine Learning Models]]></title><description><![CDATA[In traditional machine learning, the model is trained entirely on data, without any human intervention. 
However, this approach can have limitations, particularly when dealing with complex or ambiguous data. Expert-in-the-loop (EITL) machine learning aims to address these limitations by incorporating human expertise into the training process.In EITL (or <a data-tooltip-position="top" aria-label="https://en.wikipedia.org/wiki/Human-in-the-loop" rel="noopener nofollow" class="external-link is-unresolved" href="https://en.wikipedia.org/wiki/Human-in-the-loop" target="_self">HITL</a>), human experts interact with the machine learning model in a feedback loop. The model suggests potential solutions or decisions, and the experts provide feedback, correcting errors or refining the model's understanding of the problem. This interaction allows the model to learn from human expertise and improve its performance over time.<br>HITL (Human-in-the-Loop) workflows can be considered <a data-href="online learning" href=".html" class="internal-link" target="_self" rel="noopener nofollow">online learning</a>. This is because they involve the continuous adaptation of a machine learning model to new data and feedback. In traditional machine learning, the model is trained on a static dataset of data and then deployed without further adaptation. However, in HITL workflows, the model is constantly being updated as new data becomes available and human experts provide feedback on the model's predictions.
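The feedback loop just described can be sketched as a tiny online-learning cycle. This is a schematic under loud assumptions: the "model" is a nearest-centroid classifier on 1-D toy data, and the "expert" is scripted by the true labels; a real HITL system would involve an actual annotation interface and a real model:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data: two Gaussian blobs (class 0 near -2, class 1 near +2).
X = np.concatenate([rng.normal(-2, 1, 100), rng.normal(2, 1, 100)])
y = np.array([0] * 100 + [1] * 100)

# Start with a deliberately poor model: both centroids near zero.
centroids = np.array([-0.1, 0.1])
labeled_X, labeled_y = [], []

def predict(x):
    return int(abs(x - centroids[1]) < abs(x - centroids[0]))

# HITL loop: the model suggests a label, the "expert" (here: the true
# label) confirms or corrects it, and the model is refit on the growing
# pool of expert-verified labels.
for i in rng.permutation(len(X))[:50]:
    _ = predict(X[i])        # model's suggestion (may be wrong)
    labeled_X.append(X[i])   # expert-verified label is recorded
    labeled_y.append(y[i])
    lx, ly = np.array(labeled_X), np.array(labeled_y)
    for c in (0, 1):
        if np.any(ly == c):  # refit each centroid from verified labels
            centroids[c] = lx[ly == c].mean()

accuracy = np.mean([predict(x) == t for x, t in zip(X, y)])
```

The key property shown is the loop itself: each round of expert feedback immediately updates the deployed model, which is what makes HITL a form of online learning.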
Supervised Learning with Human Feedback: In supervised learning, the model is trained on labeled data. In EITL, human experts provide feedback on the model's predictions, helping it to refine its understanding of the labels.
Interactive Machine Learning (IML): IML involves more frequent and focused interactions between the human and the machine. The human can guide the model through the learning process, providing input on specific data points or suggesting new features to consider.
Active Learning: In active learning, the human decides which data points to label for the model to learn from. This allows the human to focus on the most informative data, ensuring that the model is learning efficiently.
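The "choose the most informative data points" step of active learning can be sketched with uncertainty sampling. The scoring model below is a stand-in logistic function (an assumption, not a trained classifier), and in practice the selected points would be sent to the human for labeling:

```python
import numpy as np

rng = np.random.default_rng(1)

# Unlabeled pool of 1-D points; a stand-in model gives P(class 1 | x).
pool = rng.uniform(-3, 3, 200)

def predict_proba(x, w=1.5, b=0.0):
    # Stand-in logistic model; in practice this is the current model.
    return 1.0 / (1.0 + np.exp(-(w * x + b)))

# Uncertainty sampling: query the points whose predicted probability is
# closest to 0.5, i.e. where the model is least confident.
probs = predict_proba(pool)
uncertainty = -np.abs(probs - 0.5)        # higher means less confident
query_idx = np.argsort(uncertainty)[-5:]  # top-5 most uncertain points
queries = pool[query_idx]                 # these go to the human to label
```

Other query strategies (margin sampling, query-by-committee) differ only in how `uncertainty` is scored; the select-then-ask loop stays the same.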
There are several open-source implementations of human-in-the-loop (HITL) workflows available. Here are a few examples:
Encord: Encord is a collaborative, active learning suite of solutions for computer vision models, but it can also be used for other types of HITL processes. It provides a platform for annotators to label data, and it also includes tools for managing and evaluating the quality of the annotations.
Refinery: Refinery is a tool for scaling, assessing, and maintaining natural language data. It treats training data like a software artifact, allowing you to track changes, review annotations, and collaborate with other users.
Human-Lambdas: Human-Lambdas is an open-source platform for running your own private Mechanical Turk. It allows you to easily create HITL workflows and manage tasks and workers.
PlayML: PlayML is a platform for creating and running interactive machine learning experiments. It supports a variety of HITL techniques, such as active learning, explainability, and human-in-the-loop decision making.
AutoML Studio: AutoML Studio is a cloud-based platform for automating machine learning workflows. It includes a number of HITL features, such as active learning and model explainability.
These are just a few examples of the many open-source implementations of HITL workflows available.<br><a data-href="Active learning" href=".html" class="internal-link" target="_self" rel="noopener nofollow">Active learning</a> and human-in-the-loop (HITL) machine learning are both approaches that aim to improve the performance of machine learning models by incorporating human expertise into the learning process. However, there are some key differences between the two approaches.Active learning is a specific technique for selecting the most informative data points to label for the model to learn from. The goal is to maximize the model's learning while minimizing the amount of human labeling effort. Active learning algorithms use a variety of strategies to predict which data points are most likely to be informative, and they can also incorporate feedback from the human annotator to refine their predictions.HITL is a broader term that encompasses any approach that involves human interaction with a machine learning model. This can include active learning, but it can also include other techniques such as interactive machine learning (IML), where the human and the model work together to solve a problem, or explainability, where the human tries to understand how the model makes its decisions.In general, active learning is a more efficient approach to HITL, as it can minimize the amount of human labeling required to achieve a given level of performance. However, HITL can be more effective in cases where the human expertise is more valuable than the data itself.In addition to active learning and HITL, there are several other approaches that involve experts in the training process of machine learning models. These approaches can be broadly classified into three categories:
<br><a data-href="Interactive machine learning" href=".html" class="internal-link" target="_self" rel="noopener nofollow">Interactive machine learning</a> (IML): This approach involves more frequent and focused interactions between the human and the machine. The human can guide the model through the learning process, providing input on specific data points or suggesting new features to consider.
<br><a data-href="Explainable machine learning" href=".html" class="internal-link" target="_self" rel="noopener nofollow">Explainable machine learning</a>(XAI): This approach aims to make machine learning models more transparent and interpretable. This can be done by providing explanations for the model's predictions, or by identifying the features that the model is using to make decisions.
Human-in-the-loop decision making (HILDM): This approach involves using machine learning models to generate suggestions or recommendations, and then having humans make the final decision.
Here are some examples of specific techniques that fall into these categories:
Iterative model refinement: The human provides feedback on the model's predictions, and the model is updated accordingly. Human-in-the-loop optimization: The human suggests new features or parameters for the model, and the model is retrained to optimize those features or parameters. Interactive feature selection: The human selects the features that they think are most important, and the model is trained on those features. <br>Explainable AI (XAI) techniques: Feature importance, saliency maps, counterfactual explanations, and <a data-href="decision trees" href=".html" class="internal-link" target="_self" rel="noopener nofollow">decision trees</a>. Human-in-the-loop reinforcement learning (HI-RL): The human provides rewards or penalties to the model, and the model learns to act in a way that maximizes the rewards. These are just a few examples of the many approaches that involve experts involved in the training process of machine learning models. The best approach for a particular application will depend on the specific task at hand and the availability of human expertise.These methods do not necessarily use deep learning. They can be used with any type of machine learning model, including traditional models such as random forests and XGBoost. In fact, HITL and other expert-in-the-loop approaches can be particularly beneficial for traditional models, as they can help to overcome some of the limitations of these models, such as their limited ability to interpret data or handle complex tasks.Here are some specific examples of how HITL and other expert-in-the-loop approaches can be used to improve the performance of traditional models:
Random forests: HITL can be used to improve the performance of random forests by providing the model with more informative data points to split on. This can be done by having humans select the data points that they think are most important, or by using active learning algorithms to select the data points that are most likely to improve the model's performance. XGBoost: HITL can be used to improve the performance of XGBoost by providing the model with more accurate predictions to correct. This can be done by having humans review the model's predictions and provide feedback, or by using algorithms that detect and correct errors in the model's predictions. In general, HITL and other expert-in-the-loop approaches can be a powerful tool for improving the performance of machine learning models, regardless of the type of model being used. These approaches can help to overcome the limitations of traditional models and make them more effective in real-world applications.]]></description><link>online-vault/ml-concepts/models/human-in-the-loop-machine-learning-models.html</link><guid isPermaLink="false">Online Vault/ML concepts/Models/Human-in-the-Loop Machine Learning Models.md</guid><pubDate>Thu, 25 Jan 2024 15:02:48 GMT</pubDate></item><item><title><![CDATA[fuzzy_clustering]]></title><description><![CDATA[<img src="online-vault/images/fuzzy_clustering.png" target="_self">]]></description><link>online-vault/images/fuzzy_clustering.html</link><guid isPermaLink="false">Online Vault/Images/fuzzy_clustering.png</guid><pubDate>Tue, 23 Jan 2024 16:11:49 GMT</pubDate><enclosure url="." 
length="0" type="false"/><content:encoded>&lt;figure&gt;&lt;img src=&quot;.&quot;&gt;&lt;/figure&gt;</content:encoded></item><item><title><![CDATA[DBSCAN_tutorial]]></title><description><![CDATA[<img src="online-vault/images/dbscan_tutorial.gif" target="_self">]]></description><link>online-vault/images/dbscan_tutorial.html</link><guid isPermaLink="false">Online Vault/Images/DBSCAN_tutorial.gif</guid><pubDate>Tue, 23 Jan 2024 15:21:52 GMT</pubDate><enclosure url="." length="0" type="false"/><content:encoded>&lt;figure&gt;&lt;img src=&quot;.&quot;&gt;&lt;/figure&gt;</content:encoded></item><item><title><![CDATA[hierarch]]></title><description><![CDATA[<img src="online-vault/images/hierarch.gif" target="_self">]]></description><link>online-vault/images/hierarch.html</link><guid isPermaLink="false">Online Vault/Images/hierarch.gif</guid><pubDate>Tue, 23 Jan 2024 15:20:36 GMT</pubDate><enclosure url="." length="0" type="false"/><content:encoded>&lt;figure&gt;&lt;img src=&quot;.&quot;&gt;&lt;/figure&gt;</content:encoded></item><item><title><![CDATA[k_means_animation]]></title><description><![CDATA[<img src="online-vault/images/k_means_animation.gif" target="_self">]]></description><link>online-vault/images/k_means_animation.html</link><guid isPermaLink="false">Online Vault/Images/k_means_animation.gif</guid><pubDate>Tue, 23 Jan 2024 15:17:03 GMT</pubDate><enclosure url="." 
length="0" type="false"/><content:encoded>&lt;figure&gt;&lt;img src=&quot;.&quot;&gt;&lt;/figure&gt;</content:encoded></item><item><title><![CDATA[dendrogram]]></title><description><![CDATA[Source : <a data-tooltip-position="top" aria-label="https://www.statisticshowto.com/hierarchical-clustering/" rel="noopener nofollow" class="external-link is-unresolved" href="https://www.statisticshowto.com/hierarchical-clustering/" target="_self">Hierarchical Clustering / Dendrogram: Simple Definition, Examples</a><br>Hierarchical <a data-href="Clustering" href="online-vault/ml-concepts/clustering.html" class="internal-link" target="_self" rel="noopener nofollow">Clustering</a>&nbsp;is where you build a cluster tree (a dendrogram) to represent data, where each group (or “node”) links to two or more successor groups. The groups are&nbsp;<a data-tooltip-position="top" aria-label="https://www.statisticshowto.com/nested-model-anova-factors/#model" rel="noopener nofollow" class="external-link is-unresolved" href="https://www.statisticshowto.com/nested-model-anova-factors/#model" target="_self">nested</a>&nbsp;and organized as a tree, which ideally ends up as a meaningful classification scheme.<br>It is an unsupervised machine learning method, which means the algorithm learns from the data itself to identify the underlying structure of the clusters.<br>In comparison, a supervised <a data-href="decision tree" href="online-vault/ml-concepts/models/decision-tree.html" class="internal-link" target="_self" rel="noopener nofollow">decision tree</a> is a tree-like structure used in supervised learning, where the algorithm learns from labeled data to classify new data points into predefined categories or classes. 
Each node in the decision tree represents a decision based on a specific feature, and the branches lead to subsequent decisions or leaf nodes, which represent the predicted class labels.<br>Each node in the cluster tree contains a group of similar data; nodes group on the graph next to other, similar nodes. Clusters at one level join with clusters in the next level up, using a degree of similarity; the process carries on until all nodes are in the tree, which gives a visual snapshot of the data contained in the whole set. The total number of clusters is&nbsp;not&nbsp;predetermined before you start the tree creation, in contrast to regular decision trees, which have predetermined branches and nodes.<br><a data-tooltip-position="top" aria-label="https://www.statisticshowto.com/wp-content/uploads/2016/11/clustergram.png" rel="noopener nofollow" class="external-link is-unresolved" href="https://www.statisticshowto.com/wp-content/uploads/2016/11/clustergram.png" target="_self"></a><a data-tooltip-position="top" aria-label="https://www.statisticshowto.com/wp-content/uploads/2016/11/clustergram.png" rel="noopener nofollow" class="external-link is-unresolved" href="https://www.statisticshowto.com/wp-content/uploads/2016/11/clustergram.png" target="_self"><figcaption class="image-captions-caption"></figcaption></a><img alt="hierarchical clustering" src="https://www.statisticshowto.com/wp-content/uploads/2016/11/clustergram.png" referrerpolicy="no-referrer" target="_self" class="is-unresolved">A dendrogram (right) representing nested clusters (left).<br>A dendrogram is a type of&nbsp;<a data-tooltip-position="top" aria-label="https://www.statisticshowto.com/how-to-use-a-probability-tree-for-probability-questions/" rel="noopener nofollow" class="external-link is-unresolved" href="https://www.statisticshowto.com/how-to-use-a-probability-tree-for-probability-questions/" target="_self">tree diagram</a>&nbsp;showing hierarchical clustering — relationships 
between similar sets of data. They are frequently used in biology to show clustering between genes or samples, but they can represent any type of grouped data.<br><a data-tooltip-position="top" aria-label="https://www.statisticshowto.com/wp-content/uploads/2016/11/dendrogram.png" rel="noopener nofollow" class="external-link is-unresolved" href="https://www.statisticshowto.com/wp-content/uploads/2016/11/dendrogram.png" target="_self"></a><a data-tooltip-position="top" aria-label="https://www.statisticshowto.com/wp-content/uploads/2016/11/dendrogram.png" rel="noopener nofollow" class="external-link is-unresolved" href="https://www.statisticshowto.com/wp-content/uploads/2016/11/dendrogram.png" target="_self"><figcaption class="image-captions-caption"></figcaption></a><img alt="dendrogram" src="https://www.statisticshowto.com/wp-content/uploads/2016/11/dendrogram.png" referrerpolicy="no-referrer" target="_self" class="is-unresolved">A dendrogram can be a column graph (as in the image above) or a row graph. Some dendrograms are circular or have a fluid shape, but software will usually produce a row or column graph. No matter what the shape, the basic graph comprises the same parts:
The&nbsp;clade&nbsp;is the branch, usually labeled with Greek letters from left to right (e.g. α, β, δ, …).
Each clade has one or more&nbsp;leaves. The leaves in the above image are: Single (simplicifolius): F
Double (bifolius): D E
Triple (trifolius): A B C<br>A clade can theoretically have an infinite number of leaves. However, the more leaves you have, the harder the graph will be to read with the naked eye.<br>The clades are arranged according to how similar (or dissimilar) they are. Clades that are close to the same height are similar to each other; clades with different heights are dissimilar —&nbsp;the greater the difference in height, the more dissimilarity&nbsp;(you can measure similarity in many different ways; one of the most popular measures is&nbsp;<a data-tooltip-position="top" aria-label="https://www.statisticshowto.com/probability-and-statistics/correlation-coefficient-formula/" rel="noopener nofollow" class="external-link is-unresolved" href="https://www.statisticshowto.com/probability-and-statistics/correlation-coefficient-formula/" target="_self">Pearson’s Correlation Coefficient</a>).<br>
<a data-tooltip-position="top" aria-label="https://www.statisticshowto.com/wp-content/uploads/2016/11/dendrogram2.png" rel="noopener nofollow" class="external-link is-unresolved" href="https://www.statisticshowto.com/wp-content/uploads/2016/11/dendrogram2.png" target="_self"></a><a data-tooltip-position="top" aria-label="https://www.statisticshowto.com/wp-content/uploads/2016/11/dendrogram2.png" rel="noopener nofollow" class="external-link is-unresolved" href="https://www.statisticshowto.com/wp-content/uploads/2016/11/dendrogram2.png" target="_self"><figcaption class="image-captions-caption"></figcaption></a><img alt="dendrogram2" src="https://www.statisticshowto.com/wp-content/uploads/2016/11/dendrogram2.png" referrerpolicy="no-referrer" target="_self" class="is-unresolved">Leaves A, B, and C are more similar to each other than they are to leaves D, E, or F.
Leaves D and E are more similar to each other than they are to leaves A, B, C, or F.
Leaf F is substantially different from all of the other leaves.
Note that on the above graph, the same clade, β, joins leaves A, B, C, D, and E. That means that the two groups (A, B, C &amp; D, E) are more similar to each other than they are to F.<br>All hierarchical clustering algorithms are&nbsp;<a data-tooltip-position="top" aria-label="https://www.statisticshowto.com/sequence-and-series/monotonic-sequence-series-function/#relationship" rel="noopener nofollow" class="external-link is-unresolved" href="https://www.statisticshowto.com/sequence-and-series/monotonic-sequence-series-function/#relationship" target="_self">monotonic</a>&nbsp;— they either increase or decrease. The algorithms can be&nbsp;bottom up&nbsp;or&nbsp;top down:<br>1. Bottom up (<a data-tooltip-position="top" aria-label="https://www.statisticshowto.com/agglomerative-clustering/" rel="noopener nofollow" class="external-link is-unresolved" href="https://www.statisticshowto.com/agglomerative-clustering/" target="_self">Hierarchical Agglomerative Clustering</a>, HAC):
Treat each document as a single cluster at the beginning of the algorithm.
Merge (agglomerate) two items at a time into a new cluster.&nbsp;How&nbsp;the pairs merge involves calculating a dissimilarity between each merged pair and the other samples. There are&nbsp;many&nbsp;ways to do this. Popular options: <br><a data-tooltip-position="top" aria-label="https://www.statisticshowto.com/complete-linkage-clustering/" rel="noopener nofollow" class="external-link is-unresolved" href="https://www.statisticshowto.com/complete-linkage-clustering/" target="_self">Complete linkage</a>: similarity of the farthest pair. One drawback is that&nbsp;<a data-tooltip-position="top" aria-label="https://www.statisticshowto.com/statistics-basics/find-outliers/" rel="noopener nofollow" class="external-link is-unresolved" href="https://www.statisticshowto.com/statistics-basics/find-outliers/" target="_self">outliers</a>&nbsp;can cause merging of close groups later than is optimal.
Single-linkage: similarity of the closest pair. This can cause premature merging of groups with close pairs, even if those groups are quite dissimilar overall.
<br>Group&nbsp;<a data-tooltip-position="top" aria-label="https://www.statisticshowto.com/arithmetic-mean/" rel="noopener nofollow" class="external-link is-unresolved" href="https://www.statisticshowto.com/arithmetic-mean/" target="_self">average</a>: similarity between groups.
Centroid similarity: each iteration merges the clusters with the most similar central point. The pairing process continues until all items merge into a single cluster.
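The bottom-up procedure above can be sketched with SciPy's hierarchical-clustering routines. This is an illustrative sketch on made-up points; the `method` argument selects among the merge criteria just listed (complete, single, average, centroid):

```python
# Bottom-up (agglomerative) clustering sketch with SciPy.
import numpy as np
from scipy.cluster.hierarchy import linkage, dendrogram, fcluster

rng = np.random.default_rng(0)
# Two well-separated groups of 2-D points (made-up data).
X = np.vstack([rng.normal(0, 0.5, (5, 2)), rng.normal(4, 0.5, (5, 2))])

# method= selects the merge criterion: "complete", "single", "average", "centroid".
Z = linkage(X, method="average")
print(Z.shape)  # (9, 4): n-1 merges, each row = (cluster_i, cluster_j, distance, size)

# Compute the dendrogram layout without drawing it (omit no_plot to plot it).
d = dendrogram(Z, no_plot=True)

# Cut the tree into two flat clusters.
labels = fcluster(Z, t=2, criterion="maxclust")
```

Swapping `method="average"` for `"single"` or `"complete"` reproduces the trade-offs described above without changing any other code.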
<br>HACs account for the vast majority of hierarchical clustering algorithms. However, one downside is that they have&nbsp;significant computational and storage requirements&nbsp;— especially for big data. These algorithms are far more expensive than the&nbsp;<a data-tooltip-position="top" aria-label="https://www.statisticshowto.com/clustering/" rel="noopener nofollow" class="external-link is-unresolved" href="https://www.statisticshowto.com/clustering/" target="_self">K-means algorithm</a> (at least quadratic in the number of samples). Also, merging can’t be reversed, which can create a problem if you have noisy, high-dimensional data.<br>2. Top down (Divisive Clustering):
Data starts as one combined cluster.
The cluster splits into two distinct parts, according to some degree of similarity.
Clusters split into two again and again until the clusters only contain a single data point.
Divisive clustering is&nbsp;very rarely used.<br>Hierarchical clustering can easily lead to dendrograms that are just plain wrong.&nbsp;Unless you know your data inside out (pretty much impossible for big data sets), this is largely unavoidable. One of the main reasons for this is that the clustering algorithm will work even on the most unsuitable data. Another reason is that the decision you make for creating clusters (Step 2 above) can lead to significantly different dendrograms. The choice can be tough to make in advance, and you may not be able to tell which of the four end results is the most suitable.<br>The fact that the hierarchical clustering algorithm will work even if presented with seemingly unrelated data can be a positive as well as a negative. For example, a&nbsp;<a data-tooltip-position="top" aria-label="http://www.pnas.org/content/100/14/8418.abstract" rel="noopener nofollow" class="external-link is-unresolved" href="http://www.pnas.org/content/100/14/8418.abstract" target="_self">2003 research team</a>&nbsp;used hierarchical clustering to “support the idea that many…breast tumor subtypes represent biologically distinct disease entities.” To the human eye, the original data looked like noise, but the algorithm was able to find patterns.<br>In summary, dendrograms and supervised decision trees are both powerful tools for analyzing and understanding data. 
Dendrograms are well-suited for exploratory data analysis and uncovering hidden patterns in unlabeled data, while supervised decision trees excel in classification tasks and accurately assigning data points to predefined categories.]]></description><link>online-vault/ml-concepts/models/dendrogram.html</link><guid isPermaLink="false">Online Vault/ML concepts/Models/dendrogram.md</guid><pubDate>Tue, 23 Jan 2024 14:31:46 GMT</pubDate><enclosure url="https://www.statisticshowto.com/wp-content/uploads/2016/11/clustergram.png" length="0" type="image/png"/><content:encoded>&lt;figure&gt;&lt;img src=&quot;https://www.statisticshowto.com/wp-content/uploads/2016/11/clustergram.png&quot;&gt;&lt;/figure&gt;</content:encoded></item><item><title><![CDATA[dendrogram_example_1]]></title><description><![CDATA[<img src="online-vault/images/dendrogram_example_1.png" target="_self">]]></description><link>online-vault/images/dendrogram_example_1.html</link><guid isPermaLink="false">Online Vault/Images/dendrogram_example_1.png</guid><pubDate>Mon, 22 Jan 2024 16:19:26 GMT</pubDate><enclosure url="." length="0" type="false"/><content:encoded>&lt;figure&gt;&lt;img src=&quot;.&quot;&gt;&lt;/figure&gt;</content:encoded></item><item><title><![CDATA[example_decision_tree]]></title><description><![CDATA[<img src="online-vault/images/example_decision_tree.png" target="_self">]]></description><link>online-vault/images/example_decision_tree.html</link><guid isPermaLink="false">Online Vault/Images/example_decision_tree.png</guid><pubDate>Mon, 22 Jan 2024 16:18:57 GMT</pubDate><enclosure url="." length="0" type="false"/><content:encoded>&lt;figure&gt;&lt;img src=&quot;.&quot;&gt;&lt;/figure&gt;</content:encoded></item><item><title><![CDATA[Static Web site]]></title><description><![CDATA[A static site is a website that is made up of pre-built HTML pages that are stored on a web server. 
When a user visits a static site, the web server delivers the HTML page to the user's browser without any processing. This is in contrast to dynamic websites, which are generated on the fly by a web server based on user input or database queries.<br>The term "static" in this context means that the content of the website is not changing. Once the HTML pages are created, they remain the same until they are manually updated. This makes static sites very fast and efficient, as there is no need for the web server to do any processing each time a page is requested.<br>Static sites are also very secure, as they do not have a backend server that can be exploited by hackers. Additionally, static sites are very easy to maintain, as there is no need to worry about server-side code or security updates.<br>Here are some of the benefits of using static sites:
Fast loading times:&nbsp;Static sites load very quickly, as the HTML pages are already stored on the web server.
Enhanced security:&nbsp;Static sites are less susceptible to security vulnerabilities than dynamic sites.
Reduced hosting costs:&nbsp;Static sites can be hosted on inexpensive cloud storage or CDNs.
Easy to maintain:&nbsp;Static sites are very easy to maintain, as there is no need to worry about server-side code or security updates.
Here are some of the limitations of using static sites:
No dynamic content:&nbsp;Static sites cannot display dynamic content, such as real-time data or user-generated content.
Less interactive:&nbsp;Static sites are less interactive than dynamic sites, as they cannot support features such as logins, forms, or shopping carts.
More difficult to update:&nbsp;Static sites can be more difficult to update than dynamic sites, as you need to re-generate the HTML pages each time you make a change to your content.
Despite these limitations, static sites are a good choice for many websites, especially those that are focused on displaying static content such as blog posts, portfolios, or landing pages.<br>A static site generator (SSG) is a software application that takes content and converts it into static HTML files that can be served directly to a web browser. This is in contrast to traditional web development, where content is stored in a database and dynamically generated when a user requests a page. Static sites are typically faster to load and more secure than dynamic sites, as there is no need for a backend server to process each request.<br>Benefits of using a static site generator:
Faster loading times: Static sites are typically much faster to load than dynamic sites, as the HTML pages are pre-generated and do not require any server-side processing.
Enhanced security: Static sites are less susceptible to security vulnerabilities than dynamic sites, as there is no need for a backend server to be exposed to the internet.
Reduced hosting costs: Static sites can be hosted on inexpensive cloud storage or CDNs, as there is no need for a dedicated hosting plan.
Better SEO: Static sites are often better optimized for search engines than dynamic sites, as search engines can easily crawl and index the static HTML pages.
Examples of static site generators:
Jekyll: Jekyll is a popular static site generator that is written in Ruby. It is known for its simplicity and ease of use.<br>Gatsby: Gatsby is a static site generator that is based on React. It is known for its performance and ability to generate complex websites.<br>Hugo: Hugo is a static site generator that is written in Go. It is known for its speed and flexibility.<br>Next.js: Next.js is a static site generator that is also a JavaScript framework. It is known for its ability to generate both static and server-side rendered sites.
Docusaurus: Docusaurus is a documentation site generator developed by Facebook.<br>While StackEdit (<a data-tooltip-position="top" aria-label="https://stackedit.io/app#" rel="noopener nofollow" class="external-link is-unresolved" href="https://stackedit.io/app#" target="_self">StackEdit</a>, an open-source editor that converts Markdown to HTML in a simple way) is a markdown editor that can output HTML files, and Pandoc is a command-based document conversion tool, Docusaurus is a static site generator. Therefore, Docusaurus is more useful for creating multi-page websites than for creating a single HTML page.<br>The features of Docusaurus, and what I like about it, are as follows.
Static Site Generator.
Good at building documentation sites.
It can convert markdown files into static sites.
It supports React.
It has features for multilingual sites.
It has a document version control feature.
Supports a site search function (Algolia documentation search)
Has a deployment function<br>Although it is a tool for building document sites, it can also be used to create independent pages such as a home page, company profiles (like a "page" in WordPress), and blog posts.<br>When to use a static site generator: Static site generators are a good choice for websites that require:
Fast loading times: If your website has a lot of content or is heavily trafficked, a static site generator can help to improve loading times.
Enhanced security: If your website is a target for hackers, a static site generator can help to protect it from security vulnerabilities.
Reduced hosting costs: If you are on a tight budget, a static site generator can help you to save money on hosting costs.
Better SEO: If you want your website to rank well in search engine results pages (SERPs), a static site generator can help you to improve your SEO.
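To make the core idea concrete, here is a toy static site generator in plain Python: it takes page data and pre-builds HTML files, which is the essential job Jekyll, Hugo, and friends perform. The page titles, template, and `dist` output folder are all made up for illustration:

```python
# A toy static site generator: turns a dict of page titles/bodies into
# pre-built HTML files (the core idea behind Jekyll, Hugo, etc.).
from pathlib import Path
from string import Template

TEMPLATE = Template("""<!DOCTYPE html>
<html><head><title>$title</title></head>
<body><h1>$title</h1>$body</body></html>""")

def build_site(pages: dict, out_dir: str = "dist") -> list:
    """Write one static HTML file per page and return the paths written."""
    out = Path(out_dir)
    out.mkdir(exist_ok=True)
    written = []
    for title, body in pages.items():
        slug = title.lower().replace(" ", "-") + ".html"  # e.g. "About" -> about.html
        path = out / slug
        path.write_text(TEMPLATE.substitute(title=title, body=body), encoding="utf-8")
        written.append(path)
    return written

files = build_site({"About": "<p>Static sites are pre-built.</p>"})
print(files[0].name)  # about.html
```

A real generator adds Markdown parsing, layouts, and asset pipelines on top, but the "pre-build everything, serve files as-is" principle is exactly this.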
Here are some examples of websites that use static site generators:
GitHub Pages: GitHub Pages is a static site hosting service that is used by many developers to host their projects.
Smashing Magazine: the web-design publication rebuilt its site around a static site generator (Hugo) as part of its move to the JAMstack.
Project documentation: many open-source projects publish their documentation as static sites built with generators such as Docusaurus or MkDocs.
If you are considering using a static site generator for your website, there are a few things to keep in mind:
Static sites are not good for websites that require dynamic content. For example, if your website needs to display real-time data or user-generated content, a static site generator is not the best choice.
Static sites can be more difficult to update than dynamic sites. This is because you will need to re-generate the static HTML pages each time you make a change to your content.
Despite these limitations, static site generators are a powerful tool that can be used to create fast, secure, and SEO-friendly websites. If you are looking for a way to improve your website's performance and reduce your hosting costs, I encourage you to consider using a static site generator.]]></description><link>online-vault/exploration/static-web-site.html</link><guid isPermaLink="false">Online Vault/Exploration/Static Web site.md</guid><pubDate>Mon, 22 Jan 2024 11:20:08 GMT</pubDate></item><item><title><![CDATA[Moran's I]]></title><description><![CDATA[<a href=".?query=tag:spatial" class="tag is-unresolved" target="_self" rel="noopener nofollow" data-href="#spatial">#spatial</a><br>
Moran's I and <a data-href="Geary's C" href="online-vault/spatial-data-science/geary's-c.html" class="internal-link" target="_self" rel="noopener nofollow">Geary's C</a> are two statistical measures used to assess the strength and direction of spatial autocorrelation. Spatial autocorrelation is the tendency for similar values of a variable to occur in close proximity to each other. Moran's I is a global, cross-product statistic, while Geary's C is based on squared differences between neighbouring values and is therefore more sensitive to local dissimilarity.<br>Moran's I is calculated as follows:<br>Moran's I = (n / S0) × (Σi Σj wij (zi − z̄)(zj − z̄)) / (Σi (zi − z̄)²)
where:
n is the number of observations
wij is the spatial weight between observations i and j
zi is the value of the variable at observation i
z̄ is the mean of the variable
S0 = Σi Σj wij is the sum of all spatial weights
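As a sketch, the standard formula can be computed directly with NumPy. The data and the binary-adjacency weight matrix below are made up for illustration (a 1-D chain of five locations where neighbours are adjacent cells):

```python
# Moran's I from its standard definition (toy example with made-up data).
import numpy as np

def morans_i(z: np.ndarray, w: np.ndarray) -> float:
    n = len(z)
    zc = z - z.mean()                   # deviations from the mean
    s0 = w.sum()                        # sum of all spatial weights
    num = (w * np.outer(zc, zc)).sum()  # sum_ij w_ij (z_i - zbar)(z_j - zbar)
    den = (zc ** 2).sum()
    return (n / s0) * num / den

# 1-D chain of 5 locations; neighbours are adjacent cells (binary weights).
z = np.array([1.0, 2.0, 3.0, 4.0, 5.0])  # a smooth spatial trend
w = np.zeros((5, 5))
for i in range(4):
    w[i, i + 1] = w[i + 1, i] = 1.0

print(round(morans_i(z, w), 3))  # → 0.5 (positive autocorrelation)
```

Libraries such as PySAL compute the same statistic with richer weight structures, but the arithmetic is exactly this.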
Moran's I can take on values between -1 and 1. A value of 1 indicates perfect positive autocorrelation, a value of 0 indicates no autocorrelation, and a value of -1 indicates perfect negative autocorrelation.]]></description><link>online-vault/spatial-data-science/moran&apos;s-i.html</link><guid isPermaLink="false">Online Vault/Spatial Data Science/Moran&apos;s I.md</guid><pubDate>Thu, 18 Jan 2024 11:19:08 GMT</pubDate></item><item><title><![CDATA[Geary's C]]></title><description><![CDATA[<a href=".?query=tag:spatial" class="tag is-unresolved" target="_self" rel="noopener nofollow" data-href="#spatial">#spatial</a>
Geary's C is calculated as follows:<br>Geary's C = ((n − 1) Σi Σj wij (zi − zj)²) / (2 S0 Σi (zi − z̄)²)
where:
n is the number of observations
wij is the spatial weight between observations i and j
zi is the value of the variable at observation i
z̄ is the mean of the variable
S0 = Σi Σj wij is the sum of all spatial weights
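A NumPy sketch of the standard definition, on the same kind of made-up 1-D chain used for Moran's I (binary adjacency weights, illustrative only):

```python
# Geary's C from its standard definition (toy example with made-up data).
import numpy as np

def gearys_c(z: np.ndarray, w: np.ndarray) -> float:
    n = len(z)
    zc = z - z.mean()
    s0 = w.sum()                              # sum of all spatial weights
    diff2 = (z[:, None] - z[None, :]) ** 2    # (z_i - z_j)^2 for all pairs
    num = (n - 1) * (w * diff2).sum()
    den = 2.0 * s0 * (zc ** 2).sum()
    return num / den

# 1-D chain of 5 locations; neighbours are adjacent cells (binary weights).
z = np.array([1.0, 2.0, 3.0, 4.0, 5.0])  # a smooth spatial trend
w = np.zeros((5, 5))
for i in range(4):
    w[i, i + 1] = w[i + 1, i] = 1.0

print(gearys_c(z, w))  # → 0.2, i.e. well below 1: positive autocorrelation
```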
Geary's C takes values from 0 to roughly 2. A value of 1 indicates no spatial autocorrelation, values below 1 indicate positive autocorrelation (neighbouring values are similar), and values above 1 indicate negative autocorrelation.]]></description><link>online-vault/spatial-data-science/geary&apos;s-c.html</link><guid isPermaLink="false">Online Vault/Spatial Data Science/Geary&apos;s C.md</guid><pubDate>Thu, 18 Jan 2024 11:18:57 GMT</pubDate></item><item><title><![CDATA[Spatial modelling]]></title><description><![CDATA[<a href=".?query=tag:spatial" class="tag is-unresolved" target="_self" rel="noopener nofollow" data-href="#spatial">#spatial</a> <a href=".?query=tag:models" class="tag is-unresolved" target="_self" rel="noopener nofollow" data-href="#models">#models</a> <br><a data-tooltip-position="top" aria-label="https://www.sciencedirect.com/science/article/pii/S2667393222000072#bib30" rel="noopener nofollow" class="external-link is-unresolved" href="https://www.sciencedirect.com/science/article/pii/S2667393222000072#bib30" target="_self">Spatially autocorrelated training and validation samples inflate performance assessment of convolutional neural networks</a> Summary: Convolutional neural networks (CNNs) are powerful tools for remote sensing applications, but their performance can be significantly impacted by spatial autocorrelation, a phenomenon where nearby observations are more similar than distant ones. When spatial autocorrelation is not accounted for in cross-validation, it can lead to over-optimistic evaluation of CNN models, giving a false impression of their generalization ability. To address this issue, spatial cross-validation techniques are employed, which create independent training and validation sets by spatially blocking or buffering observations. 
The authors demonstrate the effectiveness of spatial cross-validation in a case study on tree species segmentation, highlighting its importance for accurate assessment of CNN models in remote sensing applications.<br>(GPT-4): Spatial autocorrelation can also impact the performance of decision tree algorithms in remote sensing applications. Decision trees are tree-like structures that recursively partition the feature space based on decision rules. However, when spatial autocorrelation is present, nearby observations are likely to share similar decision rules, leading to overfitting and sub-par performance on unseen data. To mitigate this issue, spatial cross-validation techniques can be applied to decision trees, similar to how they are used for CNNs. These techniques ensure that training and validation sets are spatially independent, preventing overfitting and providing a more accurate assessment of model generalization ability.<br>The study used a multicopter to capture RGB orthoimages of 47 forest sites. The orthoimages were created between 2017 and 2019 and cover a variety of conditions, including different illumination conditions, vegetation status, forest structural characteristics, and site characteristics. The orthoimages were then cropped into non-overlapping tiles and used to train a CNN-based segmentation model to classify each pixel in a tile into one of the target tree species. The masks used for training were created from polygons available for all targeted species, which were created with visual interpretation from imagery aided with ground observations. The entire dataset, including orthoimagery, tree-species delineations, and its metadata, is openly accessible.<br>The authors investigated the degree of optimism in tree species classification models trained on spatially autocorrelated training data. They found that optimism occurs across small and large sample sizes and that model regularization via data augmentation can help to reduce optimism. 
They evaluated different model setups with random and block cross-validation and found that block cross-validation is more effective at reducing optimism.<br>The authors used a variational autoencoder (VAE) to quantify the spatial autocorrelation between image tiles. The VAE was trained on a dataset of image tiles from 47 forest sites. The latent representation of each image tile was then used to calculate the correlation between the tile and its neighbors. The authors found that the <a data-href="spatial autocorrelation" href="online-vault/spatial-data-science/spatial-autocorrelation.html" class="internal-link" target="_self" rel="noopener nofollow">spatial autocorrelation</a> between image tiles was strong, especially at short distances. They also found that the spatial autocorrelation of image tiles was similar to the spatial autocorrelation of tree species cover.<br>Here are some of the key points from the text:
Variational autoencoders can be used to quantify the spatial autocorrelation between high-dimensional image-type observations.
The spatial autocorrelation between image tiles is strong, especially at short distances.
The spatial autocorrelation of image tiles is similar to the spatial autocorrelation of tree species cover.
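A minimal sketch of the blocking idea behind spatial (block) cross-validation: samples are assigned to grid cells, and whole cells are held out together, so validation samples come from different blocks than training samples. The coordinates, grid size, and fold count below are arbitrary illustrations, not the authors' setup:

```python
# Spatial block cross-validation sketch: hold out whole spatial blocks
# so training and validation sets are spatially separated (illustrative only).
import numpy as np

def block_folds(coords: np.ndarray, block_size: float, n_folds: int):
    """Yield (train_idx, test_idx) pairs with entire grid blocks held out."""
    # Assign each sample to a grid cell, then assign cells round-robin to folds.
    cells = np.floor(coords / block_size).astype(int)
    cell_ids = cells[:, 0] * 100_003 + cells[:, 1]  # simple unique cell key
    fold_of_cell = {c: i % n_folds for i, c in enumerate(np.unique(cell_ids))}
    fold = np.array([fold_of_cell[c] for c in cell_ids])
    for k in range(n_folds):
        yield np.where(fold != k)[0], np.where(fold == k)[0]

rng = np.random.default_rng(0)
coords = rng.uniform(0, 10, (200, 2))  # made-up sample locations
for train, test in block_folds(coords, block_size=2.5, n_folds=4):
    print(len(train), len(test))
```

Compared with random K-fold, neighbouring samples no longer straddle the train/validation boundary, which is what removes the optimistic bias described above (up to block-edge effects, which buffering addresses).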
]]></description><link>online-vault/spatial-data-science/spatial-modelling.html</link><guid isPermaLink="false">Online Vault/Spatial Data Science/Spatial modelling.md</guid><pubDate>Thu, 18 Jan 2024 11:12:11 GMT</pubDate></item><item><title><![CDATA[Knowledge_graph_schematic]]></title><description><![CDATA[<img src="online-vault/images/knowledge_graph_schematic.png" target="_self">]]></description><link>online-vault/images/knowledge_graph_schematic.html</link><guid isPermaLink="false">Online Vault/Images/Knowledge_graph_schematic.png</guid><pubDate>Wed, 17 Jan 2024 13:49:23 GMT</pubDate><enclosure url="." length="0" type="false"/><content:encoded>&lt;figure&gt;&lt;img src=&quot;.&quot;&gt;&lt;/figure&gt;</content:encoded></item><item><title><![CDATA[Hybrid Models]]></title><description><![CDATA[<a href=".?query=tag:models" class="tag is-unresolved" target="_self" rel="noopener nofollow" data-href="#models">#models</a> <br><img alt="ConvXGB.png" src="online-vault/images/convxgb.png" target="_self">Combining a Convolutional Neural Network (CNN) for feature extraction with a custom classifier like XGBoost for class prediction offers a powerful strategy for improving the performance of machine learning models. This approach leverages the strengths of both methods to achieve better generalization and accuracy.CNN for Feature Extraction:CNNs excel in extracting high-level features from data, especially in image and video analysis. They are well-suited for identifying patterns and relationships within the input data, producing a rich representation of the underlying structure.XGBoost for Class Prediction:XGBoost, an ensemble learning algorithm, excels in capturing complex relationships between features and the target variable. It builds an expressive tree ensemble that can effectively learn from large datasets and handle high-dimensional data.Combining CNN and XGBoost:The combination of CNN and XGBoost works in two phases:
Feature Extraction: The CNN is used as a feature extractor, learning the intricate patterns and relationships within the data. It transforms the raw input data into a more meaningful and compact representation.
Class Prediction: The XGBoost classifier receives the extracted features from the CNN and performs the actual classification task. It leverages its strong pattern recognition ability to predict the class labels for the input data.
By separating feature extraction from class prediction, this hybrid approach benefits from the strengths of both methods:
CNN Extracts Robust Features: The CNN's ability to identify meaningful patterns in the data ensures that the extracted features are relevant and informative for classification.
XGBoost Captures Complex Relationships: XGBoost's ability to handle complex relationships allows it to effectively learn from the extracted features, leading to more accurate class predictions.
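The two-phase pattern can be sketched with scikit-learn stand-ins: an MLP's hidden layer plays the role of the CNN feature extractor, and GradientBoostingClassifier stands in for XGBoost. The dataset and hyperparameters are made up; a real pipeline would use a convolutional network (e.g. PyTorch) and the xgboost library:

```python
# Two-phase hybrid sketch: a neural net learns features, boosted trees classify.
# Stand-ins: sklearn MLP (for the CNN) and GradientBoostingClassifier (for XGBoost).
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier
from sklearn.ensemble import GradientBoostingClassifier

X, y = make_classification(n_samples=400, n_features=20, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# Phase 1: train the network, then reuse its hidden layer as a feature extractor.
net = MLPClassifier(hidden_layer_sizes=(16,), max_iter=500, random_state=0)
net.fit(X_tr, y_tr)

def hidden_features(X):
    # ReLU activations of the single hidden layer (sklearn's default activation).
    return np.maximum(0, X @ net.coefs_[0] + net.intercepts_[0])

# Phase 2: boosted trees classify on the learned representation.
gbt = GradientBoostingClassifier(random_state=0)
gbt.fit(hidden_features(X_tr), y_tr)
print(gbt.score(hidden_features(X_te), y_te))
```

With a real CNN, "hidden_features" would instead return the activations of the penultimate layer, but the division of labour (learned representation in, tree ensemble out) is identical.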
In summary, the combination of CNN and XGBoost offers a synergistic approach to machine learning, leveraging the strengths of each method to achieve better generalization and accuracy, especially for complex tasks involving high-dimensional data.]]></description><link>online-vault/ml-concepts/models/hybrid-models.html</link><guid isPermaLink="false">Online Vault/ML concepts/Models/Hybrid Models.md</guid><pubDate>Tue, 16 Jan 2024 15:47:26 GMT</pubDate><enclosure url="." length="0" type="false"/><content:encoded>&lt;figure&gt;&lt;img src=&quot;.&quot;&gt;&lt;/figure&gt;</content:encoded></item><item><title><![CDATA[Deep SDM]]></title><description><![CDATA[<a href=".?query=tag:models" class="tag is-unresolved" target="_self" rel="noopener nofollow" data-href="#models">#models</a><br>
Deep Species Distribution Modeling (Deep-SDM) is a numerical tool that combines the capabilities of species distribution models (SDMs) and deep learning to predict the ecological preferences and potential distributions of species based on correlations between geolocated presences (and possibly absences) and environmental predictors&nbsp;<a data-tooltip-position="top" aria-label="https://www.frontiersin.org/articles/10.3389/fpls.2022.839327/full" rel="noopener nofollow" class="external-link is-unresolved" href="https://www.frontiersin.org/articles/10.3389/fpls.2022.839327/full" target="_self">2</a>. The key advantage of Deep-SDMs is their ability to capture the spatial and temporal structure of landscapes.]]></description><link>online-vault/ml-concepts/models/deep-sdm.html</link><guid isPermaLink="false">Online Vault/ML concepts/Models/Deep SDM.md</guid><pubDate>Tue, 16 Jan 2024 15:22:50 GMT</pubDate></item><item><title><![CDATA[local_predictions]]></title><description><![CDATA[<img src="online-vault/images/local_predictions.png" target="_self">]]></description><link>online-vault/images/local_predictions.html</link><guid isPermaLink="false">Online Vault/Images/local_predictions.png</guid><pubDate>Mon, 15 Jan 2024 08:50:32 GMT</pubDate><enclosure url="." length="0" type="false"/><content:encoded>&lt;figure&gt;&lt;img src=&quot;.&quot;&gt;&lt;/figure&gt;</content:encoded></item><item><title><![CDATA[local_correct]]></title><description><![CDATA[<img src="online-vault/images/local_correct.png" target="_self">]]></description><link>online-vault/images/local_correct.html</link><guid isPermaLink="false">Online Vault/Images/local_correct.png</guid><pubDate>Mon, 15 Jan 2024 08:50:22 GMT</pubDate><enclosure url="." 
length="0" type="false"/><content:encoded>&lt;figure&gt;&lt;img src=&quot;.&quot;&gt;&lt;/figure&gt;</content:encoded></item><item><title><![CDATA[predicted_map_xgb]]></title><description><![CDATA[<img src="online-vault/images/predicted_map_xgb.png" target="_self">]]></description><link>online-vault/images/predicted_map_xgb.html</link><guid isPermaLink="false">Online Vault/Images/predicted_map_xgb.png</guid><pubDate>Mon, 15 Jan 2024 08:49:20 GMT</pubDate><enclosure url="." length="0" type="false"/><content:encoded>&lt;figure&gt;&lt;img src=&quot;.&quot;&gt;&lt;/figure&gt;</content:encoded></item><item><title><![CDATA[Expected_map]]></title><description><![CDATA[<img src="online-vault/images/expected_map.png" target="_self">]]></description><link>online-vault/images/expected_map.html</link><guid isPermaLink="false">Online Vault/Images/Expected_map.png</guid><pubDate>Mon, 15 Jan 2024 08:48:47 GMT</pubDate><enclosure url="." length="0" type="false"/><content:encoded>&lt;figure&gt;&lt;img src=&quot;.&quot;&gt;&lt;/figure&gt;</content:encoded></item><item><title><![CDATA[Plant intelligence metaheuristic optimization algorithms]]></title><description><![CDATA[<img src="online-vault/images/plant-intelligence-metaheuristic-optimization-algorithms.png" target="_self">]]></description><link>online-vault/images/plant-intelligence-metaheuristic-optimization-algorithms.html</link><guid isPermaLink="false">Online Vault/Images/Plant intelligence metaheuristic optimization algorithms.png</guid><pubDate>Wed, 20 Dec 2023 14:32:09 GMT</pubDate><enclosure url="." 
length="0" type="false"/><content:encoded>&lt;figure&gt;&lt;img src=&quot;.&quot;&gt;&lt;/figure&gt;</content:encoded></item><item><title><![CDATA[metaheuristic methods]]></title><description><![CDATA[<img src="online-vault/images/metaheuristic-methods.png" target="_self">]]></description><link>online-vault/images/metaheuristic-methods.html</link><guid isPermaLink="false">Online Vault/Images/metaheuristic methods.png</guid><pubDate>Wed, 20 Dec 2023 14:31:05 GMT</pubDate><enclosure url="." length="0" type="false"/><content:encoded>&lt;figure&gt;&lt;img src=&quot;.&quot;&gt;&lt;/figure&gt;</content:encoded></item><item><title><![CDATA[Particle Swarm Optimization]]></title><description><![CDATA[Particle swarm optimization (PSO) is a <a data-href="Metaheuristic Optimization" href="online-vault/exploration/metaheuristic-optimization.html" class="internal-link" target="_self" rel="noopener nofollow">Metaheuristic Optimization</a> algorithm inspired by the social behavior of bird flocks or fish schools. It simulates the collective behavior of these groups to find the optimal solution to a problem.<br><img alt="pso9.gif" src="online-vault/images/pso9.gif" target="_self"><br>
(source : <a rel="noopener nofollow" class="external-link is-unresolved" href="https://machinelearningmastery.com/a-gentle-introduction-to-particle-swarm-optimization/" target="_self">https://machinelearningmastery.com/a-gentle-introduction-to-particle-swarm-optimization/</a>)Key Concepts of PSO:
Particles: Each particle represents a potential solution to the problem. It has a position and a velocity, which are continuously updated to guide it towards better solutions.
Global Best and Personal Best: Each particle keeps track of its own best solution (personal best) and the best solution found by any particle in the swarm (global best). These values are crucial for guiding the particles towards better solutions.
Velocity Update: The velocity of each particle is updated based on its current position, its personal best, and the global best. This update rule allows the particles to explore the search space and converge towards the optimal solution.
Inertia, Cognitive, and Social Components: The velocity update rule combines three components: inertia, cognitive, and social. Inertia keeps a particle moving in its current direction, the cognitive component pulls it towards its personal best, and the social component pulls it towards the global best.
Steps of PSO:
Initialize Population: Randomly generate a population of particles, each with a position and velocity.
Evaluate Fitness: Calculate the fitness value (quality) of each particle's solution.
Update Personal Best: If a particle's current fitness is better than its personal best, update its personal best position.
Update Global Best: If a particle's current fitness is better than the global best, update the global best position.
Update Velocities: Update the velocities of each particle using the velocity update rule, incorporating the three components.
Repeat Steps 2-5: Repeat the process for a specified number of iterations or until a stopping criterion is met.
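The loop above fits in a few lines of pure Python. Below is a minimal 1-D sketch; the objective function, bounds, and the coefficients w, c1, c2 are illustrative defaults, not values prescribed by this note:

```python
import random

def pso(f, n_particles=20, n_iters=100, bounds=(-10, 10),
        w=0.7, c1=1.5, c2=1.5):
    """Minimise f over a 1-D interval with a basic particle swarm."""
    lo, hi = bounds
    pos = [random.uniform(lo, hi) for _ in range(n_particles)]
    vel = [0.0] * n_particles
    pbest = pos[:]                        # personal best positions
    pbest_val = [f(p) for p in pos]
    g = min(range(n_particles), key=lambda i: pbest_val[i])
    gbest, gbest_val = pbest[g], pbest_val[g]   # global best

    for _ in range(n_iters):
        for i in range(n_particles):
            r1, r2 = random.random(), random.random()
            # inertia + cognitive (personal best) + social (global best)
            vel[i] = (w * vel[i]
                      + c1 * r1 * (pbest[i] - pos[i])
                      + c2 * r2 * (gbest - pos[i]))
            pos[i] = min(max(pos[i] + vel[i], lo), hi)
            val = f(pos[i])
            if val < pbest_val[i]:
                pbest[i], pbest_val[i] = pos[i], val
                if val < gbest_val:
                    gbest, gbest_val = pos[i], val
    return gbest, gbest_val
```

On a smooth unimodal function such as f(x) = (x - 3)² this converges to the minimum within a few dozen iterations; multimodal functions typically need more particles and iterations.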
Benefits of PSO:
Simplicity and Efficiency: PSO is relatively easy to implement and computationally efficient, making it applicable to a wide range of problems.
Relevance to Nature: The biological inspiration makes it intuitive and easier to understand, compared to other optimization algorithms.
Adaptability to Search Space: PSO can effectively navigate complex search spaces, making it suitable for problems with multiple local optima.
Applications of PSO:
Optimization: PSO has been widely used in various optimization problems, including engineering design, resource allocation, and financial modeling.
Machine Learning: PSO has been employed to optimize hyperparameters of machine learning models, such as the learning rate of neural networks.
Scheduling and Routing: PSO has been applied to optimize scheduling tasks and find efficient routing solutions in transportation problems.
Overall, particle swarm optimization is a versatile and powerful metaheuristic algorithm that has found widespread applications in various fields. Its simplicity, efficiency, and adaptability make it a valuable tool for solving optimization problems across diverse domains.]]></description><link>online-vault/exploration/particle-swarm-optimization.html</link><guid isPermaLink="false">Online Vault/Exploration/Particle Swarm Optimization.md</guid><pubDate>Wed, 20 Dec 2023 14:23:17 GMT</pubDate><enclosure url="." length="0" type="false"/><content:encoded>&lt;figure&gt;&lt;img src=&quot;.&quot;&gt;&lt;/figure&gt;</content:encoded></item><item><title><![CDATA[Fungal Kingdom Expansion Algorithm]]></title><description><![CDATA[There is a novel optimization algorithm inspired by the growth and self-organizing behavior of mycelium networks known as the FKE algorithm (Fungal Kingdom Expansion Algorithm). This algorithm mimics the adaptive and exploratory nature of mycelium networks to find solutions to optimization problems.Key Concepts of the FKE Algorithm:
Mycelial Network Representation: The problem space is represented as a network of nodes and edges, analogous to the interconnected structure of mycelium networks.
Hypha Expansion: Nodes in the network represent potential solutions, and edges represent connections between solutions. Hyphae, the filamentous structures of fungi, are used to represent search trajectories in the network.
Energy Evaluation: Each hypha is assigned an energy value, which represents its fitness or desirability as a solution. The algorithm aims to find the hyphal network with the lowest overall energy.
Mass Expansion and Energy Update: Hyphae grow or shrink based on their energy values, guided by a mass expansion mechanism. This simulates the adaptive growth of mycelium networks towards favorable conditions.
Global Best and Local Best: The algorithm maintains two best solutions: the global best, which represents the overall best solution found so far, and the local best, which represents the best solution found within a local region of the network.
Steps of the FKE Algorithm:
Initialize Network: Randomly generate a network of nodes and edges, representing the problem space.
Assign Initial Energies: Assign random energy values to each node in the network.
Hypha Growth: Randomly select a node and create a new hypha from it. The hypha extends along the edges of the network, growing towards nodes with lower energy values.
Energy Update: Update the energy values of the nodes along the hypha based on their connections to other nodes.
Mass Expansion: Apply mass expansion to the hypha, where the hypha grows or shrinks depending on the energy values of the nodes it touches.
Check Termination: If the desired convergence criteria are met, stop the algorithm. Otherwise, repeat steps 3-5.
Select Global and Local Best: Update the global and local best solutions based on the current network configuration.
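Since no reference implementation is given here, the steps above can only be sketched loosely. The toy code below (the function name, graph setup, and parameters are all invented for illustration) grows greedy "hyphae" toward lower-energy neighbours on a random network:

```python
import random

def fke_sketch(n_nodes=30, n_hyphae=50, seed=0):
    """Toy sketch of the hypha-growth idea described in this note: on a
    random graph with per-node energies, hyphae repeatedly extend toward
    their lowest-energy neighbour.  Illustrative only -- not a reference
    FKE implementation (this note does not provide one)."""
    rng = random.Random(seed)
    energy = [rng.random() for _ in range(n_nodes)]   # fitness of each node
    # random sparse network: each node connects to 3 others
    adj = {i: rng.sample([j for j in range(n_nodes) if j != i], 3)
           for i in range(n_nodes)}
    best_node, best_energy = None, float("inf")
    for _ in range(n_hyphae):
        node = rng.randrange(n_nodes)          # start a new hypha
        while True:                            # grow toward lower energy
            nxt = min(adj[node], key=energy.__getitem__)
            if energy[nxt] >= energy[node]:
                break                          # local minimum reached
            node = nxt
        if energy[node] < best_energy:         # update global best
            best_node, best_energy = node, energy[node]
    return best_node, best_energy
```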
Benefits of the FKE Algorithm:
Biomimicry: The algorithm draws inspiration from the natural behavior of mycelium networks, making it more biologically plausible and adaptable to complex problems.
Self-Organizing Nature: The FKE algorithm exhibits self-organizing properties, allowing it to adapt and refine solutions without explicit guidance.
Parallel Processing: The algorithm's parallel nature enables efficient exploration of the search space, making it suitable for large-scale problems.
Potential for Innovation: The FKE algorithm has the potential to lead to novel optimization techniques for a wide range of problems.
The FKE algorithm offers a promising approach to optimization, inspired by the remarkable adaptability and efficiency of mycelium networks. Its self-organizing nature, parallel processing capabilities, and biological inspiration make it a valuable tool for tackling complex optimization problems in various domains.Other bio-inspired papers :Biology-based algorithms are categorized into two dominant groups: evolutionary algorithms and bio-based/swarm intelligence techniques. Evolutionary algorithms simulate Darwin’s theory of evolution. A genetic algorithm (GA) was the first evolutionary algorithm, proposed by John Holland <a data-tooltip-position="top" aria-label="https://www.mdpi.com/2079-9292/10/17/2057#B4-electronics-10-02057" rel="noopener nofollow" class="external-link is-unresolved" href="https://www.mdpi.com/2079-9292/10/17/2057#B4-electronics-10-02057" target="_self">4</a>. On the other hand, the second group of biology-based algorithms includes bio/swarm intelligence-based algorithms that can be sub-categorized into seven classes based on the behavior of:
<br>Wild animals, like the grey wolf optimizer <a data-tooltip-position="top" aria-label="https://www.mdpi.com/2079-9292/10/17/2057#B5-electronics-10-02057" rel="noopener nofollow" class="external-link is-unresolved" href="https://www.mdpi.com/2079-9292/10/17/2057#B5-electronics-10-02057" target="_self">5</a>, the camel algorithm <a data-tooltip-position="top" aria-label="https://www.mdpi.com/2079-9292/10/17/2057#B6-electronics-10-02057" rel="noopener nofollow" class="external-link is-unresolved" href="https://www.mdpi.com/2079-9292/10/17/2057#B6-electronics-10-02057" target="_self">6</a>, and the wild horse optimizer <a data-tooltip-position="top" aria-label="https://www.mdpi.com/2079-9292/10/17/2057#B7-electronics-10-02057" rel="noopener nofollow" class="external-link is-unresolved" href="https://www.mdpi.com/2079-9292/10/17/2057#B7-electronics-10-02057" target="_self">7</a>. <br>Aquatic animals, such as the <a data-tooltip-position="top" aria-label="https://www.mdpi.com/2079-9292/10/17/2057#B1-electronics-10-02057" rel="noopener nofollow" class="external-link is-unresolved" href="https://www.mdpi.com/2079-9292/10/17/2057#B1-electronics-10-02057" target="_self">whale optimization algorithm</a> and <a data-tooltip-position="top" aria-label="https://www.mdpi.com/2079-9292/10/17/2057#B8-electronics-10-02057" rel="noopener nofollow" class="external-link is-unresolved" href="https://www.mdpi.com/2079-9292/10/17/2057#B8-electronics-10-02057" target="_self">salp swarm search</a>. 
<br>Insects like ant colony optimization <a data-tooltip-position="top" aria-label="https://www.mdpi.com/2079-9292/10/17/2057#B9-electronics-10-02057" rel="noopener nofollow" class="external-link is-unresolved" href="https://www.mdpi.com/2079-9292/10/17/2057#B9-electronics-10-02057" target="_self">9</a>,<a data-tooltip-position="top" aria-label="https://www.mdpi.com/2079-9292/10/17/2057#B10-electronics-10-02057" rel="noopener nofollow" class="external-link is-unresolved" href="https://www.mdpi.com/2079-9292/10/17/2057#B10-electronics-10-02057" target="_self">10</a> and moth search algorithm <a data-tooltip-position="top" aria-label="https://www.mdpi.com/2079-9292/10/17/2057#B11-electronics-10-02057" rel="noopener nofollow" class="external-link is-unresolved" href="https://www.mdpi.com/2079-9292/10/17/2057#B11-electronics-10-02057" target="_self">11</a>. <br>Birds such as particle swarm optimization (PSO) <a data-tooltip-position="top" aria-label="https://www.mdpi.com/2079-9292/10/17/2057#B12-electronics-10-02057" rel="noopener nofollow" class="external-link is-unresolved" href="https://www.mdpi.com/2079-9292/10/17/2057#B12-electronics-10-02057" target="_self">12</a> which has widely been used for antenna applications in recent years <a data-tooltip-position="top" aria-label="https://www.mdpi.com/2079-9292/10/17/2057#B13-electronics-10-02057" rel="noopener nofollow" class="external-link is-unresolved" href="https://www.mdpi.com/2079-9292/10/17/2057#B13-electronics-10-02057" target="_self">13</a>,<a data-tooltip-position="top" aria-label="https://www.mdpi.com/2079-9292/10/17/2057#B14-electronics-10-02057" rel="noopener nofollow" class="external-link is-unresolved" href="https://www.mdpi.com/2079-9292/10/17/2057#B14-electronics-10-02057" target="_self">14</a>, bat algorithm <a data-tooltip-position="top" aria-label="https://www.mdpi.com/2079-9292/10/17/2057#B15-electronics-10-02057" rel="noopener nofollow" class="external-link is-unresolved" 
href="https://www.mdpi.com/2079-9292/10/17/2057#B15-electronics-10-02057" target="_self">15</a>, and the African vultures optimization algorithm <a data-tooltip-position="top" aria-label="https://www.mdpi.com/2079-9292/10/17/2057#B16-electronics-10-02057" rel="noopener nofollow" class="external-link is-unresolved" href="https://www.mdpi.com/2079-9292/10/17/2057#B16-electronics-10-02057" target="_self">16</a>. <br>Plants, such as the tree growth optimization algorithm <a data-tooltip-position="top" aria-label="https://www.mdpi.com/2079-9292/10/17/2057#B17-electronics-10-02057" rel="noopener nofollow" class="external-link is-unresolved" href="https://www.mdpi.com/2079-9292/10/17/2057#B17-electronics-10-02057" target="_self">17</a> and the <a data-tooltip-position="top" aria-label="https://www.mdpi.com/2079-9292/10/17/2057#B3-electronics-10-02057" rel="noopener nofollow" class="external-link is-unresolved" href="https://www.mdpi.com/2079-9292/10/17/2057#B3-electronics-10-02057" target="_self">smart flower optimization algorithm</a>. <br>Viruses, such as virus colony search <a data-tooltip-position="top" aria-label="https://www.mdpi.com/2079-9292/10/17/2057#B18-electronics-10-02057" rel="noopener nofollow" class="external-link is-unresolved" href="https://www.mdpi.com/2079-9292/10/17/2057#B18-electronics-10-02057" target="_self">18</a> and the coronavirus herd immunity optimizer <a data-tooltip-position="top" aria-label="https://www.mdpi.com/2079-9292/10/17/2057#B19-electronics-10-02057" rel="noopener nofollow" class="external-link is-unresolved" href="https://www.mdpi.com/2079-9292/10/17/2057#B19-electronics-10-02057" target="_self">19</a>. 
<br>Human body parts such as heart optimization algorithm <a data-tooltip-position="top" aria-label="https://www.mdpi.com/2079-9292/10/17/2057#B20-electronics-10-02057" rel="noopener nofollow" class="external-link is-unresolved" href="https://www.mdpi.com/2079-9292/10/17/2057#B20-electronics-10-02057" target="_self">20</a> and kidney algorithm <a data-tooltip-position="top" aria-label="https://www.mdpi.com/2079-9292/10/17/2057#B21-electronics-10-02057" rel="noopener nofollow" class="external-link is-unresolved" href="https://www.mdpi.com/2079-9292/10/17/2057#B21-electronics-10-02057" target="_self">21</a>.
]]></description><link>online-vault/exploration/fungal-kingdom-expansion-algorithm.html</link><guid isPermaLink="false">Online Vault/Exploration/Fungal Kingdom Expansion Algorithm.md</guid><pubDate>Mon, 18 Dec 2023 15:06:16 GMT</pubDate></item><item><title><![CDATA[pso9]]></title><description><![CDATA[<img src="online-vault/images/pso9.gif" target="_self">]]></description><link>online-vault/images/pso9.html</link><guid isPermaLink="false">Online Vault/Images/pso9.gif</guid><pubDate>Mon, 18 Dec 2023 14:02:09 GMT</pubDate><enclosure url="." length="0" type="false"/><content:encoded>&lt;figure&gt;&lt;img src=&quot;.&quot;&gt;&lt;/figure&gt;</content:encoded></item><item><title><![CDATA[ConvXGB]]></title><description><![CDATA[<img src="online-vault/images/convxgb.png" target="_self">]]></description><link>online-vault/images/convxgb.html</link><guid isPermaLink="false">Online Vault/Images/ConvXGB.png</guid><pubDate>Mon, 18 Dec 2023 13:43:29 GMT</pubDate><enclosure url="." length="0" type="false"/><content:encoded>&lt;figure&gt;&lt;img src=&quot;.&quot;&gt;&lt;/figure&gt;</content:encoded></item><item><title><![CDATA[LightGBM]]></title><description><![CDATA[LightGBM and <a data-href="XGBoost" href="online-vault/ml-concepts/models/xgboost.html" class="internal-link" target="_self" rel="noopener nofollow">XGBoost</a> are both <a data-href="Gradient Boosting" href="online-vault/ml-concepts/models/gradient-boosting.html" class="internal-link" target="_self" rel="noopener nofollow">Gradient Boosting</a> machines (GBMs) that are widely used for machine learning tasks such as classification and regression. They are both very powerful algorithms and can achieve state-of-the-art results on many datasets.CommonalitiesBoth algorithms are based on the concept of gradient boosting, which is an ensemble learning method that combines multiple weak learners to create a strong learner. 
The weak learners in both algorithms are decision trees, which are simple tree-like structures that can be used to represent complex relationships in data.DifferencesDespite their similarities, there are some key differences between LightGBM and XGBoost.Tree Growth
XGBoost: By default, XGBoost grows trees level-wise (horizontally): it expands every node at the current depth before moving on to the next level. This keeps trees balanced, which makes them more robust to overfitting but can spend splits on leaves that barely reduce the loss. LightGBM: LightGBM grows trees leaf-wise (vertically): at each step it splits the single leaf that yields the greatest reduction in loss. This drives the loss down faster and is very efficient, but the resulting deeper, asymmetric trees can overfit on small datasets unless the number of leaves or the maximum depth is constrained. In the context of gradient boosting, tree growth refers to the process of adding new nodes or leaves to a decision tree, and there are two main strategies.Vertical (leaf-wise) growth extends the tree one leaf at a time, deepening it wherever the payoff is largest; this is the approach used by LightGBM, which picks the best split globally, regardless of the level of the tree.Horizontal (level-wise) growth expands the tree one full level at a time, making it wider while keeping it balanced; this is the approach used by XGBoost by default.Leaf-wise growth therefore tends to produce deeper, narrower trees, while level-wise growth tends to produce shallower, wider, well-balanced trees, a structural difference with implications for both accuracy and robustness.Leaf-wise vs. Level-wise GrowthIn practice, leaf-wise growth (LightGBM) usually reaches a lower loss with fewer leaves, which makes it faster and often more accurate, especially on large datasets. Level-wise growth (XGBoost's default) evaluates splits across an entire level, which is more computationally expensive but helps prevent overfitting by keeping each level of the tree well-balanced. In general, leaf-wise growth tends to lead to more accurate models, while level-wise growth tends to lead to more stable models.Gradient-based One-Side Sampling (GOSS)
LightGBM: LightGBM offers Gradient-based One-Side Sampling (GOSS), which keeps all instances with large gradients and randomly samples from those with small gradients, re-weighting the sample so the estimated information gain stays approximately unbiased. This reduces the amount of data scanned per iteration and speeds up training with little loss in accuracy. XGBoost: XGBoost does not implement GOSS. Memory Consumption
LightGBM: LightGBM is generally more memory-efficient than XGBoost, making it better suited for training on large datasets.
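The GOSS idea mentioned above can be sketched in pure Python. This is a simplified illustration of the sampling step only, not LightGBM's actual implementation; the fractions a and b are illustrative values:

```python
import random

def goss_sample(gradients, a=0.2, b=0.1):
    """Gradient-based One-Side Sampling (simplified sketch).

    Keep the top `a` fraction of instances by |gradient|, randomly sample
    a `b` fraction of the remaining instances, and up-weight the sampled
    small-gradient instances by (1 - a) / b so that gradient statistics
    stay approximately unbiased."""
    n = len(gradients)
    order = sorted(range(n), key=lambda i: abs(gradients[i]), reverse=True)
    top_k = int(a * n)
    large = order[:top_k]                      # always kept, weight 1
    rest = order[top_k:]
    sampled = random.sample(rest, int(b * n))  # small-gradient sample
    weights = {i: 1.0 for i in large}
    weights.update({i: (1 - a) / b for i in sampled})
    return weights   # instance index -> weight used when building the tree
```

Only the returned subset of instances is scanned when fitting the next tree, which is where the speed-up comes from.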
PerformanceIn general, LightGBM is slightly faster than XGBoost, especially for large datasets. However, the performance difference is not always significant, and both algorithms can achieve very good results.InterpretabilityXGBoost is generally more interpretable than LightGBM, as it is easier to understand the decision rules that are learned by the trees. However, LightGBM has some features that can make it more interpretable, such as the use of feature importance measures.ConclusionThe choice between LightGBM and XGBoost depends on the specific use case and data characteristics. LightGBM is generally a good choice for large datasets and when computational efficiency is important. XGBoost may be a better choice for smaller datasets or when interpretability is crucial. Memory efficiency and performance depend on the hyperparameters, so take these generalizations with a grain of salt.Both LightGBM and XGBoost support GPU acceleration, which can significantly improve performance for large datasets. However, LightGBM is generally considered to be more GPU-friendly than XGBoost: its GPU implementation is more efficient, and it supports a wider range of GPUs. If you are working with large datasets on a GPU-equipped machine, LightGBM is often the better choice; on a machine with limited GPU resources, XGBoost may be a better option.Here are some additional things to keep in mind when choosing between LightGBM and XGBoost for GPU acceleration:
The specific GPU model you are using may affect the performance of both algorithms.
The size and complexity of your dataset may also affect the performance of both algorithms.
You may need to adjust the hyperparameters of both algorithms to get the best performance on your GPU.
]]></description><link>online-vault/ml-concepts/models/lightgbm.html</link><guid isPermaLink="false">Online Vault/ML concepts/Models/LightGBM.md</guid><pubDate>Mon, 18 Dec 2023 09:39:49 GMT</pubDate></item><item><title><![CDATA[Gradient Boosting]]></title><description><![CDATA[Gradient boosting is a supervised learning algorithm that combines multiple weak learners to form a strong learner. It is a powerful technique for both classification and regression tasks, and it is particularly well-suited for handling large, complex datasets.Overview of Gradient BoostingGradient boosting builds an ensemble of weak learners, which are simple models that are individually not very good at predicting the target variable. However, by combining these weak learners, gradient boosting can achieve much better performance than any of the individual learners could on their own.The key to gradient boosting is that each weak learner is trained to correct the errors of the previous learners. This process is repeated until the desired level of performance is achieved.Types of Weak Learners in Gradient BoostingThe most common weak learners used in gradient boosting are decision trees. Decision trees are simple tree-like structures that can be used to represent complex relationships in data. They are easy to understand and interpret, and they can be very effective at predicting the target variable.Stages of Gradient BoostingGradient boosting works in stages:
Initialize the prediction: Start with an initial prediction for the target variable. This could be the mean of the target variable or some other simple estimate.
Train a weak learner: Train a weak learner to predict the residual between the current prediction and the actual target values.
Update the prediction: Adjust the current prediction by adding the prediction of the weak learner.
Repeat: Repeat steps 2 and 3 until the desired level of performance is achieved.
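The four stages above can be sketched in plain Python for squared-error regression, using one-split decision stumps as the weak learners. This is a toy illustration; real libraries add regularization, subsampling, and far more efficient split finding:

```python
def fit_stump(x, residuals):
    """Find the threshold on x that best splits residuals (least squared error)."""
    best = None
    for t in sorted(set(x)):
        left = [r for xi, r in zip(x, residuals) if xi <= t]
        right = [r for xi, r in zip(x, residuals) if xi > t]
        if not left or not right:
            continue
        lm, rm = sum(left) / len(left), sum(right) / len(right)
        err = (sum((r - lm) ** 2 for r in left)
               + sum((r - rm) ** 2 for r in right))
        if best is None or err < best[0]:
            best = (err, t, lm, rm)
    _, t, lm, rm = best
    return lambda xi: lm if xi <= t else rm

def gradient_boost(x, y, n_rounds=300, lr=0.1):
    # Stage 1: initialize the prediction with the mean of the target
    f0 = sum(y) / len(y)
    pred = [f0] * len(y)
    stumps = []
    for _ in range(n_rounds):
        # Stage 2: train a weak learner on the residuals
        residuals = [yi - pi for yi, pi in zip(y, pred)]
        stump = fit_stump(x, residuals)
        stumps.append(stump)
        # Stage 3: update the prediction (lr is the shrinkage / learning rate)
        pred = [pi + lr * stump(xi) for pi, xi in zip(pred, x)]
        # Stage 4: repeat until n_rounds is reached
    return lambda xi: f0 + lr * sum(s(xi) for s in stumps)
```

Each round fits a stump to what the current ensemble still gets wrong, so the training error shrinks round by round; the learning rate trades convergence speed for robustness.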
Advantages of Gradient BoostingGradient boosting has several advantages over other machine learning algorithms:
Robustness to outliers: Gradient boosting is relatively robust to outliers, which can be a problem with other algorithms such as linear regression. Handles non-linear relationships: Gradient boosting can handle non-linear relationships between the features and the target variable, which is a strength that many other algorithms lack. Can handle large datasets: Gradient boosting is well-suited for handling large, complex datasets. Efficient implementation: Gradient boosting can be implemented efficiently, making it a good choice for large-scale applications. Applications of Gradient BoostingGradient boosting is a versatile algorithm that can be used for a wide variety of tasks, including:
Predicting customer churn Optimizing website traffic Fraud detection Medical diagnosis Recommender systems If you are working with a complex dataset and you need a robust and accurate machine learning algorithm, gradient boosting is a great option to consider.]]></description><link>online-vault/ml-concepts/models/gradient-boosting.html</link><guid isPermaLink="false">Online Vault/ML concepts/Models/Gradient Boosting.md</guid><pubDate>Mon, 18 Dec 2023 09:26:50 GMT</pubDate></item><item><title><![CDATA[gpu_predict_batchsize]]></title><description><![CDATA[<img src="online-vault/images/gpu_predict_batchsize.png" target="_self">]]></description><link>online-vault/images/gpu_predict_batchsize.html</link><guid isPermaLink="false">Online Vault/Images/gpu_predict_batchsize.png</guid><pubDate>Wed, 06 Dec 2023 16:15:44 GMT</pubDate><enclosure url="." length="0" type="false"/><content:encoded>&lt;figure&gt;&lt;img src=&quot;.&quot;&gt;&lt;/figure&gt;</content:encoded></item><item><title><![CDATA[Spatial Data Science]]></title><description><![CDATA[Tool used : <a data-tooltip-position="top" aria-label="https://geodacenter.github.io/" rel="noopener nofollow" class="external-link is-unresolved" href="https://geodacenter.github.io/" target="_self">GeoDa</a>Spatial correlation :
measure of clustering in space using Moran's I
Autocorrelation as a measure of spatial correlation.
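Moran's I can be computed directly from a value vector and a spatial weights matrix. A minimal sketch, assuming a dense, row-unstandardized weight matrix (GeoDa typically row-standardizes its weights, which changes the scaling but not the idea):

```python
def morans_i(values, weights):
    """Moran's I for a list of values and a spatial weight matrix
    (weights[i][j] > 0 when locations i and j are neighbours)."""
    n = len(values)
    mean = sum(values) / n
    dev = [v - mean for v in values]                      # deviations from mean
    num = sum(weights[i][j] * dev[i] * dev[j]             # cross-products of
              for i in range(n) for j in range(n))        # neighbouring deviations
    den = sum(d * d for d in dev)
    w_sum = sum(weights[i][j] for i in range(n) for j in range(n))
    return (n / w_sum) * (num / den)
```

For four locations in a line with values 1..4 and contiguity weights (neighbours = adjacent cells), this gives I = 1/3: positive spatial autocorrelation, i.e. similar values cluster together.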
Ways to get data for species distribution : volunteered observation data, such as Pl@ntNet, gathered through participatory collaboration.Spatial Data Science :
Takes location, distance, and spatial interactions into account as core concepts, as opposed to regular data science, which does not treat them as thoroughly.Data types :
Points, Lines, Polygons.Visual representation on maps is better because we are very good at detecting visual patterns, whereas we are poor at reading tabular data. This helps with knowledge discovery when reading graphs, maps or any data viz.Categories on a map should reflect the underlying distribution so that the map makes sense.Types of maps : Percentile maps : a quantile map whose class breaks emphasize the extremes and outliers. Use a diverging colormap.<br>
<img alt="percentile_map.png" src="online-vault/images/percentile_map.png" target="_self"> Box map : similar to box plots but for maps. The split into categories uses percentiles as well. <br><img alt="boxmap.png" src="online-vault/images/boxmap.png" target="_self">We can also split by standard deviation with standard deviational maps. Curse of dimensionality : the larger the number of variables describing our points, the larger the space between them, leading to sparsity. This makes most brute-force algorithms useless in high dimensions.Exploratory methods do not explain; they only suggest hypotheses and interesting patterns. It is also difficult to quantify their uncertainty, and they provide correlation, not causation.We can represent more than 2 or 3 dimensions in a scatter plot using color, size and other characteristics of our points. As a result we can visualize, for example, 4-dimensional data in a 2D scatterplot and analyze it visually instead of analytically. These are also called Bubble plots.Parallel Coordinate Plot (PCP) :
Axes as parallel lines instead of orthogonal for multi-dimensional data. Observations are shown as lines between variables instead of points like such :<br><img alt="parralel_coord_plot.png" src="online-vault/images/parralel_coord_plot.png" target="_self">Lines that are close together and parallel represent clusters in multidimensional data space as we can see in this following plot :<br><img alt="cluster_pcp.png" src="online-vault/images/cluster_pcp.png" target="_self">]]></description><link>online-vault/spatial-data-science/spatial-data-science.html</link><guid isPermaLink="false">Online Vault/Spatial Data Science/Spatial Data Science.md</guid><pubDate>Wed, 06 Dec 2023 08:15:47 GMT</pubDate><enclosure url="." length="0" type="false"/><content:encoded>&lt;figure&gt;&lt;img src=&quot;.&quot;&gt;&lt;/figure&gt;</content:encoded></item><item><title><![CDATA[cluster_pcp]]></title><description><![CDATA[<img src="online-vault/images/cluster_pcp.png" target="_self">]]></description><link>online-vault/images/cluster_pcp.html</link><guid isPermaLink="false">Online Vault/Images/cluster_pcp.png</guid><pubDate>Tue, 05 Dec 2023 15:36:02 GMT</pubDate><enclosure url="." length="0" type="false"/><content:encoded>&lt;figure&gt;&lt;img src=&quot;.&quot;&gt;&lt;/figure&gt;</content:encoded></item><item><title><![CDATA[parralel_coord_plot]]></title><description><![CDATA[<img src="online-vault/images/parralel_coord_plot.png" target="_self">]]></description><link>online-vault/images/parralel_coord_plot.html</link><guid isPermaLink="false">Online Vault/Images/parralel_coord_plot.png</guid><pubDate>Tue, 05 Dec 2023 15:28:31 GMT</pubDate><enclosure url="." 
length="0" type="false"/><content:encoded>&lt;figure&gt;&lt;img src=&quot;.&quot;&gt;&lt;/figure&gt;</content:encoded></item><item><title><![CDATA[percentile_map]]></title><description><![CDATA[<img src="online-vault/images/percentile_map.png" target="_self">]]></description><link>online-vault/images/percentile_map.html</link><guid isPermaLink="false">Online Vault/Images/percentile_map.png</guid><pubDate>Tue, 05 Dec 2023 10:50:37 GMT</pubDate><enclosure url="." length="0" type="false"/><content:encoded>&lt;figure&gt;&lt;img src=&quot;.&quot;&gt;&lt;/figure&gt;</content:encoded></item><item><title><![CDATA[boxmap]]></title><description><![CDATA[<img src="online-vault/images/boxmap.png" target="_self">]]></description><link>online-vault/images/boxmap.html</link><guid isPermaLink="false">Online Vault/Images/boxmap.png</guid><pubDate>Tue, 05 Dec 2023 10:49:10 GMT</pubDate><enclosure url="." length="0" type="false"/><content:encoded>&lt;figure&gt;&lt;img src=&quot;.&quot;&gt;&lt;/figure&gt;</content:encoded></item><item><title><![CDATA[Umap]]></title><description><![CDATA[<img src="online-vault/images/umap.png" target="_self">]]></description><link>online-vault/images/umap.html</link><guid isPermaLink="false">Online Vault/Images/Umap.png</guid><pubDate>Wed, 15 Nov 2023 09:01:35 GMT</pubDate><enclosure url="." length="0" type="false"/><content:encoded>&lt;figure&gt;&lt;img src=&quot;.&quot;&gt;&lt;/figure&gt;</content:encoded></item><item><title><![CDATA[Jupyter_2]]></title><description><![CDATA[<img src="online-vault/tutoriels/jupyter_2.png" target="_self">]]></description><link>online-vault/tutoriels/jupyter_2.html</link><guid isPermaLink="false">Online Vault/Tutoriels/Jupyter_2.png</guid><pubDate>Mon, 30 Oct 2023 14:37:32 GMT</pubDate><enclosure url="." 
length="0" type="false"/><content:encoded>&lt;figure&gt;&lt;img src=&quot;.&quot;&gt;&lt;/figure&gt;</content:encoded></item><item><title><![CDATA[Jupyter_1]]></title><description><![CDATA[<img src="online-vault/tutoriels/jupyter_1.png" target="_self">]]></description><link>online-vault/tutoriels/jupyter_1.html</link><guid isPermaLink="false">Online Vault/Tutoriels/Jupyter_1.png</guid><pubDate>Mon, 30 Oct 2023 14:32:46 GMT</pubDate><enclosure url="." length="0" type="false"/><content:encoded>&lt;figure&gt;&lt;img src=&quot;.&quot;&gt;&lt;/figure&gt;</content:encoded></item><item><title><![CDATA[Pasted image 20231019140742]]></title><description><![CDATA[<img src="online-vault/images/pasted-image-20231019140742.png" target="_self">]]></description><link>online-vault/images/pasted-image-20231019140742.html</link><guid isPermaLink="false">Online Vault/Images/Pasted image 20231019140742.png</guid><pubDate>Thu, 19 Oct 2023 12:07:42 GMT</pubDate><enclosure url="." length="0" type="false"/><content:encoded>&lt;figure&gt;&lt;img src=&quot;.&quot;&gt;&lt;/figure&gt;</content:encoded></item><item><title><![CDATA[Random forest]]></title><link>online-vault/ml-concepts/models/random-forest.html</link><guid isPermaLink="false">Online Vault/ML concepts/Models/Random forest.md</guid><pubDate>Thu, 19 Oct 2023 09:38:41 GMT</pubDate></item><item><title><![CDATA[Pasted image 20231019104315]]></title><description><![CDATA[<img src="online-vault/images/pasted-image-20231019104315.png" target="_self">]]></description><link>online-vault/images/pasted-image-20231019104315.html</link><guid isPermaLink="false">Online Vault/Images/Pasted image 20231019104315.png</guid><pubDate>Thu, 19 Oct 2023 08:43:15 GMT</pubDate><enclosure url="." 
length="0" type="false"/><content:encoded>&lt;figure&gt;&lt;img src=&quot;.&quot;&gt;&lt;/figure&gt;</content:encoded></item><item><title><![CDATA[Logistic Regression]]></title><description><![CDATA[There are multiple ways to fit a line to data, the most common being <a data-href="Linear Regression" href=".html" class="internal-link" target="_self" rel="noopener nofollow">Linear Regression</a> for continuous variables. For discrete variables such as True/False outcomes (e.g. "Is this drug effective at dosage X?"), we can use <a data-href="Logistic Regression" href="online-vault/ml-concepts/models/logistic-regression.html" class="internal-link" target="_self" rel="noopener nofollow">Logistic Regression</a>, which fits a logistic function similar to:<br>
<img alt="Pasted image 20230927100711.png" src="online-vault/images/pasted-image-20230927100711.png" target="_self" style="width: 200px; max-width: 100%;"><br>
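As a rough illustration, such a logistic curve can be fitted by maximizing the log-likelihood with plain gradient ascent; the sketch below uses made-up dosage/effectiveness data, and all variable names are illustrative (in practice a library such as scikit-learn handles this fitting):

```python
import numpy as np

# Made-up data: dosages and whether the drug was effective (1) or not (0)
dosage = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0])
effective = np.array([0, 0, 0, 1, 0, 1, 1, 1])

def sigmoid(z):
    # Maps a log(odds) value back to a probability in (0, 1)
    return 1.0 / (1.0 + np.exp(-z))

b0, b1 = 0.0, 0.0  # intercept and slope of the line in log(odds) space
lr = 0.01          # step size, kept small for stability
for _ in range(20000):
    p = sigmoid(b0 + b1 * dosage)          # predicted probabilities
    # Gradient of the log-likelihood: sum(y*log p + (1-y)*log(1-p))
    b0 += lr * np.sum(effective - p)
    b1 += lr * np.sum((effective - p) * dosage)

log_likelihood = np.sum(effective * np.log(p)
                        + (1 - effective) * np.log(1 - p))
```

After fitting, the slope is positive and high dosages map to probabilities above 0.5, low dosages below, reproducing the S-shaped curve in the figure above.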
To map the 0 (False) and 1 (True) values onto a line (right part of the image), we project the points into a logarithmic space, the log(odds) space. These points are then projected onto the candidate line and transformed back into probabilities to compute the <a data-href="likelihood" href=".html" class="internal-link" target="_self" rel="noopener nofollow">likelihood</a> of that candidate line. The <a data-href="Maximum Likelihood" href=".html" class="internal-link" target="_self" rel="noopener nofollow">Maximum Likelihood</a> algorithm is used to find the line/logistic function that best fits the data. In practice, we maximize the log(likelihood) rather than the likelihood itself, which simplifies the computation.<br><img alt="Pasted image 20230927101123.png" src="online-vault/images/pasted-image-20230927101123.png" target="_self" style="width: 600px; max-width: 100%;">]]></description><link>online-vault/ml-concepts/models/logistic-regression.html</link><guid isPermaLink="false">Online Vault/ML concepts/Models/Logistic Regression.md</guid><pubDate>Wed, 27 Sep 2023 08:22:08 GMT</pubDate><enclosure url="." length="0" type="false"/><content:encoded>&lt;figure&gt;&lt;img src=&quot;.&quot;&gt;&lt;/figure&gt;</content:encoded></item><item><title><![CDATA[Pasted image 20230927101123]]></title><description><![CDATA[<img src="online-vault/images/pasted-image-20230927101123.png" target="_self">]]></description><link>online-vault/images/pasted-image-20230927101123.html</link><guid isPermaLink="false">Online Vault/Images/Pasted image 20230927101123.png</guid><pubDate>Wed, 27 Sep 2023 08:11:23 GMT</pubDate><enclosure url="." 
length="0" type="false"/><content:encoded>&lt;figure&gt;&lt;img src=&quot;.&quot;&gt;&lt;/figure&gt;</content:encoded></item><item><title><![CDATA[Pasted image 20230927100711]]></title><description><![CDATA[<img src="online-vault/images/pasted-image-20230927100711.png" target="_self">]]></description><link>online-vault/images/pasted-image-20230927100711.html</link><guid isPermaLink="false">Online Vault/Images/Pasted image 20230927100711.png</guid><pubDate>Wed, 27 Sep 2023 08:07:11 GMT</pubDate><enclosure url="." length="0" type="false"/><content:encoded>&lt;figure&gt;&lt;img src=&quot;.&quot;&gt;&lt;/figure&gt;</content:encoded></item></channel></rss>