
A Tourist View of L2 Loss and Related Functions

Published on: 12-06-2024

A brief overview of various loss functions similar to L2 loss and their ramifications

L2 loss is a common loss or objective function used in a variety of statistical machine learning tasks. It is expressed as:

$$\mathcal{L}_{L2} = \sum_{i=1}^n (y_i - \hat{y}_i)^2$$
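As a quick sanity check, here is a minimal NumPy sketch of the formula above (the function name and example arrays are mine, not from any particular library):

```python
import numpy as np

def l2_loss(y_true, y_pred):
    """Sum of squared residuals, matching the L2 formula above."""
    residuals = np.asarray(y_true) - np.asarray(y_pred)
    return np.sum(residuals ** 2)

# Toy example with made-up values
y_true = np.array([1.0, 2.0, 3.0])
y_pred = np.array([1.1, 1.9, 3.5])
print(l2_loss(y_true, y_pred))  # 0.01 + 0.01 + 0.25 = 0.27
```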

This loss function seems pretty innocent at first glance; however, the general form of the equation pops up all over machine learning. In this article, we'll take a tourist's look at the ubiquitous nature of L2 loss and its ramifications for machine learning.

The next two loss functions are derivatives of L2 loss, modified for special considerations. Root mean squared error (RMSE) is simply the square root of the mean of the squared residuals:

$$\text{RMS} = \sqrt{\frac{1}{n}\sum_{i=1}^n (y_i - \hat{y}_i)^2}$$

Because the square root is monotonic, minimizing RMSE with gradient descent finds the same minimizers as L2 loss. However, it rescales the loss to the same units as the original target variable, helping with model interpretability.
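Continuing the same sketch (reusing the hypothetical arrays from above), RMSE just puts those residuals back into the units of the target:

```python
def rmse(y_true, y_pred):
    """Root mean squared error: the square root of the mean squared residual."""
    residuals = np.asarray(y_true) - np.asarray(y_pred)
    return np.sqrt(np.mean(residuals ** 2))

print(rmse(y_true, y_pred))  # sqrt(0.27 / 3) ≈ 0.3
```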

Huber Loss is a piecewise function that combines L1 and L2 loss. I call it L1.5 loss. It is expressed as:

$$ L_\delta = \begin{cases} \frac{1}{2}(y - \hat{y})^2 & \text{for } |y - \hat{y}| \leq \delta, \\ \delta|y - \hat{y}| - \frac{1}{2}\delta^2 & \text{otherwise} \end{cases}$$

It is designed to exploit the corrective sharpness of L2 loss when the residual is small (close to the minimum) while avoiding heavy penalties on outliers by switching to L1 loss far from the minimum.
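A minimal NumPy sketch of the piecewise definition above, assuming a scalar threshold delta:

```python
def huber_loss(y_true, y_pred, delta=1.0):
    """Quadratic for residuals within delta, linear beyond it."""
    r = np.abs(np.asarray(y_true) - np.asarray(y_pred))
    quadratic = 0.5 * r ** 2
    linear = delta * r - 0.5 * delta ** 2
    return np.sum(np.where(r <= delta, quadratic, linear))
```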

Log-cosh loss is another smooth relative of L2 loss. The cosh function is defined as:

$$\cosh(x) = \frac{e^x + e^{-x}}{2}$$

Taking the log of cosh applied to each residual and summing gives the log-cosh loss, which behaves much like MSE for small errors:

$$L_{\text{log-cosh}} = \sum_i \log(\cosh(y_i - \hat{y}_i))$$
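A sketch of this sum in NumPy; the logaddexp identity is just a numerically safer way to evaluate log(cosh(r)) for large residuals:

```python
def log_cosh_loss(y_true, y_pred):
    """Sum of log(cosh(residual)): ~0.5*r^2 near zero, ~|r| - log(2) far from it."""
    r = np.asarray(y_true) - np.asarray(y_pred)
    # log(cosh(r)) = log(e^r + e^{-r}) - log(2), computed stably with logaddexp
    return np.sum(np.logaddexp(r, -r) - np.log(2.0))
```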

Similar to Huber loss, log-cosh loss is robust against outliers, but it comes with the advantage of being twice differentiable everywhere. We can visualize the behavior of these three loss functions in the simple 2D and 3D cases. The loss functions all start off with similar behavior but diverge from one another as the residual grows.

L2 Loss Illustration

K-means clustering uses L2 loss to minimize the distance between cluster centers and their assigned points, taking the loss as the sum of squared L2 distances between centers and data points:

$$J = \sum_{i=1}^k \sum_{x \in C_i} ||x - \mu_i||^2$$
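A sketch of the objective $J$ above, assuming X is an (n, d) array of points, centers is a (k, d) array, and labels holds each point's cluster assignment (the names are mine):

```python
def kmeans_objective(X, centers, labels):
    """Sum of squared L2 distances from each point to its assigned cluster center."""
    diffs = X - centers[labels]   # pair each point with its own center via fancy indexing
    return np.sum(diffs ** 2)
```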

These loss functions are all derivatives of the L2 loss. They are slightly modified to address issues such as sensitivity to outliers or lack of smoothness, but they all follow the same basic recipe as the L2 loss. Now let's take a look at loss functions that are not direct descendants of L2 but share a similar form.

Contrastive loss and triplet loss are both used in metric learning and self-supervised settings to learn embedding representations. Their forms are:

$$\begin{aligned} L_{\text{contrastive}} &= (1-Y)\frac{1}{2}(D_w)^2 + Y\frac{1}{2}\{\max(0, m-D_w)\}^2 \\ L_{\text{triplet}} &= \max(0, ||f(x_a^i) - f(x_p^i)||_2^2 - ||f(x_a^i) - f(x_n^i)||_2^2 + \alpha) \end{aligned} $$

The $D_w$ in contrastive loss represents the distance between the two embeddings, expressed as $\sqrt{\sum_{i=1}^n (x_{1i} - x_{2i})^2}$; $Y$ labels the pair as similar ($Y = 0$) or dissimilar ($Y = 1$), and $m$ is the margin.
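Following the conventions in the formulas above ($Y = 0$ for similar pairs, $Y = 1$ for dissimilar ones, with $m$ and $\alpha$ as the margins), a minimal NumPy sketch looks like this:

```python
def contrastive_loss(d_w, y, margin=1.0):
    """Pull similar pairs (y=0) together, push dissimilar pairs (y=1) past the margin."""
    return (1 - y) * 0.5 * d_w ** 2 + y * 0.5 * np.maximum(0.0, margin - d_w) ** 2

def triplet_loss(anchor, positive, negative, alpha=0.2):
    """Anchor-negative distance should exceed anchor-positive distance by at least alpha."""
    d_pos = np.sum((anchor - positive) ** 2)
    d_neg = np.sum((anchor - negative) ** 2)
    return np.maximum(0.0, d_pos - d_neg + alpha)
```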

In graph theory, graph matching loss measures the distance between two graphs using only their adjacency matrices. A permutation matrix re-orders the nodes of one graph to align it with the other, and the loss is the remaining (Frobenius) distance between the adjacency matrices. It is defined as:

$$L_{GM} = ||A - PBP^T||_F^2$$
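A sketch assuming A and B are the two adjacency matrices and P is a given permutation matrix aligning their node orderings:

```python
def graph_matching_loss(A, B, P):
    """Squared Frobenius norm of A - P B P^T."""
    diff = A - P @ B @ P.T
    return np.sum(diff ** 2)
```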

Flow matching loss is an incredibly powerful loss function that allows us to train flow-based generative models: a network $v_\theta$ is regressed onto a target velocity field $v$ that transports samples from noise toward data. The loss function itself is defined as:

$$L_{FM} = \mathbb{E}_{x_0, t, \epsilon}[||v_\theta(x_t, t) - v(x_t, t)||_2^2]$$
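In practice the expectation is approximated by averaging over a batch. A rough sketch, with v_pred standing in for $v_\theta(x_t, t)$ and v_target for $v(x_t, t)$, both (batch, dim) arrays (the names are mine):

```python
def flow_matching_loss(v_pred, v_target):
    """Batch-averaged squared L2 distance between predicted and target velocities."""
    return np.mean(np.sum((v_pred - v_target) ** 2, axis=-1))
```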

DreamBooth has a general loss equation of: $$\text{total loss} = \text{original loss} + \lambda \cdot \text{prior preservation loss}$$

The prior preservation loss term is there specifically to preserve the model's generative prior after fine-tuning. The full objective, with the preservation term weighted by $\lambda$, is expressed as:

$$\mathbb{E}_{x,c,\epsilon,t}\left[ ||\epsilon - \epsilon_\theta(z_t,t,c)||_2^2 + \lambda ||\epsilon' - \epsilon_\theta (z'_{t'},t',c_{pr})||_2^2 \right]$$
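A sketch of this two-term structure, with eps_pred standing in for $\epsilon_\theta(z_t, t, c)$ and eps_prior_pred for $\epsilon_\theta(z'_{t'}, t', c_{pr})$ (all names are mine, and the batch mean approximates the expectation):

```python
def dreambooth_loss(eps, eps_pred, eps_prior, eps_prior_pred, lam=1.0):
    """Reconstruction term plus lambda-weighted prior-preservation term."""
    recon = np.mean(np.sum((eps - eps_pred) ** 2, axis=-1))
    prior = np.mean(np.sum((eps_prior - eps_prior_pred) ** 2, axis=-1))
    return recon + lam * prior
```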

Style transfer loss for CNNs follows a similar pattern: multiple L2-inspired terms (content and style) contribute to the overall loss function. Many loss functions have multiple terms, often with a coefficient to balance the influence of each term on the loss landscape; such combinations are common in regularization and fine-tuning. Here's a visualization of what combining loss terms could look like.

Comparison of Different Loss Functions

The broad form of the L2 loss is $f(x-\hat{x})$, where $f$ is some magnifying function (usually squaring), $x$ is the true value, and $\hat{x}$ is the predicted value.