## Analyzing the effect of side information: a perturbative approach

Quite often we need to make decisions. These decisions will typically depend on some information that we have obtained, either through direct observation or because it has been communicated to us by someone (or something) else. This side information is not usually perfect. For example, our measurements are subject to error, and information communicated to us suffers from imperfections in the communication medium.

How can we understand the effect of imperfect side information on our decisions?

More precisely, suppose that we seek to solve an optimization problem

${\max_{\mathbf{x} \in \mathcal{X}} f(\mathbf{x};\mathbf{y})}$,

where ${\mathbf{x}}$ is what we want to optimize, ${\mathbf{y}}$ is the side information we have, ${\mathcal{X}}$ are some constraints on ${\mathbf{x}}$, and ${f}$ is some function that describes how good our decision is. We use the notation ${f(\mathbf{x};\mathbf{y})}$ to distinguish between the variables that we are optimizing over and the side information that we have.

Now, what if our side information is not perfect? This means that our observation ${\hat{\mathbf{y}}}$ differs from the true value ${\mathbf{y}}$. The cost of this imperfect side information is then given by

${L = |f^*(\hat{\mathbf{y}}) - f^*(\mathbf{y})|}$,

where ${f^*(\mathbf{y}) = \max_{\mathbf{x} \in \mathcal{X}} f(\mathbf{x};\mathbf{y})}$.

In general, it is very difficult to compute ${L}$ since we typically cannot write the solution in closed-form. So, what can we do?

Recently, with Slava Kungurtsev, I have been looking into this problem. Our approach is based on a perturbative framework: we evaluate the cost ${L}$ when ${\hat{\mathbf{y}}}$ differs from ${\mathbf{y}}$ by only a small amount. In particular, we start with a Taylor series of the optimal value of ${f}$

${f^*(\hat{\mathbf{y}}) = f^*(\mathbf{y}) + D_{\tilde{\mathbf{e}}}f^*(\mathbf{y})\|\mathbf{e}\| + o(\|\mathbf{e}\|)}$,

where ${\mathbf{e} = \hat{\mathbf{y}} - \mathbf{y}}$ is the error in the side information and ${\tilde{\mathbf{e}} = \mathbf{e}/\|\mathbf{e}\|}$ is the normalized direction.

It is important to note that we are not taking a Taylor series of the function ${f}$, which we can typically write in closed form. Instead, we are expanding the optimal value of ${f}$. The key consequence is that it is not immediately obvious how to evaluate the directional derivative ${D_{\tilde{\mathbf{e}}}f^*(\mathbf{y})}$.

It appears that John Danskin, in the 1960s, was the first to study how to evaluate directional derivatives of optimal value functions. Let’s start with the simplest form of his result.

Proposition: Let the real-valued function ${f(\mathbf{x};y):\mathbb{R}^n \times \mathbb{R} \rightarrow \mathbb{R}}$ be twice differentiable on ${\mathcal{X} \times \mathbb{R}}$, where ${\mathcal{X}}$ is a compact convex subset of ${\mathbb{R}^n}$, and strictly concave in ${\mathbf{x}}$ for each ${y \in \mathbb{R}}$. Let ${\mathbf{x}^*(y)}$ be the (unique) maximizer of ${f(\cdot;y)}$ on ${\mathcal{X}}$ for fixed ${y \in \mathbb{R}}$, and denote ${\psi(y) = f(\mathbf{x}^*(y);y)}$. Then, the derivative of ${\psi}$ is given by ${\psi'(y) = f_y(\mathbf{x}^*(y);y)}$.

The proposition can be proved as follows. Under the stated hypotheses, we can apply the implicit function theorem to write the (unique) maximizer of ${f}$ as a function of the side information ${y}$; that is, ${\mathbf{x}^* = \mathbf{x}^*(y)}$. Our optimal value function is then given by ${\psi(y) = f^*(y) = f(\mathbf{x}^*(y);y)}$.

To obtain the result, we now just need to differentiate ${\psi(y)}$ with respect to ${y}$. Applying the chain rule for multivariable functions, we have

${\psi'(y) = f_y(\mathbf{x}^*(y);y) + \nabla_{\mathbf{x}} f(\mathbf{x}^*(y);y)^T\frac{d\mathbf{x}^*(y)}{dy}}$.

Since ${\mathbf{x}^*(y)}$ maximizes ${f(\cdot;y)}$, the first-order optimality condition gives ${\nabla_{\mathbf{x}} f(\mathbf{x}^*(y);y) = 0}$, so that

${\psi'(y) = f_y(\mathbf{x}^*(y);y)}$

as required.
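To make this concrete, here is a minimal numerical check of the proposition on a toy problem of my own choosing (not from the paper): take ${f(x;y) = -x^2 + yx}$, which is strictly concave in ${x}$ with maximizer ${x^*(y) = y/2}$, so the proposition predicts ${\psi'(y) = f_y(x^*(y);y) = x^*(y)}$.

```python
from scipy.optimize import minimize_scalar

def f(x, y):
    # toy objective: strictly concave in x, with f_y(x; y) = x
    return -x**2 + y * x

def maximize(y):
    # optimal value psi(y) and maximizer x*(y) over the interval [-10, 10]
    res = minimize_scalar(lambda x: -f(x, y), bounds=(-10, 10), method="bounded")
    return f(res.x, y), res.x

y, h = 1.5, 1e-5

# finite-difference derivative of the optimal value function psi
psi_plus, _ = maximize(y + h)
psi_minus, _ = maximize(y - h)
fd = (psi_plus - psi_minus) / (2 * h)

# Danskin: psi'(y) = f_y(x*(y); y) = x*(y); no differentiation of x*(y) needed
_, x_star = maximize(y)
danskin = x_star

print(fd, danskin)  # both should be close to y/2 = 0.75
```

Note that the finite difference requires re-solving the optimization twice, while the Danskin formula only needs the solution at ${y}$ itself.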

To illustrate how to apply this approach, let’s consider a simple application of the proposition. Recently, I have been studying ${\alpha}$-stable noise channels (see our ISIT2016 paper). A tractable lower bound on the capacity is given by

${R = \frac{1}{\alpha}\log \left(1 + \frac{\sigma_{\mathbf{X}}^{\alpha}}{\sigma_{\mathbf{N}}^{\alpha}}\right)}$,

where ${\sigma_{\mathbf{X}}}$ and ${\sigma_{\mathbf{N}}}$ are parameters analogous to the variance for the power-constrained Gaussian noise channel (for more details see our paper). As such, we can consider the achievable rate of parallel ${\alpha}$-stable noise channels and find the optimal choice of input signal parameters (closely related to the power). In particular, writing ${\rho_k = \sigma_{\mathbf{X},k}^2}$, we have the following optimization problem

${\max_{\rho_{k},~k = 1,2,\ldots,n} \sum_{k=1}^n \frac{1}{\alpha} \log\left(1 + \frac{\rho_{k}^{\alpha/2}}{\sigma_{\mathbf{N},k}^{\alpha}}\right)}$

subject to

${\sum_{k=1}^n \rho_k \leq \sigma_{\mathbf{X},\max}^2}$

and

${\rho_k \geq 0,~k = 1,2,\ldots,n}$.

Observe that in the case ${\alpha = 2}$ (when the noise is Gaussian), we recover the standard waterfilling problem. Moreover, the optimization problem above is convex (we are maximizing a strictly concave objective) but does not have a closed-form solution.
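As a sanity check, the problem can be handed to a generic solver. The sketch below (with illustrative noise parameters and budget of my own choosing, not values from the paper) solves the ${\alpha = 2}$ case with SciPy and compares against the classical waterfilling solution ${\rho_k = \max(0, \mu - \sigma_{\mathbf{N},k}^2)}$.

```python
import numpy as np
from scipy.optimize import minimize

# Illustrative noise scale parameters and power budget (my own choices)
sigma_n = np.array([0.5, 1.0, 2.0])
power = 4.0
alpha = 2.0

def neg_rate(rho):
    # negative of (1/alpha) * sum_k log(1 + rho_k^{alpha/2} / sigma_k^alpha)
    return -np.sum(np.log1p(rho**(alpha / 2) / sigma_n**alpha)) / alpha

res = minimize(neg_rate, np.full(3, power / 3),
               bounds=[(0.0, power)] * 3,
               constraints={"type": "ineq", "fun": lambda r: power - r.sum()},
               method="SLSQP")
rho_star = res.x

# Classical waterfilling for alpha = 2: rho_k = max(0, mu - sigma_k^2), with the
# water level mu set to meet the budget; here channels 1 and 2 are active, so
# mu = (power + sigma_1^2 + sigma_2^2) / 2 = 2.625
mu = (power + sigma_n[0]**2 + sigma_n[1]**2) / 2
rho_wf = np.maximum(0.0, mu - sigma_n**2)
print(rho_star, rho_wf)  # both close to [2.375, 1.625, 0]
```

For ${\alpha \neq 2}$ the same solver applies, but there is no closed-form solution to compare against, which is exactly where the perturbative approach becomes useful.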

Suppose we want to know how ${\alpha}$ affects the optimal rate (obtained by solving the optimization problem). As observed above, the Gaussian case ${\alpha = 2}$ corresponds to the waterfilling problem and forms a natural baseline against which to compare the rate for other values of ${\alpha}$. Taking the Taylor series expansion and applying the proposition, we have

${|R^*(\alpha) - R^*(2)| \leq |D_{\alpha}R^*(2)|\,|2 - \alpha| + o(|2-\alpha|)}$,

with the full equation given in our ISIT2016 paper.

The result is illustrated in the following figure (where we use the approximation obtained by ignoring the little-o term). Observe that the approximation from our perturbative approach is in good agreement with a numerical estimate for a range of ${\alpha}$. This suggests that a perturbative approach is a promising way of studying otherwise intractable value functions.
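Since the constraint set does not depend on ${\alpha}$, the proposition says that ${D_{\alpha}R^*(2)}$ is just the partial derivative of the objective with respect to ${\alpha}$, evaluated at the optimizer for ${\alpha = 2}$: no re-optimization is needed. The following sketch (with illustrative parameters of my own choosing, not the ones behind the figure) compares the resulting first-order approximation against simply re-solving the problem at a nearby ${\alpha}$.

```python
import numpy as np
from scipy.optimize import minimize

# Illustrative parameters (my own choices, not those used in the paper)
sigma_n = np.array([0.5, 1.0, 2.0])
power = 4.0

def rate(rho, alpha):
    # (1/alpha) * sum_k log(1 + rho_k^{alpha/2} / sigma_k^alpha)
    return np.sum(np.log1p(rho**(alpha / 2) / sigma_n**alpha)) / alpha

def optimal_rate(alpha):
    # solve the parallel-channel problem numerically: R*(alpha) and optimizer
    res = minimize(lambda r: -rate(r, alpha), np.full(3, power / 3),
                   bounds=[(1e-9, power)] * 3,
                   constraints={"type": "ineq", "fun": lambda r: power - r.sum()},
                   method="SLSQP")
    return -res.fun, res.x

# Danskin: D_alpha R*(2) is the partial derivative of the objective in alpha
# at the alpha = 2 optimizer, so rho* is held fixed in the finite difference
R2, rho_star = optimal_rate(2.0)
h = 1e-5
D_alpha = (rate(rho_star, 2.0 + h) - rate(rho_star, 2.0 - h)) / (2 * h)

# first-order perturbative approximation vs. re-solving at alpha = 1.9
alpha = 1.9
R_exact, _ = optimal_rate(alpha)
R_approx = R2 + D_alpha * (alpha - 2.0)
print(R_exact, R_approx)  # should agree closely for alpha near 2
```

The small lower bound on ${\rho_k}$ keeps the fractional power ${\rho_k^{\alpha/2}}$ well behaved for the numerical gradients when ${\alpha < 2}$.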

We can still go further. In the example above, we only allowed perturbations in ${\mathbb{R}}$. It is also possible to consider perturbations in ${\mathbb{R}^n}$ or ${\mathbb{C}^n}$. The complex case where perturbations are in ${\mathbb{C}^n}$ is especially important in wireless communications as the side information can be interpreted in terms of channel state information.

In fact, we have worked through the details in the case of multi-user MIMO with limited feedback and optimal power control. In a future post I will describe how we can deal with perturbations in ${\mathbb{C}^n}$ and what we learn from the multi-user MIMO case study.