Information-Computation Tradeoffs for Noiseless Linear Regression with Oblivious Contamination
By: Ilias Diakonikolas , Chao Gao , Daniel M. Kane and more
Potential Business Impact:
Find hidden patterns even with bad data.
We study the task of noiseless linear regression under Gaussian covariates in the presence of additive oblivious contamination. Specifically, we are given i.i.d.\ samples from a distribution $(x, y)$ on $\mathbb{R}^d \times \mathbb{R}$ with $x \sim \mathcal{N}(0,\mathbf{I}_d)$ and $y = x^\top \beta + z$, where $z$ is drawn independently of $x$ from an unknown distribution $E$. Moreover, $z$ satisfies $\mathbb{P}_E[z = 0] = \alpha>0$. The goal is to accurately recover the regressor $\beta$ to small $\ell_2$-error. Ignoring computational considerations, this problem is known to be solvable using $O(d/\alpha)$ samples. On the other hand, the best known polynomial-time algorithms require $\Omega(d/\alpha^2)$ samples. Here we provide formal evidence that the quadratic dependence in $1/\alpha$ is inherent for efficient algorithms. Specifically, we show that any efficient Statistical Query algorithm for this task requires VSTAT complexity at least $\tilde{\Omega}(d^{1/2}/\alpha^2)$.
Similar Papers
Confidence Intervals for Linear Models with Arbitrary Noise Contamination
Statistics Theory
Finds reliable answers even with bad data.
On robust recovery of signals from indirect observations
Statistics Theory
Fixes messy data to find hidden information.
Online Linear Regression with Paid Stochastic Features
Machine Learning (CS)
Learns better by choosing how much to pay for cleaner data.