Single-linkage clustering is a fundamental method for data analysis. A single-linkage clustering is obtained by iteratively merging the closest pair of clusters, where initially every data point forms its own cluster and the distance between clusters is defined by their closest pair of points. Algorithmically, one can compute a single-linkage $k$-clustering (a partition into $k$ clusters) by computing a minimum spanning tree and dropping the $k-1$ most costly edges. The resulting forest spans the clusters. This motivates defining the cost $\mathrm{cost}_k$ of a single-linkage $k$-clustering as the weight of the corresponding spanning forest. If we view single-linkage clustering as computing a hierarchy of clusterings, the cost of the hierarchy is defined as the sum of the costs of the individual clusterings. We assume that the distances between data points are given as a graph $G$ with average degree $d$ and edge weights from $\{1,\dots, W\}$. If there is no edge, we assume the distance to be infinite. We give a sampling-based algorithm that is given access to the adjacency list representation of $G$ and that computes in $\tilde O(d\sqrt{W}/\varepsilon^3)$ time a succinct representation of estimates $\widehat{\mathrm{cost}}_k$ of the cost of every $k$-clustering such that $\sum_{k=1}^n |\widehat{\mathrm{cost}}_k - \mathrm{cost}_k| \le \varepsilon\, \mathrm{cost}(G)$, where $\mathrm{cost}(G) = \sum_{k=1}^n \mathrm{cost}_k$ is the cost of the hierarchical clustering and $0 < \varepsilon < 1$. Thus we can approximate the cost of every $k$-clustering up to an absolute error that \emph{on average} is a $(1+\varepsilon)$-approximation of the true cost. In particular, our result also implies that we can estimate $\mathrm{cost}(G)$ up to a factor of $1\pm \varepsilon$ in the same running time. We extend our results to the setting in which edges represent similarities and the clusters are defined by a maximum spanning tree.
In this setting, the running time of our algorithm is $\tilde O(dW/\varepsilon^3)$. We show that our algorithm is almost optimal by giving near-matching lower bounds for estimating $\mathrm{cost}(G)$. We conduct extensive experiments that validate our theoretical findings. Joint work with Christian Sohler and Yi Xu.
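The MST-based characterization above can be illustrated with a short exact (non-sublinear) sketch. This is plain Kruskal's algorithm on a hypothetical toy graph, not the paper's sampling-based estimator:

```python
# Exact single-linkage k-clustering costs via Kruskal's MST algorithm.
# Illustration only: the abstract's contribution is a sublinear-time
# *estimator*; this sketch computes the exact costs on a small connected graph.

def find(parent, x):
    # Union-find "find" with path halving.
    while parent[x] != x:
        parent[x] = parent[parent[x]]
        x = parent[x]
    return x

def single_linkage_costs(n, edges):
    """edges: list of (weight, u, v). Returns [cost_1, ..., cost_n], where
    cost_k is the MST weight minus the k-1 heaviest MST edge weights."""
    parent = list(range(n))
    mst = []
    for w, u, v in sorted(edges):  # Kruskal: scan edges by increasing weight
        ru, rv = find(parent, u), find(parent, v)
        if ru != rv:
            parent[ru] = rv
            mst.append(w)
    assert len(mst) == n - 1, "graph must be connected"
    mst.sort(reverse=True)  # dropping the k-1 heaviest MST edges gives cost_k
    total = sum(mst)
    return [total - sum(mst[:k - 1]) for k in range(1, n + 1)]
```

For example, on the 4-vertex path with weights 1, 2, 3 plus a heavy chord of weight 10, the MST is the path, the costs are [6, 3, 1, 0], and the hierarchy cost $\mathrm{cost}(G)$ is their sum, 10.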
Given i.i.d. samples from an unknown distribution $P$, the goal of distribution learning is to recover the parameters of a distribution that is close to $P$. We revisit this problem when the learner is given as advice the parameters of a distribution $Q$. Suppose $P$ is from the class of balanced product distributions over the $n$-dimensional Boolean hypercube. We show that there is an efficient algorithm to learn $P$ within TV distance $\epsilon$ with sample complexity $\tilde{O}(n^{1-c\eta}/\epsilon^2)$ for a constant $c>0$, provided that $\|\bm{p} - \bm{q}\|_1 < \epsilon\, n^{(1-\eta)/2}$. Here, $\bm{p}$ and $\bm{q}$ are the mean vectors of $P$ and $Q$ respectively, and $\eta$ is unknown a priori. Without advice, it is known that the sample complexity must be linear in $n$. We show similar results for learning high-dimensional Gaussians as well.
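For intuition, here is a minimal sketch of the naive, advice-free baseline (coordinate-wise empirical means). It is not the paper's sublinear-sample algorithm, and the sample data below are made up:

```python
# Advice-free baseline: coordinate-wise empirical means of a product
# distribution over {0,1}^n. This naive learner needs a number of samples
# linear in n; the abstract's point is that a good advice vector q lowers
# the sample complexity to sublinear in n. (Sketch only, with toy data.)

def learn_product_means(samples):
    """Estimate the mean vector p of a product distribution over {0,1}^n
    from i.i.d. samples, one coordinate at a time."""
    n, m = len(samples[0]), len(samples)
    return [sum(s[i] for s in samples) / m for i in range(n)]
```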
Yiqiao Bao
Title: Nearly Tight Bounds on Testing of Metric Properties
Abstract: Given a non-negative $n \times n$ matrix viewed as a set of distances between $n$ points, we consider the property testing problem of deciding if it is a metric. For general metrics, our paper is the first to consider these questions. We prove an upper bound of $O(n^{2/3}/\epsilon^{4/3})$ on the query complexity for this problem. Our algorithm is simple, but the analysis requires great care in bounding the variance on the number of violating triangles in a sample. When $\epsilon$ is a slowly decreasing function of $n$ (rather than a constant, as is standard), we prove a lower bound of $\Omega(n^{2/3})$ with matching dependence on $n$, ruling out any property tester with $o(n^{2/3})$ query complexity unless its dependence on $1/\epsilon$ is super-polynomial.
Based on joint work with Sampath Kannan and Erik Waingarten.
Amir Azarmehr
Title: Stochastic Matching via In-n-Out Local Computation Algorithms
Abstract: This talk is based on joint work with Soheil Behnezhad, Alma Ghafari, and Ronitt Rubinfeld.
Consider the following stochastic matching problem. We are given a known graph G = (V, E). An unknown subgraph G_p = (V, E_p) is realized where E_p includes every edge of E independently with some probability p ∈ (0, 1]. The goal is to query a sparse subgraph H of G, such that the realized edges in H include an approximate maximum matching of G_p.
We show that a poly(1/p)-degree subgraph can obtain a (1 − ε)-approximation for any desirably small fixed ε > 0, unifying two long lines of work that achieve O(1)-approximations using poly(1/p) degree, and (1 − ε)-approximation using quasipoly(1/p) degree. Beyond its quantitative improvement, a key conceptual contribution of our work is to connect local computation algorithms (LCAs) to the stochastic matching problem for the first time. While prior work on LCAs mainly focuses on their out-queries (the number of vertices probed to produce the output of a given vertex), our analysis also bounds the in-queries (the number of vertices that probe a given vertex). We prove that the outputs of LCAs with bounded in- and out-queries (in-n-out LCAs for short) have limited correlation, a property that our analysis crucially relies on and might find applications beyond stochastic matchings.
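To make the problem setup concrete, here is a small simulation harness. It uses a greedy maximal matching (a 1/2-approximation) purely to keep the illustration short; it is not the paper's (1 − ε) LCA-based construction, and the graph below is a made-up example:

```python
import random

def greedy_matching(edges):
    """Greedy maximal matching (a 1/2-approximation), used here only for
    illustration, not the paper's (1 - eps) machinery."""
    matched, M = set(), []
    for u, v in edges:
        if u not in matched and v not in matched:
            matched.update((u, v))
            M.append((u, v))
    return M

def evaluate_subgraph(E, H, p, trials, rng):
    """Average, over random realizations E_p (each edge kept independently
    with probability p), of the matching size found using only realized
    edges of the query subgraph H versus all realized edges of G."""
    ratios = []
    for _ in range(trials):
        realized = {e for e in E if rng.random() < p}
        m_H = len(greedy_matching([e for e in H if e in realized]))
        m_G = len(greedy_matching([e for e in E if e in realized]))
        ratios.append(1.0 if m_G == 0 else m_H / m_G)
    return sum(ratios) / trials
```

With H = E the ratio is trivially 1; the interesting regime, which the result above addresses, is a sparse H whose maximum degree is only poly(1/p).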
Janani Sundaresan
Title: Distributed Triangle Detection is Hard in Few Rounds
Abstract: In the CONGEST model, $n$ vertices with information only about their neighborhoods are allowed to communicate to each other over the edges of the input graph. Communication happens in synchronous rounds with a bandwidth of $O(\log n)$. We prove that detecting a triangle in this model requires $\Omega(\log \log n)$ rounds.
Prior to our work, the only lower bound was that at least two rounds are necessary.
It is known that standard communication complexity arguments that have been used to get lower bounds in the CONGEST model in the past are incapable of giving any meaningful multi-round lower bounds for this problem. Our main contribution is a new information-theoretic technique that combines classical round elimination arguments of communication complexity with the point-to-point communication aspects of distributed networks and can be of independent interest.
Joint work with Sepehr Assadi (https://arxiv.org/abs/2504.01802).
We introduce a novel technique for "lifting" dimension lower bounds for linear sketches in the continuous setting to dimension lower bounds for linear sketches with polynomially-bounded integer entries when the input is a polynomially-bounded integer vector. Using this technique, we obtain the first optimal sketching lower bounds for discrete inputs in a data stream, for classical problems such as approximating the frequency moments, estimating the operator norm, and compressed sensing. Additionally, we lift the adaptive attack of Hardt and Woodruff (STOC, 2013) for breaking any real-valued linear sketch via a sequence of real-valued queries, and show how to obtain an attack on any integer-valued linear sketch using integer-valued queries. This shows that there is no linear sketch in a data stream with insertions and deletions that is adversarially robust for approximating any L_p norm of the input. This resolves a central open question for adversarially robust streaming algorithms. To do so, we introduce a new pre-processing technique of independent interest which, given an integer-valued linear sketch, increases the dimension of the sketch by only a constant factor in order to make the orthogonal lattice to its row span smooth, enabling us to leverage results in lattice theory on discrete Gaussian distributions and reason that efficient discrete sketches imply efficient continuous sketches. Our work resolves open questions from the Banff '14 and '17 workshops on Communication Complexity and Applications, as well as the STOC '21 and FOCS '23 workshops on adaptivity and robustness.
Joint work with Elena Gribelyuk, Honghao Lin, Huacheng Yu, and Samson Zhou
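As background for the terminology, a classical example of an integer-valued linear sketch is the AMS-style ±1 sketch for the second frequency moment. The sketch below (with toy parameters, not taken from the paper) illustrates the linearity that makes such sketches usable in insertion/deletion streams, and the kind of object whose dimension the lower bounds constrain:

```python
import random

def make_sketch_matrix(rows, n, rng):
    """Random +/-1 matrix: a linear sketch with bounded integer entries,
    in the spirit of the AMS F2 sketch. Parameters here are hypothetical."""
    return [[rng.choice((-1, 1)) for _ in range(n)] for _ in range(rows)]

def apply_sketch(S, x):
    # Linearity is the key property: sketch(x + y) = sketch(x) + sketch(y),
    # so a stream of insertions and deletions just adds/subtracts columns of S.
    return [sum(row[i] * x[i] for i in range(len(x))) for row in S]

def estimate_f2(Sx):
    """Each squared sketch coordinate is an unbiased estimate of
    ||x||_2^2 = F2; averaging over rows reduces the variance."""
    return sum(v * v for v in Sx) / len(Sx)
```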
This reunion workshop is for long-term participants in the program " Sublinear Algorithms" held in the summer 2024 semester. It will provide an opportunity to meet old and new friends. Moreover, we hope that it will give everyone a chance to reflect on the...
Going beyond worst-case complexity, there have been exciting developments in the understanding of complexity of random instances of CSPs such as: (1) Refutation of random CSPs: The method of Kikuchi matrices has led to new algorithms for refutation, and...
We consider indistinguishability obfuscation (iO) for a multi-output circuit C of size s, where s is the number of AND/OR/NOT gates in C. Under the worst-case assumption that NP is not contained in BPP, we establish that there is no efficient indistinguishability obfuscation scheme that outputs circuits of size s + o(s/log s). In other words, to be secure, an efficient iO scheme must incur an s/log s additive overhead in the size of the obfuscated circuit.
The proof of our main result builds on a connection between obfuscation and meta-complexity, and on the NP-hardness of circuit minimization for multi-output circuits established by Ilango, Loff, and Oliveira [ILO20], together with other techniques from cryptography and complexity theory.
Based on a joint work with Zhenjian Lu, Igor C. Oliveira and Rafael Pass.
In this talk, we will introduce a new public-key encryption scheme based on the hardness of the standard planted clique conjecture together with a relatively mild hardness conjecture about noisy k-LIN over expanders that is not known to imply public-key encryption on its own. Both of our conjectures correspond to natural average-case variants of NP-complete problems and have been studied for multiple decades, with unconditional lower bounds supporting them in a variety of restricted models of computation. Our work extends the seminal work of Applebaum, Barak, and Wigderson [STOC'10] and gives the first such construction using the planted clique conjecture in its standard form.