Node classification is an important problem in graph data management. It is commonly solved by various label propagation methods (including Graph Neural Networks) that work iteratively starting from a few labeled seed nodes. For graphs with arbitrary compatibilities between classes, these methods crucially depend on knowing the compatibility matrix that must be provided by either domain experts or learned from large enough data. Can we instead directly estimate the correct compatibilities from a sparsely labeled graph in a principled and scalable way? We answer this question affirmatively and suggest a method called distant compatibility estimation that works even on extremely sparsely labeled graphs (e.g., 1 in 10,000 nodes is labeled) in a fraction of the time it later takes to label the remaining nodes. Our approach first creates multiple factorized graph representations (with size independent of the graph) and then performs estimation on these smaller graph sketches. We refer to algebraic amplification as the more general idea of leveraging algebraic properties of an algorithm’s update equations (in particular, the distributivity law) to amplify sparse signals. We show that our estimator is by orders of magnitude faster than a textbook approach based on hold-outs and that the end-to-end classification accuracy is comparable to using gold standard compatibilities. 

SIGMOD 2020: Factorized Graph Representations for Semi-Supervised Learning from Sparse Data 


VLDB 2015: Linearized and single-pass belief propagation 



Video Recording