Abstract
We study the problem of lossless universal source coding for stationary memoryless sources on countably infinite alphabets. This task is generally not achievable without restricting the class of sources over which universality is desired. We propose natural families of sources characterized by a common dominating envelope.
We particularly emphasize the notion of adaptivity, which is the ability to perform as well as an oracle knowing the envelope, without actually knowing it. This is closely related to the notion of hierarchical universal source coding, but with the important difference that families of envelope classes are not discretely indexed and there is no well-ordered nesting.
We extend the classes of envelopes over which adaptive universal source coding is possible, namely by including max-stable (heavy-tailed) and sub-exponential (light-tailed) envelopes which are relevant models in many applications, such as natural language modeling. Minimax lower bounds on the redundancy of any code on such envelope classes have been recently clarified. We propose constructive codes that do not use knowledge of the envelope. These codes are computationally efficient and structured to use an expanding threshold for auto-censoring (ETAC). We prove that the ETAC-code achieves the lower bound on the minimax redundancy within a factor logarithmic in the sequence length, and can be therefore qualified as a near-adaptive code over families of heavy-tailed envelopes. For finite and light-tailed envelopes the penalty is even less, and the same code follows closely related results that explicitly make the light-tailed assumption. Our technical results are founded on methods from regular variation theory and concentration of measure.
Joint work with D. Bontemps, A. Garivier, E. Gassiat, M Ohannessian.