Collaborative Research: CIF: Small: Theory for Learning Lossless and Lossy Coding

$399,998FY2023CSENSF

University Of Hawaii, Honolulu

Investigators

Anders Host-Madsencontact Narayana Santhanam

Abstract

An estimated 330,000 billion bytes of data is generated daily in various forms: video, images, and music, but also scientific, economic, and industrial content. This enormous amount of data has already transformed modern life in ways that are transparent (social media) and in ways that are not immediately visible (furthering scientific, business, and economic goals through better modeling, forecast and use of data). Data is communicated, often wirelessly, on massive scales in many formats: videos, images, and music, and in real time applications such as gaming, streaming content, video calls and telemedicine. In order to handle this amount of data, it needs to be compressed by algorithms that examine the data to understand the underlying structure and remove redundant descriptions, seeking thus to use fewer bits to represent the same. Traditional compression method includes the well-known JPEG (joint photographic experts group) compression for images from smartphones, for example. This is a lossy compression method, as some image quality is lost. Lossless compression, with no quality loss, is typically used for compressing computer files (e.g., with Zip) and for lossless music streaming. In recent years, machine learning has become very powerful and used to solve many problems like autonomous driving, speech recognition, and implementing chatbots. A recent focus is to use machine learning for data compression. The aim of this project is to understand the fundamental theory of machine learning for data compression, for example what type of machine learning algorithms can compress data well and how many samples are needed to learn compression well. Through this fundamental understanding of data compression using machine learning, the aim is to develop more powerful compression methods, leading to more efficient use of wireless spectrum and less energy consumption by mobile devices. Recently, there has been much effort in developing machine learning methods for source coding by both researchers and high tech companies. These methods have had some success in beating traditional source coding methods. The project aims to develop fundamental bounds for performance of learning for both lossless and lossy source coding. The problem is framed in a probably approximately correct (PAC) learning framework, both uniform and non-uniform. The first part of the research considers lossless source coding, both of interest in itself and as a basis of lossy source coding, and aims to develop bounds for learning. The project investigates what factors influence the convergence of learning. This is extended with an active learning framework, where the algorithms can adapt how much data they need to examine, using more data for more subtle models and less data for simpler models, and figuring out when the underlying model may be simple with what is known as a "stopping rule." The second part of the research considers lossy source coding, in particular almost lossless source coding and lossless coding of real-valued sources. The aim is to understand in what sense source coding can be learned (e.g., uniform vs non-uniform PAC), and based on this to develop performance bounds. Estimation, compression, and learning have always been known to be subtly different, and these nuances translate into quantifiably large implications for problems harnessing them; this research will resolve some of these tangles, particularly for sources with memory. The fundamental understanding of learning for coding developed through this project will in turn result in the development of better coding methods. This award reflects NSF's statutory mission and has been deemed worthy of support through evaluation using the Foundation's intellectual merit and broader impacts review criteria.

View original record on NSF Award Search →