BIGDATA: F: New Algorithms of Online Machine Learning for Big Data
University of Iowa, Iowa City, IA
Investigators
Abstract
This project is developing innovative, theoretically rigorous algorithms for learning from continuously arriving (streaming) data. The specific challenges addressed are class imbalance (one of the concepts to be learned is very rare, as in disease detection), cost constraints on obtaining features (e.g., computationally expensive image processing), and cost constraints on obtaining class labels (e.g., human annotation). The algorithms developed in this project make it possible to address big data challenges in streaming data that stem from increased complexity in aspects such as heavily imbalanced data distributions, ultrahigh-dimensional features, large numbers of labels, and highly complex constraints. The project will also contribute to training future professionals in big data analytics, including through participation in the University of Iowa's undergraduate summer research program and high school student training program.

Most existing online learning algorithms and their analyses were developed with the goal of minimizing a symmetric measure (e.g., the classification error), without considering the practical constraints that arise in big data. This project addresses imbalanced data by developing online learning algorithms that minimize asymmetric measures, including the F-score, the area under the ROC curve, and the area under the precision-recall curve. Convex or non-convex surrogate loss functions that closely approximate these asymmetric measures are constructed and minimized in an online fashion. The project also develops online algorithms under three types of constraints that arise in the big data context, namely constraints on computing costs, constraints on query costs, and complex inequality constraints, by exploring techniques from randomized algorithms, active learning, and convex optimization. The developed algorithms are being evaluated in real applications including biomedical semantic indexing, social media mining, and image annotation.
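To make the idea of minimizing a surrogate for an asymmetric measure online more concrete, the following is a minimal sketch and not the project's own algorithm. It assumes a linear scorer, a pairwise hinge loss as the convex surrogate for the area under the ROC curve, and small reservoir-sampled buffers of past examples per class. All names (`OnlinePairwiseAUC`, `make_stream`, `auc`) and all parameter values are hypothetical and chosen only for illustration.

```python
import random


def dot(w, x):
    """Inner product of two equal-length lists."""
    return sum(wi * xi for wi, xi in zip(w, x))


class OnlinePairwiseAUC:
    """Illustrative online learner: gradient steps on a pairwise hinge
    surrogate for AUC, pairing each new example with buffered examples
    of the opposite class (an assumed design, not the project's method)."""

    def __init__(self, dim, lr=0.1, buffer_size=50, seed=0):
        self.w = [0.0] * dim
        self.lr = lr
        self.buffer_size = buffer_size
        self.buffers = {1: [], -1: []}   # one reservoir per class
        self.seen = {1: 0, -1: 0}        # examples seen per class
        self.rng = random.Random(seed)

    def score(self, x):
        return dot(self.w, x)

    def update(self, x, y):
        """Process one streaming example with label y in {+1, -1}."""
        opposite = self.buffers[-y]
        if opposite:
            grad = [0.0] * len(self.w)
            for z in opposite:
                # diff always points from the negative to the positive example.
                diff = [y * (xi - zi) for xi, zi in zip(x, z)]
                # Hinge is active when the positive does not outscore
                # the negative by a margin of at least 1.
                if dot(self.w, diff) < 1.0:
                    grad = [g - d for g, d in zip(grad, diff)]
            step = self.lr / len(opposite)
            self.w = [wi - step * g for wi, g in zip(self.w, grad)]
        self._remember(x, y)

    def _remember(self, x, y):
        """Reservoir sampling keeps a uniform sample of each class."""
        self.seen[y] += 1
        buf = self.buffers[y]
        if len(buf) < self.buffer_size:
            buf.append(x)
        else:
            j = self.rng.randrange(self.seen[y])
            if j < self.buffer_size:
                buf[j] = x


def auc(scores, labels):
    """Exact AUC: fraction of (positive, negative) pairs ranked correctly,
    counting ties as half."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == -1]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def make_stream(n, pos_rate, seed):
    """Synthetic imbalanced stream: rare positives centered at (2, 2),
    negatives centered at (0, 0), unit-variance Gaussian noise."""
    rng = random.Random(seed)
    stream = []
    for _ in range(n):
        y = 1 if rng.random() < pos_rate else -1
        mean = 2.0 if y == 1 else 0.0
        stream.append(([rng.gauss(mean, 1.0), rng.gauss(mean, 1.0)], y))
    return stream
```

One way to try the sketch is to pass each `(x, y)` pair from `make_stream(2000, 0.05, seed=1)` to `update`, then score a fresh stream and compare the result of `auc` against the 0.5 obtained from an untrained model. The per-class buffers are what keep the rare class represented in every update, which a plain error-minimizing update on the same stream would not guarantee.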