Jamshid Sourati | Publications

CVPR

Towards robust and reproducible active learning using neural networks

Munjal, Prateek, Hayat, Nasir, Hayat, Munawar G, Sourati, Jamshid, and Khan, Shadab

In CVPR 2022

Active learning (AL) is a promising ML paradigm that has the potential to parse through large unlabeled data and help reduce annotation cost in domains where labeling entire data can be prohibitive. Recently proposed neural network based AL methods use different heuristics to accomplish this goal. In this study, we show that recent AL methods offer a gain over random baseline under a brittle combination of experimental conditions. We demonstrate that such marginal gains vanish when experimental factors are changed, leading to reproducibility issues and suggesting that AL methods lack robustness. We also observe that with a properly tuned model, which employs recently proposed regularization techniques, the performance significantly improves for all AL methods including the random sampling baseline, and performance differences among the AL methods become negligible. Based on these observations, we suggest a set of experiments that are critical to assess the true effectiveness of an AL method. To facilitate these experiments we also present an open source toolkit. We believe our findings and recommendations will help advance reproducible research in robust AL using neural networks.

JMLR

Asymptotic analysis of objectives based on Fisher information in active learning

Sourati, Jamshid, Akcakaya, Murat, Leen, Todd K, Erdogmus, Deniz, and Dy, Jennifer G

The Journal of Machine Learning Research 2017

Obtaining labels can be costly and time-consuming. Active learning allows a learning algorithm to intelligently query samples to be labeled for a more efficient learning. Fisher information ratio (FIR) has been used as an objective for selecting queries. However, little is known about the theory behind the use of FIR for active learning. There is a gap between the underlying theory and the motivation of its usage in practice. In this paper, we attempt to fill this gap and provide a rigorous framework for analyzing existing FIR-based active learning methods. In particular, we show that FIR can be asymptotically viewed as an upper bound of the expected variance of the log-likelihood ratio. Additionally, our analysis suggests a unifying framework that not only enables us to make theoretical comparisons among the existing querying methods based on FIR, but also allows us to give insight into the development of new active learning approaches based on this objective.

TPAMI

A probabilistic active learning algorithm based on Fisher information ratio

Sourati, Jamshid, Akcakaya, Murat, Erdogmus, Deniz, Leen, Todd K, and Dy, Jennifer G

IEEE Transactions on Pattern Analysis and Machine Intelligence 2017

The task of labeling samples is demanding and expensive. Active learning aims to generate the smallest possible training data set that results in a classifier with high performance in the test phase. It usually consists of two steps of selecting a set of queries and requesting their labels. Among the suggested objectives to score the query sets, information theoretic measures have become very popular. Yet among them, those based on Fisher information (FI) have the advantage of considering the diversity among the queries and tractable computations. In this work, we provide a practical algorithm based on Fisher information ratio to obtain query distribution for a general framework where, in contrast to the previous FI-based querying methods, we make no assumptions over the test distribution. The empirical results on synthetic and real-world data sets indicate that this algorithm gives competitive results.

Entropy

Classification active learning based on mutual information

Sourati, Jamshid, Akcakaya, Murat, Dy, Jennifer G, Leen, Todd K, and Erdogmus, Deniz

Entropy 2016

Selecting a subset of samples to label from a large pool of unlabeled data points, such that a sufficiently accurate classifier is obtained using a reasonably small training set is a challenging, yet critical problem. Challenging, since solving this problem includes cumbersome combinatorial computations, and critical, due to the fact that labeling is an expensive and time-consuming task, hence we always aim to minimize the number of required labels. While information theoretical objectives, such as mutual information (MI) between the labels, have been successfully used in sequential querying, it is not straightforward to generalize these objectives to batch mode. This is because evaluation and optimization of functions which are trivial in individual querying settings become intractable for many objectives when we are to select multiple queries. In this paper, we develop a framework, where we propose efficient ways of evaluating and maximizing the MI between labels as an objective for batch mode active learning. Our proposed framework efficiently reduces the computational complexity from an order proportional to the batch size, when no approximation is applied, to the linear cost. The performance of this framework is evaluated using data sets from several fields showing that the proposed framework leads to efficient active learning for most of the data sets.

TMI

Intelligent labeling based on Fisher information for medical image segmentation using deep learning

Sourati, Jamshid, Gholipour, Ali, Dy, Jennifer G, Tomas-Fernandez, Xavier, Kurugol, Sila, and Warfield, Simon K

IEEE Transactions on Medical Imaging 2019

Deep convolutional neural networks (CNN) have recently achieved superior performance at the task of medical image segmentation compared to classic models. However, training a generalizable CNN requires a large amount of training data, which is difficult, expensive, and time-consuming to obtain in medical settings. Active Learning (AL) algorithms can facilitate training CNN models by proposing a small number of the most informative data samples to be annotated to achieve a rapid increase in performance. We proposed a new active learning method based on Fisher information (FI) for CNNs for the first time. Using efficient backpropagation methods for computing gradients together with a novel low-dimensional approximation of FI enabled us to compute FI for CNNs with a large number of parameters. We evaluated the proposed method for brain extraction with a patch-wise segmentation CNN model in two different learning scenarios: universal active learning and active semi-automatic segmentation. In both scenarios, an initial model was obtained using labeled training subjects of a source data set and the goal was to annotate a small subset of new samples to build a model that performs well on the target subject(s). The target data sets included images that differed from the source data by either age group (e.g. newborns with different image contrast) or underlying pathology that was not available in the source data. In comparison to several recently proposed AL methods and brain extraction baselines, the results showed that FI-based AL outperformed the competing methods in improving the performance of the model after labeling a very small portion of target data set (< 0.25%).

MICCAI-DLMIA

Active deep learning with Fisher information for patch-wise semantic segmentation

Sourati, Jamshid, Gholipour, Ali, Dy, Jennifer G, Kurugol, Sila, and Warfield, Simon K

2018

Deep learning with convolutional neural networks (CNN) has achieved unprecedented success in segmentation, however it requires large training data, which is expensive to obtain. Active Learning (AL) frameworks can facilitate major improvements in CNN performance with intelligent selection of minimal data to be labeled. This paper proposes a novel diversified AL based on Fisher information (FI) for the first time for CNNs, where gradient computations from backpropagation are used for efficient computation of FI on the large CNN parameter space. We evaluated the proposed method in the context of newborn and adolescent brain extraction problem under two scenarios: (1) semi-automatic segmentation of a particular subject from a different age group or with a pathology not available in the original training data, where starting from an inaccurate pre-trained model, we iteratively label small number of voxels queried by AL until the model generates accurate segmentation for that subject, and (2) using AL to build a universal model generalizable to all images in a given data set. In both scenarios, FI-based AL improved performance after labeling a small percentage (less than 0.05%) of voxels. The results showed that FI-based AL significantly outperformed random sampling, and achieved accuracy higher than entropy-based querying in transfer learning, where the model learns to extract brains of newborn subjects given an initial model trained on adolescents.

TIP

Accelerated learning-based interactive image segmentation using pairwise constraints

Sourati, Jamshid, Erdogmus, Deniz, Dy, Jennifer G, and Brooks, Dana H

IEEE Transactions on Image Processing 2014

Algorithms for fully automatic segmentation of images are often not sufficiently generic with suitable accuracy, and fully manual segmentation is not practical in many settings. There is a need for semiautomatic algorithms, which are capable of interacting with the user and taking into account the collected feedback. Typically, such methods have simply incorporated user feedback directly. Here, we employ active learning of optimal queries to guide user interaction. Our work in this paper is based on constrained spectral clustering that iteratively incorporates user feedback by propagating it through the calculated affinities. The original framework does not scale well to large data sets, and hence is not straightforward to apply to interactive image segmentation. In order to address this issue, we adopt advanced numerical methods for eigen-decomposition implemented over a subsampling scheme. Our key innovation, however, is an active learning strategy that chooses pairwise queries to present to the user in order to increase the rate of learning from the feedback. Performance evaluation is carried out on the Berkeley segmentation and Graz-02 image data sets, confirming that convergence to high accuracy levels is realizable in relatively few iterations.

HDSR

Data on how science is made can make science better

Sourati, Jamshid, Belikov, Alexander, and Evans, James

Harvard Data Science Review 2022

Science is an engine of innovation and economic growth and a pathway to prosperity for countries around the world. The increasing availability of scientific publications today poses a data-driven opportunity to better understand and improve science. Scientific publications contain data on the content of published research and metadata on the context that gave rise to that research. Here we discuss and demonstrate the power of constructing, archiving, and analyzing links between scientific data and metadata to construct massive computational observatories of and for modern science. We show how these can be constructed using modern graph databases, and suggest some methods of analysis with potential to unleash sustained value for science and society. These scientific observatories would allow us to diagnose the health of the scientific workforce and institutions, and track the rate of scientific advance. They could enable us to better guide science policy and build portfolios of supported research that balance our societal commitments to diverse participation and prosperity. Moreover, they could enable scientists to surf the deluge of published research to open the scientific frontier in directions that do not follow the current, but open up new views and opportunities for others to follow. Linked scientific data can also enable the construction of artificial intelligence agents designed to complement the disciplinary focus of human scientific attention by proposing possibilities overlooked or underfunded by contemporary scientific institutions. Finally, we argue for the importance of ongoing political and legal support for the promotion of open, linked data to facilitate widespread benefit.

Patent

Systems and methods for high-order modeling of predictive hypotheses

Evans, James, Shi, Feng, and Sourati, Jamshid

2022

Embodiments disclosed herein receive a corpus of documents associated with a predictive hypothesis. The embodiments may generate a hypergraph comprising a plurality of nodes, the plurality of nodes including content nodes representing content elements from the documents and context nodes representing context elements of the documents, and hyperedges representing each document spanning two or more of the plurality of nodes. This hypergraph may be used to store a predictive hypothesis including a subset of the content elements, each content element of the subset of content elements having a vector representation meeting a predictive hypothesis threshold.

arXiv

Accelerating science with human versus alien artificial intelligences

Sourati, Jamshid, and Evans, James

arXiv e-prints 2021

Data-driven artificial intelligence models fed with published scientific findings have been used to create powerful prediction engines for scientific and technological advance, such as the discovery of novel materials with desired properties and the targeted invention of new therapies and vaccines. These AI approaches typically ignore the distribution of human prediction engines – scientists and inventors – who continuously alter the landscape of discovery and invention. As a result, AI hypotheses are designed to substitute for human experts, failing to complement them for punctuated collective advance. Here we show that incorporating the distribution of human expertise into self-supervised models by training on inferences cognitively available to experts dramatically improves AI prediction of future human discoveries and inventions. Including expert-awareness into models that propose (a) valuable energy-relevant materials increases the precision of materials predictions by 100%, (b) repurposing thousands of drugs to treat new diseases increases precision by 43%, and (c) COVID-19 vaccine candidates examined in clinical trials by 260%. These models succeed by predicting human predictions and the scientists who will make them. By tuning AI to avoid the crowd, however, it generates scientifically promising "alien" hypotheses unlikely to be imagined or pursued without intervention, not only accelerating but punctuating scientific advance. By identifying and correcting for collective human bias, these models also suggest opportunities to improve human prediction by reformulating science education for discovery.

Publications

Generic Active Learning

Applied Active Learning

Unsupervised Learning

AI-assisted Knowledge Discovery