Below is a list of my selected publications by category, in reverse chronological order. See my CV for a more comprehensive list, and my Google Scholar profile for the complete record.
Generic Active Learning
Towards robust and reproducible active learning using neural networks
Munjal, Prateek,
Hayat, Nasir,
Hayat, Munawar G,
Sourati, Jamshid,
and Khan, Shadab
In CVPR
2022
Active learning (AL) is a promising ML paradigm that has the potential to parse through large unlabeled data and help reduce annotation cost in domains where labeling entire data can be prohibitive. Recently proposed neural network based AL methods use different heuristics to accomplish this goal. In this study, we show that recent AL methods offer a gain over random baseline under a brittle combination of experimental conditions. We demonstrate that such marginal gains vanish when experimental factors are changed, leading to reproducibility issues and suggesting that AL methods lack robustness. We also observe that with a properly tuned model, which employs recently proposed regularization techniques, the performance significantly improves for all AL methods including the random sampling baseline, and performance differences among the AL methods become negligible. Based on these observations, we suggest a set of experiments that are critical to assess the true effectiveness of an AL method. To facilitate these experiments we also present an open source toolkit. We believe our findings and recommendations will help advance reproducible research in robust AL using neural networks.
Asymptotic analysis of objectives based on Fisher information in active learning
Sourati, Jamshid,
Akcakaya, Murat,
Leen, Todd K,
Erdogmus, Deniz,
and Dy, Jennifer G
The Journal of Machine Learning Research
2017
Obtaining labels can be costly and time-consuming. Active learning allows a learning algorithm to intelligently query samples to be labeled for a more efficient learning. Fisher information ratio (FIR) has been used as an objective for selecting queries. However, little is known about the theory behind the use of FIR for active learning. There is a gap between the underlying theory and the motivation of its usage in practice. In this paper, we attempt to fill this gap and provide a rigorous framework for analyzing existing FIR-based active learning methods. In particular, we show that FIR can be asymptotically viewed as an upper bound of the expected variance of the log-likelihood ratio. Additionally, our analysis suggests a unifying framework that not only enables us to make theoretical comparisons among the existing querying methods based on FIR, but also allows us to give insight into the development of new active learning approaches based on this objective.
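As a rough illustration of the quantity discussed in this abstract, the sketch below computes a Fisher information ratio in the form commonly written as tr(I_q^{-1} I_p), where I_q and I_p are the Fisher information matrices under a candidate query batch and under the pool. It is not the analysis or the algorithm from the paper: the two-parameter binary logistic model, the function names, and the synthetic pool are all assumptions made for the example, and the 2x2 case is chosen only so the inverse can be written in closed form.

```python
import math
import random

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def fisher_information(points, w):
    """Average 2x2 Fisher information of binary logistic regression over `points`.

    For p = sigmoid(w . x), one sample contributes p * (1 - p) * x x^T.
    Returns the entries (a, b, d) of the symmetric matrix [[a, b], [b, d]].
    """
    a = b = d = 0.0
    for x0, x1 in points:
        p = sigmoid(w[0] * x0 + w[1] * x1)
        s = p * (1.0 - p)
        a += s * x0 * x0
        b += s * x0 * x1
        d += s * x1 * x1
    n = len(points)
    return a / n, b / n, d / n

def fisher_information_ratio(query, pool, w):
    """tr(I_q^{-1} I_p) for symmetric 2x2 matrices, using the closed-form inverse."""
    qa, qb, qd = fisher_information(query, w)
    pa, pb, pd = fisher_information(pool, w)
    det = qa * qd - qb * qb
    if det <= 1e-12:
        raise ValueError("query Fisher information is (numerically) singular")
    # I_q^{-1} = [[qd, -qb], [-qb, qa]] / det, so the trace of the product is:
    return (qd * pa - 2.0 * qb * pb + qa * pd) / det

# Hypothetical setup: a synthetic pool and an assumed current parameter estimate.
rng = random.Random(0)
pool = [(rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0)) for _ in range(200)]
w = (0.5, -1.0)

# Score a few random candidate batches and keep the one with the smallest ratio.
candidates = [rng.sample(pool, 10) for _ in range(5)]
scores = [fisher_information_ratio(batch, pool, w) for batch in candidates]
best_batch = candidates[scores.index(min(scores))]
```

In this form, smaller values are typically preferred when choosing a batch, and scoring the pool against itself gives the parameter dimension (2 here), since tr(I^{-1} I) is the trace of the identity.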
A probabilistic active learning algorithm based on Fisher information ratio
Sourati, Jamshid,
Akcakaya, Murat,
Erdogmus, Deniz,
Leen, Todd K,
and Dy, Jennifer G
IEEE Transactions on Pattern Analysis and Machine Intelligence
2017
The task of labeling samples is demanding and expensive. Active learning aims to generate the smallest possible training data set that results in a classifier with high performance in the test phase. It usually consists of two steps of selecting a set of queries and requesting their labels. Among the suggested objectives to score the query sets, information theoretic measures have become very popular. Yet among them, those based on Fisher information (FI) have the advantage of considering the diversity among the queries and tractable computations. In this work, we provide a practical algorithm based on Fisher information ratio to obtain query distribution for a general framework where, in contrast to the previous FI-based querying methods, we make no assumptions over the test distribution. The empirical results on synthetic and real-world data sets indicate that this algorithm gives competitive results.
Classification active learning based on mutual information
Sourati, Jamshid,
Akcakaya, Murat,
Dy, Jennifer G,
Leen, Todd K,
and Erdogmus, Deniz
Entropy
2016
Selecting a subset of samples to label from a large pool of unlabeled data points, such that a sufficiently accurate classifier is obtained using a reasonably small training set is a challenging, yet critical problem. Challenging, since solving this problem includes cumbersome combinatorial computations, and critical, due to the fact that labeling is an expensive and time-consuming task, hence we always aim to minimize the number of required labels. While information theoretical objectives, such as mutual information (MI) between the labels, have been successfully used in sequential querying, it is not straightforward to generalize these objectives to batch mode. This is because evaluation and optimization of functions which are trivial in individual querying settings become intractable for many objectives when we are to select multiple queries. In this paper, we develop a framework, where we propose efficient ways of evaluating and maximizing the MI between labels as an objective for batch mode active learning. Our proposed framework efficiently reduces the computational complexity from an order proportional to the batch size, when no approximation is applied, to the linear cost. The performance of this framework is evaluated using data sets from several fields showing that the proposed framework leads to efficient active learning for most of the data sets.
Applied Active Learning
Intelligent labeling based on Fisher information for medical image segmentation using deep learning
Sourati, Jamshid,
Gholipour, Ali,
Dy, Jennifer G,
Tomas-Fernandez, Xavier,
Kurugol, Sila,
and Warfield, Simon K
IEEE Transactions on Medical Imaging
2019
Deep convolutional neural networks (CNN) have recently achieved superior performance at the task of medical image segmentation compared to classic models. However, training a generalizable CNN requires a large amount of training data, which is difficult, expensive, and time-consuming to obtain in medical settings. Active Learning (AL) algorithms can facilitate training CNN models by proposing a small number of the most informative data samples to be annotated to achieve a rapid increase in performance. We proposed a new active learning method based on Fisher information (FI) for CNNs for the first time. Using efficient backpropagation methods for computing gradients together with a novel low-dimensional approximation of FI enabled us to compute FI for CNNs with a large number of parameters. We evaluated the proposed method for brain extraction with a patch-wise segmentation CNN model in two different learning scenarios: universal active learning and active semi-automatic segmentation. In both scenarios, an initial model was obtained using labeled training subjects of a source data set and the goal was to annotate a small subset of new samples to build a model that performs well on the target subject(s). The target data sets included images that differed from the source data by either age group (e.g. newborns with different image contrast) or underlying pathology that was not available in the source data. In comparison to several recently proposed AL methods and brain extraction baselines, the results showed that FI-based AL outperformed the competing methods in improving the performance of the model after labeling a very small portion of target data set (< 0.25%).
Active deep learning with Fisher information for patch-wise semantic segmentation
Sourati, Jamshid,
Gholipour, Ali,
Dy, Jennifer G,
Kurugol, Sila,
and Warfield, Simon K
2018
Deep learning with convolutional neural networks (CNN) has achieved unprecedented success in segmentation; however, it requires large training data, which is expensive to obtain. Active Learning (AL) frameworks can facilitate major improvements in CNN performance with intelligent selection of minimal data to be labeled. This paper proposes a novel diversified AL based on Fisher information (FI) for the first time for CNNs, where gradient computations from backpropagation are used for efficient computation of FI on the large CNN parameter space. We evaluated the proposed method in the context of the newborn and adolescent brain extraction problem under two scenarios: (1) semi-automatic segmentation of a particular subject from a different age group or with a pathology not available in the original training data, where starting from an inaccurate pre-trained model, we iteratively label a small number of voxels queried by AL until the model generates an accurate segmentation for that subject, and (2) using AL to build a universal model generalizable to all images in a given data set. In both scenarios, FI-based AL improved performance after labeling a small percentage (less than 0.05%) of voxels. The results showed that FI-based AL significantly outperformed random sampling, and achieved accuracy higher than entropy-based querying in transfer learning, where the model learns to extract brains of newborn subjects given an initial model trained on adolescents.
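For readers unfamiliar with the two baselines named in this abstract, here is a minimal sketch of entropy-based querying next to random sampling. It is a generic illustration, not the paper's code: the function names and the toy class-probability vectors are assumptions, and a real pipeline would obtain the probabilities from the current model's predictions on the unlabeled pool.

```python
import math
import random

def entropy(probs):
    """Shannon entropy (in nats) of one predicted class distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)

def entropy_query(pool_probs, k):
    """Indices of the k pool samples whose predictions have the highest entropy."""
    ranked = sorted(range(len(pool_probs)),
                    key=lambda i: entropy(pool_probs[i]),
                    reverse=True)
    return ranked[:k]

def random_query(pool_size, k, seed=0):
    """Random-sampling baseline: k distinct indices drawn uniformly from the pool."""
    return random.Random(seed).sample(range(pool_size), k)

# Toy predicted probabilities for four unlabeled samples (hypothetical values).
pool_probs = [[0.98, 0.02], [0.55, 0.45], [0.80, 0.20], [0.50, 0.50]]

print(entropy_query(pool_probs, 2))       # [3, 1]: the two most uncertain samples
print(random_query(len(pool_probs), 2))   # two indices chosen without regard to the model
```

Entropy-based querying scores each sample independently, so a batch chosen this way can contain near-duplicate samples; the abstract's emphasis on a diversified criterion is aimed at that limitation.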
Unsupervised Learning
Accelerated learning-based interactive image segmentation using pairwise constraints
Sourati, Jamshid,
Erdogmus, Deniz,
Dy, Jennifer G,
and Brooks, Dana H
IEEE Transactions on Image Processing
2014
Algorithms for fully automatic segmentation of images are often not sufficiently generic with suitable accuracy, and fully manual segmentation is not practical in many settings. There is a need for semiautomatic algorithms, which are capable of interacting with the user and taking into account the collected feedback. Typically, such methods have simply incorporated user feedback directly. Here, we employ active learning of optimal queries to guide user interaction. Our work in this paper is based on constrained spectral clustering that iteratively incorporates user feedback by propagating it through the calculated affinities. The original framework does not scale well to large data sets, and hence is not straightforward to apply to interactive image segmentation. In order to address this issue, we adopt advanced numerical methods for eigen-decomposition implemented over a subsampling scheme. Our key innovation, however, is an active learning strategy that chooses pairwise queries to present to the user in order to increase the rate of learning from the feedback. Performance evaluation is carried out on the Berkeley segmentation and Graz-02 image data sets, confirming that convergence to high accuracy levels is realizable in relatively few iterations.
AI-assisted Knowledge Discovery
Data on how science is made can make science better
Sourati, Jamshid,
Belikov, Alexander,
and Evans, James
Harvard Data Science Review
2022
Science is an engine of innovation and economic growth and a pathway to prosperity for countries around the world. The increasing availability of scientific publications today poses a data-driven opportunity to better understand and improve science. Scientific publications contain data on the content of published research and metadata on the context that gave rise to that research. Here we discuss and demonstrate the power of constructing, archiving, and analyzing links between scientific data and metadata to construct massive computational observatories of and for modern science. We show how these can be constructed using modern graph databases, and suggest some methods of analysis with potential to unleash sustained value for science and society. These scientific observatories would allow us to diagnose the health of the scientific workforce and institutions, and track the rate of scientific advance. They could enable us to better guide science policy and build portfolios of supported research that balance our societal commitments to diverse participation and prosperity. Moreover, they could enable scientists to surf the deluge of published research to open the scientific frontier in directions that do not follow the current, but open up new views and opportunities for others to follow. Linked scientific data can also enable the construction of artificial intelligence agents designed to complement the disciplinary focus of human scientific attention by proposing possibilities overlooked or underfunded by contemporary scientific institutions. Finally, we argue for the importance of ongoing political and legal support for the promotion of open, linked data to facilitate widespread benefit.
Systems and methods for high-order modeling of predictive hypotheses
Evans, James,
Shi, Feng,
and Sourati, Jamshid
2022
Embodiments disclosed herein receive a corpus of documents associated with a predictive hypothesis. The embodiments may generate a hypergraph comprising a plurality of nodes, the plurality of nodes including content nodes representing content elements from the documents and context nodes representing context elements of the documents, and hyperedges representing each document spanning two or more of the plurality of nodes. This hypergraph may be used to store a predictive hypothesis including a subset of the content elements, each content element of the subset of content elements having a vector representation meeting a predictive hypothesis threshold.
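The data structure described in this abstract can be pictured with a small sketch: content and context elements become nodes, and each document becomes a hyperedge over the nodes it mentions. This is only an illustration under assumed names; the `content` and `context` fields, the toy documents, and the `build_hypergraph` helper are hypothetical, and the vector representations and hypothesis threshold mentioned in the abstract are not modeled.

```python
from collections import defaultdict

def build_hypergraph(documents):
    """Build a document hypergraph.

    `documents` maps a document id to {"content": [...], "context": [...]}.
    These field names are hypothetical and chosen only for this sketch.
    """
    node_kind = {}                # node -> "content" or "context"
    hyperedges = {}               # document id -> frozenset of nodes it spans
    incidence = defaultdict(set)  # node -> ids of documents containing it
    for doc_id, doc in documents.items():
        members = {("content", c) for c in doc.get("content", [])}
        members |= {("context", c) for c in doc.get("context", [])}
        if len(members) < 2:
            continue  # a hyperedge spans two or more nodes
        hyperedges[doc_id] = frozenset(members)
        for node in members:
            node_kind[node] = node[0]
            incidence[node].add(doc_id)
    return node_kind, hyperedges, incidence

# Toy corpus (made-up identifiers, for illustration only).
docs = {
    "paper-1": {"content": ["graphene", "conductivity"], "context": ["author:A"]},
    "paper-2": {"content": ["graphene"], "context": ["author:A", "author:B"]},
    "paper-3": {"content": ["catalysis"], "context": []},
}
node_kind, hyperedges, incidence = build_hypergraph(docs)

print(sorted(incidence[("content", "graphene")]))  # ['paper-1', 'paper-2']
print(len(hyperedges))                             # 2 (paper-3 has a single node)
```

Keeping an incidence map alongside the hyperedges makes it cheap to ask which documents link a content element to a context element, which is the kind of lookup a hypergraph over documents is meant to support.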
Accelerating science with human versus alien artificial intelligences
Sourati, Jamshid,
and Evans, James
arXiv e-prints
2021
Data-driven artificial intelligence models fed with published scientific findings have been used to create powerful prediction engines for scientific and technological advance, such as the discovery of novel materials with desired properties and the targeted invention of new therapies and vaccines. These AI approaches typically ignore the distribution of human prediction engines – scientists and inventors – who continuously alter the landscape of discovery and invention. As a result, AI hypotheses are designed to substitute for human experts, failing to complement them for punctuated collective advance. Here we show that incorporating the distribution of human expertise into self-supervised models by training on inferences cognitively available to experts dramatically improves AI prediction of future human discoveries and inventions. Including expert-awareness into models that propose (a) valuable energy-relevant materials increases the precision of materials predictions by 100%, (b) repurposing thousands of drugs to treat new diseases increases precision by 43%, and (c) COVID-19 vaccine candidates examined in clinical trials by 260%. These models succeed by predicting human predictions and the scientists who will make them. By tuning AI to avoid the crowd, however, it generates scientifically promising "alien" hypotheses unlikely to be imagined or pursued without intervention, not only accelerating but punctuating scientific advance. By identifying and correcting for collective human bias, these models also suggest opportunities to improve human prediction by reformulating science education for discovery.