Research Interests

My research interests focus on the geometry of statistical models and its applications, in particular to stochastic optimization and machine learning. More specifically, my work develops along two related lines of research: a more theoretical one, aimed at novel advances in Information Geometry, and a more applied one, oriented to the design of novel and efficient optimization algorithms, especially in high dimensions.

Theory of Information Geometry


On the more theoretical side, my research is centered on the different non-Euclidean geometries that play an important role in the study of statistical manifolds. Besides the Riemannian geometry, characterized by the use of the Fisher information as a metric tensor, at least two other geometries are of great interest: the exponential and the mixture geometries (Amari and Nagaoka, 2000).
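
For reference, and in the notation assumed here, the Fisher information metric on a parametric model $\{p_\theta\}$ is the standard bilinear form on the scores,

$$ g_{ij}(\theta) \;=\; \mathbb{E}_{p_\theta}\!\big[\,\partial_i \log p_\theta \;\partial_j \log p_\theta\,\big], $$

while the exponential and mixture geometries arise from the pair of dual affine connections associated with the exponential and mixture representations of curves of densities.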

Closure of an Exponential Family
The exponential geometry of a statistical model is associated with the representation of probability distributions in exponential form, and hence it is restricted to probabilities with full support. In the exponential geometry, the topological closure of a model can be described through the limits of sequences of probability distributions in an exponential family. On the other hand, the mixture geometry, which is the geometry of the moments of a distribution, is well defined also on the boundary of a model, where the correspondence between exponential and mixture parameterizations breaks down. In a first phase, my research focused on the characterization of the closure of a discrete exponential family in terms of the supports of the facets of the marginal polytope [MP10], i.e., the domain of the moments in the mixture geometry. The problem is of great interest from the perspective of an optimizer who wants to study the convergence of a gradient flow. Indeed, there are functionals over a statistical model, such as the stochastic relaxation of a non-constant function, whose gradient flows converge to the boundary of an exponential family, where probability distributions have reduced support. To address the problem of convergence to the boundary, I proposed an approach that extends the flows to non-positive probabilities by representing discrete exponential models as ruled surfaces in the probability simplex. This provides a natural way to extend the vector fields outside the model, and thus to study convergence in open neighborhoods of the boundary of the model in the mixture geometry. The approach based on ruled surfaces, first presented in [MP15b], is expected to have considerable impact and to motivate further research, since it allows a straightforward decomposition of functionals with respect to distributions on the boundary of the model, which is of great interest in statistical inference and information theory.
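
To fix ideas, the stochastic relaxation mentioned above is, in the notation used here, the functional

$$ F(p) \;=\; \mathbb{E}_{p}[f] \;=\; \sum_{x} f(x)\,p(x), \qquad p \in \mathcal{M}, $$

for a given function $f$ over the search space. For a non-constant $f$, the infimum of $F$ over the probability simplex equals $\min_x f(x)$ and is approached only by distributions concentrating their mass on the minimizers of $f$; such distributions have reduced support and therefore lie on the boundary, outside the exponential family itself, which is why the study of the closure and of extended flows is required.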

Natural Gradient and Dual Basis
The second theoretical contribution concerns the study and characterization of the dual bases for the tangent space of a statistical manifold in the exponential and mixture geometries [MP15a,MP15b], and the interpretation of the natural gradient as the projection of the Euclidean gradient onto the tangent space of a statistical model [MMP11a]. The standard basis for the tangent space of a statistical manifold is given by the scores, which for an exponential family correspond to the centered sufficient statistics. The definition of the dual basis, obtained by a linear transformation with respect to the inverse Fisher information matrix, provides an intuitive geometrical interpretation, in terms of a change of basis, of the correspondence between the components of the natural gradient in the exponential or mixture geometry and the components of the Euclidean gradient in the corresponding dual geometry. Such a correspondence, which was previously interpreted as an application of the inverse function theorem [MMP13], is crucial for the derivation of computationally efficient formulas which do not require the inversion of the Fisher information matrix.
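
As a concrete instance, written in the notation assumed here for an exponential family $p_\theta(x) = \exp\big(\sum_i \theta^i T_i(x) - \psi(\theta)\big)$ with expectation parameters $\eta = \mathbb{E}_\theta[T]$ and Fisher information matrix $I(\theta) = \nabla^2\psi(\theta) = \partial\eta/\partial\theta$, the dually flat structure gives, for a smooth function $F$,

$$ \widetilde{\nabla}_\theta F \;=\; I(\theta)^{-1}\,\nabla_\theta F \;=\; \nabla_\eta F, $$

i.e., the components of the natural gradient in one parameterization coincide with the components of the Euclidean gradient in the dual one. In particular, for the stochastic relaxation $F(\theta) = \mathbb{E}_\theta[f]$ one has $\partial F/\partial\theta^i = \mathrm{Cov}_\theta(f, T_i)$, a quantity that can be estimated from samples without ever inverting the Fisher information matrix.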

Alpha-Hessian over Statistical Manifolds
The third contribution concerns second-order Information Geometry; in particular, I derived a formula for the Riemannian Hessian of a function defined over a statistical model, in the general case of the exponential family [MP14]. To the best of my knowledge, such a computation had not previously appeared in the literature, in particular for the specific case of the Hessian of the stochastic relaxation of a function with respect to a distribution in the exponential family, where the estimation of the Riemannian Hessian reduces to an expression based on empirical covariances. In a subsequent paper [MP15c], I generalized the definition by replacing the Levi-Civita connection with Amari's alpha-connections, thus introducing the exponential and mixture Hessians. These results are of great interest, since they motivate the construction of alternative second-order Taylor approximations of smooth functions, which are used in many second-order optimization algorithms.
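
As an illustration of how such expressions reduce to covariances (a sketch in the natural parameters, using only standard exponential-family identities, and not the full formulas of [MP14,MP15c]), the coordinate Hessian of the stochastic relaxation $F(\theta) = \mathbb{E}_\theta[f]$ reads

$$ \partial_i\partial_j F(\theta) \;=\; \mathrm{Cov}_\theta(f\,T_i, T_j) \;-\; \mathbb{E}_\theta[T_i]\,\mathrm{Cov}_\theta(f, T_j) \;-\; \mathbb{E}_\theta[f]\,\mathrm{Cov}_\theta(T_i, T_j), $$

while the Riemannian and alpha-Hessians correct this expression with a term $-\Gamma_{ij}^{\,k}\,\partial_k F$ involving the Christoffel symbols of the chosen connection (Levi-Civita, exponential, or mixture).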

Stochastic Optimization over Statistical Manifolds


The second line of research focuses on the design, analysis, and experimental evaluation of novel and efficient algorithms for the optimization of functions defined over statistical models. What characterizes my approach to algorithm design is the application of notions of differential geometry from the theory of optimization over manifolds, which make it possible to take into account the non-Euclidean geometry of the search space. The main contributions of this research are summarized below.

An Information Geometric Perspective on Estimation of Distribution Algorithms
During my doctoral studies, I investigated applications of Information Geometry to stochastic optimization, and I contributed to a better understanding of the working mechanisms of a large and heterogeneous family of optimization algorithms, called Estimation of Distribution Algorithms (EDAs) (Larranaga and Lozano, 2002), which make use of statistical models to guide the search for the optimum. In particular, I proved a previously unnoticed correspondence between the iterative update, based on fitness modelling, of a class of randomized algorithms called Distribution Estimation Using Markov Random Fields (DEUM) (Shakya and McCall, 2007) and the natural gradient descent update [MMP13].

Stochastic Algorithms for Combinatorial Optimization
My research inspired the design of three novel randomized algorithms based on stochastic optimization for combinatorial problems: Stochastic Natural Gradient Descent (SNGD) [MMP11a,MMP11b], a natural gradient algorithm based on the estimation of empirical covariances; l1-DEUM [VMM12], an algorithm in the DEUM framework where model selection is based on l1-regularized linear regression; and Function Composition Algorithms (FCA) [CCMM12,CMM12], an EDA where model selection is replaced by a convenient transformation of the variables. All algorithms have been evaluated on standard benchmarks in black-box optimization and showed results competitive with the literature.
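
To give an idea of the general scheme behind these methods, the following is a minimal sketch of a stochastic natural-gradient step for the relaxation of a pseudo-Boolean function, assuming an independence model over binary variables with sufficient statistics T_i(x) = x_i; all function and variable names are illustrative, and the sketch is not a faithful reproduction of the published algorithms.

```python
import numpy as np

# Minimal sketch of one stochastic natural-gradient step for the relaxation
# F(theta) = E_{p_theta}[f] of a pseudo-Boolean function f, assuming an
# independence exponential family over {0,1}^n with sufficient statistics
# T_i(x) = x_i.  Names are illustrative, not taken from the papers.

def natural_gradient_step(theta, f, n_samples=1000, lr=0.1, rng=None):
    rng = np.random.default_rng() if rng is None else rng
    p = 1.0 / (1.0 + np.exp(-theta))                      # marginals E[x_i]
    x = (rng.random((n_samples, theta.size)) < p).astype(float)
    fx = np.apply_along_axis(f, 1, x)

    # Euclidean gradient in the natural parameters: Cov(f, T_i)
    xc = x - x.mean(axis=0)
    grad = (fx - fx.mean()) @ xc / n_samples

    # Empirical Fisher information: Cov(T_i, T_j), regularized for stability
    fisher = xc.T @ xc / n_samples + 1e-8 * np.eye(theta.size)

    # Natural gradient: solve a linear system rather than inverting Fisher
    nat_grad = np.linalg.solve(fisher, grad)
    return theta - lr * nat_grad                          # minimization step

# Usage example: minimize f(x) = -sum(x), optimum at the all-ones vertex
theta = np.zeros(10)
for _ in range(100):
    theta = natural_gradient_step(theta, lambda x: -x.sum())
```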

Natural Gradient for Gaussian Distributions with Restricted Covariance Matrix
More recently, my interests have moved to continuous optimization, in particular to high-dimensional problems. My current research focuses on the stochastic relaxation of continuous functions using Gaussian models characterized by a restricted covariance matrix, such as sparse inverse covariance matrices and rank-k updates of a diagonal matrix. For small k, and for a small number of non-zero elements in the sparse inverse matrix, such models have a number of parameters that is linear in the dimension of the search space, and they admit efficient linear-time formulas for the natural gradient, compared to the quadratic cost of Gaussian distributions with a full covariance matrix. The natural gradient formulas for these models had not been derived before, despite the importance of such models in high dimensions due to their scalability. A paper with a description of the algorithm and the first experimental results is in preparation.
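
As an illustration of why such restricted parameterizations scale to high dimensions, the following hypothetical sketch samples from a Gaussian with rank-k covariance C = D + V V^T (D diagonal, V an n-by-k matrix) without ever forming the full n-by-n matrix; the natural gradient formulas themselves are the subject of the paper in preparation and are not reproduced here.

```python
import numpy as np

# Sketch: sampling from a Gaussian with the rank-k restricted covariance
# C = D + V V^T (D diagonal with entries d, V of size n x k), in O(n k)
# time and memory per sample, without building the full n x n matrix.
# All names are illustrative.

def sample_rank_k_gaussian(mu, d, V, n_samples, rng=None):
    rng = np.random.default_rng() if rng is None else rng
    n, k = V.shape
    z_diag = rng.standard_normal((n_samples, n))   # drives the diagonal part
    z_low = rng.standard_normal((n_samples, k))    # drives the low-rank part
    # x = mu + sqrt(D) z_diag + V z_low has covariance D + V V^T
    return mu + z_diag * np.sqrt(d) + z_low @ V.T

# Usage: n = 10000 dimensions, k = 2, far cheaper than a full covariance
x = sample_rank_k_gaussian(np.zeros(10000), np.ones(10000),
                           0.01 * np.ones((10000, 2)), n_samples=100)
```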

Online Learning with Selective Sampling


I am also interested in classification problems in machine learning, and in particular in online classification using selective sampling techniques. In online classification, the learner is provided with a sequence of observations which have to be classified online. After each prediction, the learner can access the label, so that the classifier can be updated. In the selective sampling scenario, after the prediction the learner can choose whether or not to ask for the true label, in order to minimize the number of queries while at the same time achieving good prediction accuracy. I am interested in online classification algorithms with selective sampling in the presence of multiple annotators, characterized by different costs and accuracies, where the learner has to choose not only when to ask for a label, but also which annotator to query [MCBR14].
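
For concreteness, the following sketch illustrates the single-annotator selective sampling protocol with a linear classifier and a margin-based query rule in the style of the selective-sampling perceptron; it is only meant to fix the setting described above, and it is not the multi-annotator method of [MCBR14]. All names are illustrative.

```python
import numpy as np

# Sketch of online classification with selective sampling: a linear
# classifier predicts each incoming example and queries the true label
# only with probability b / (b + |margin|), a standard margin-based rule.
# Generic single-annotator illustration, not the method of [MCBR14].

def selective_sampling_perceptron(stream, dim, b=1.0, rng=None):
    rng = np.random.default_rng() if rng is None else rng
    w = np.zeros(dim)
    queries = mistakes = 0
    for x, y in stream:                       # y in {-1, +1}
        margin = w @ x
        y_hat = 1.0 if margin >= 0 else -1.0
        mistakes += (y_hat != y)
        # ask for the label only when the prediction is uncertain (small margin)
        if rng.random() < b / (b + abs(margin)):
            queries += 1
            if y * margin <= 0:               # perceptron update on a mistake
                w = w + y * x
    return w, queries, mistakes

# Usage on a synthetic linearly separable stream
rng = np.random.default_rng(0)
X = rng.standard_normal((500, 5))
y = np.sign(X @ np.array([1.0, -2.0, 0.5, 0.0, 1.5]))
w, n_queries, n_mistakes = selective_sampling_perceptron(zip(X, y), dim=5, rng=rng)
```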

