The 1st Workshop on Divergences and Divergence Learning (WDDL2013)

Date: June 20, 2013 (1-day workshop @ ICML 2013)

Location: Room Marquis 105, Atlanta Marriott Marquis, Atlanta, Georgia

Accepted Papers

Keynote Speakers

Joydeep Ghosh

Brian Kulis

Francisco Escolano Ruiz

Fei Sha

Kilian Weinberger

Eric Xing

Program Schedule

Eric Xing, Keynote speech, 8:30–9:15

Joydeep Ghosh, Keynote speech, 9:15–10:00

Coffee break 10:00–10:30

Jianing Shi, Wotao Yin, Stanley J. Osher, Linearized Bregman for l1-regularized Logistic Regression, 10:30–10:55

Oluwasanmi Koyejo, Joydeep Ghosh, A Representation Approach for Relative Entropy Minimization with Expectation Constraints, 10:55–11:20

Brian Kulis, Keynote speech, 11:20–12:05

Lunch, 12:10–2:00

Fei Sha, Keynote speech, 2:00–2:45

Kilian Weinberger, Keynote speech, 2:45–3:30

Coffee break 3:30–4:00

Madalina Fiterau, Artur Dubrawski, An Application of Divergence Estimation to Projection Retrieval for Semi-supervised Classification and Clustering, 4:00–4:25

Poster Session and Discussion 4:25–6:30

Titles and Abstracts of Invited Speeches

Learning Graphical Models with Constrained Divergence, Eric Xing

Graphical models (GMs) offer a powerful language for elegantly defining expressive distributions, and a generic computational framework for reasoning under uncertainty in a wide range of problems. Popular paradigms for training GMs include maximum likelihood estimation and, more recently, parametric and nonparametric Bayesian inference, max-margin learning, spectral learning, kernel embedding, and others, each of which has its own advantages and weaknesses. For example, Bayesian nonparametrics allows flexible model selection and can model hidden variables, but does not particularly excel in supervised settings where predictive margins can be explicitly exploited to bias the model in a data-driven fashion, along with other techniques such as the kernel trick; paradigms such as max-margin learning offer complementary strengths. A natural question, therefore, is: can we arrive at a new paradigm that conjoins the merits of all the paradigms we have used so far? In this talk, I present a general framework called regularized posterior inference, which builds on a divergence-based objective over the desired posterior model and admits direct additional regularization of the model, such as the regularization induced by max-margin supervision. This approach combines and extends the merits of the learning approaches mentioned above and exhibits strong theoretical and empirical advantages. I will discuss a number of theoretical properties of this approach, and show its applications to learning a wide range of GMs, including fully supervised structured input-output models, max-margin structured input-output models with hidden variables, max-margin LDA-style models for jointly discovering “discriminative” latent topics and predictive tasks, and infinite SVMs.
Our empirical results strongly suggest that, for any GM with structured or unstructured labels, regularized posterior inference leads to a more accurate predictive GM than one trained under MLE, Bayesian inference, or max-margin learning alone.

Joint work with Jun Zhu.

Learning Bregman Divergences for Prediction with Generalized Linear Models, Joydeep Ghosh

Consider a prediction problem where the dependent variable is obtained through a generalized linear model with an unknown link function. In this situation the optimal solution involves determining both the unknown parameters and the link function, with the latter problem related to determining the best divergence to use. This talk shows that the solution can indeed be found in a tractable manner, for both batch and online settings, with guaranteed solution quality and rates of convergence under fairly general conditions. This resolves the open problem of determining the most suitable Bregman divergence for predictive modeling involving GLMs.

Domain Adaptation and Structured Divergence Learning, Brian Kulis

This will be a talk in two parts. The first part will overview recent work on extending Mahalanobis metric learning to the domain adaptation problem. In this setting, one seeks a linear transformation that maps data from one domain (the source) to another domain (the target); as an example, the source domain may contain high-resolution images taken from a digital camera while the target domain may contain images taken with a low-resolution robot sensor. I will also discuss a recent formulation where one seeks to simultaneously cluster the source data into several sub-domains as well as to learn the mappings from each sub-domain to the target. This kind of problem can be cast in terms of a Bayesian network, and represents a structured transformation learning problem. Such problems lead into the second part of the talk, which is concerned with how Bregman divergences play a role in optimization for graphical models. The focus will be on small-variance asymptotics, which generalize the connection between the EM algorithm and k-means to a rich class of probabilistic models. Though still speculative at this point, the talk will conclude with some thoughts about connecting small-variance asymptotics to structured learning with Bregman divergences.

Non-linear Large Margin Nearest Neighbors Metric Learning, Kilian Quirin Weinberger

I provide an overview of our work on Large Margin Nearest Neighbor (LMNN) classification, a metric learning algorithm motivated by k-nearest neighbor classification, and several of its more recent variants. I also introduce two recent adaptations of LMNN: χ2-LMNN and GB-LMNN, which are explicitly designed to be non-linear and easy to use. The two approaches achieve this goal in fundamentally different ways: χ2-LMNN inherits the computational benefits of a linear mapping from linear metric learning, but uses a non-linear χ2-distance to explicitly capture similarities within histogram data sets; GB-LMNN applies gradient boosting to learn non-linear mappings directly in function space and takes advantage of this approach’s robustness, speed, parallelizability, and insensitivity to its single additional hyper-parameter.
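
The χ2-distance mentioned in this abstract is the standard symmetric measure for comparing histograms. As a minimal illustrative sketch (this is only the generic χ2-distance, not the χ2-LMNN objective itself, and the function name is ours):

```python
import numpy as np

def chi2_distance(p, q, eps=1e-12):
    """Symmetric chi-squared histogram distance:
    chi2(p, q) = 0.5 * sum_i (p_i - q_i)^2 / (p_i + q_i).
    eps guards against division by zero in empty bins."""
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    return 0.5 * np.sum((p - q) ** 2 / (p + q + eps))

p = np.array([0.1, 0.6, 0.3])
q = np.array([0.2, 0.5, 0.3])
print(chi2_distance(p, q))
```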

Probabilistic Models of Learning Latent Similarity, Fei Sha

Inferring similarity among data instances is essential to many learning problems. So far, metric learning has been the dominant paradigm. However, similarity is a richer and broader notion than what metrics entail. In this talk, I will describe Similarity Component Analysis (SCA), a new approach that overcomes this limitation of metric learning algorithms. SCA is a probabilistic graphical model that discovers latent similarity structures. For a pair of data instances, SCA not only determines whether or not they are similar but also reveals why they are similar (or dissimilar). Empirical studies on benchmark tasks of multiway classification and link prediction show that SCA outperforms state-of-the-art metric learning algorithms.


Workshop Overview

In all applications that involve measuring the dissimilarity between two objects (numbers, vectors, matrices, functions, images, and so on), a divergence or distance must be defined. Many popular divergences exist: the squared loss has been used widely in regression analysis; the Kullback-Leibler divergence has been applied to compare probability density functions; the Mahalanobis distance has been used to measure the dissimilarity between two random vectors drawn from the same distribution; the Itakura-Saito distance has been used to compare positive numbers; and the Frobenius distance has been used to measure the dissimilarity between matrices. Note that all of these divergences belong to the Bregman divergence class.
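
Each member of this class has the form D_f(x, y) = f(x) − f(y) − ⟨∇f(y), x − y⟩ for some strictly convex generating function f. As a minimal sketch (the helper names here are our own, chosen for illustration), the whole family can be implemented generically and the divergences above recovered by swapping in the appropriate f:

```python
import numpy as np

def bregman(f, grad_f, x, y):
    """Bregman divergence D_f(x, y) = f(x) - f(y) - <grad f(y), x - y>."""
    x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
    return f(x) - f(y) - np.dot(grad_f(y), x - y)

# f(x) = ||x||^2 recovers the squared Euclidean distance ||x - y||^2
sq = lambda x: np.dot(x, x)
sq_grad = lambda x: 2.0 * x

# f(p) = sum_i p_i log p_i (negative entropy) recovers KL(x || y)
# for probability vectors
negent = lambda p: np.sum(p * np.log(p))
negent_grad = lambda p: np.log(p) + 1.0

# f(x) = -sum_i log x_i (Burg entropy) recovers the Itakura-Saito distance
burg = lambda x: -np.sum(np.log(x))
burg_grad = lambda x: -1.0 / x

x = np.array([0.2, 0.5, 0.3])
y = np.array([0.4, 0.4, 0.2])
print(bregman(sq, sq_grad, x, y))          # equals sum((x - y)**2)
print(bregman(negent, negent_grad, x, y))  # equals sum(x * log(x / y))
print(bregman(burg, burg_grad, x, y))      # Itakura-Saito distance
```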

To choose a suitable divergence for a specific application, researchers have proposed metric learning methods that learn a Mahalanobis distance. Though more adaptive than a fixed Mahalanobis distance, this approach is restrictive because it requires a linear embedding of the data. By comparison, Bregman divergence learning is much broader. Unfortunately, Bregman divergences are not sufficiently robust to noisy data and outliers. Furthermore, the l1-norm Bregman divergence center of a group of objects is always their average, and the average is known to be sensitive to noise and outliers. To overcome this lack of robustness, researchers proposed the total Bregman divergence (tBD). tBD is robust to noisy data and outliers; moreover, its l1-norm center has a closed form and is less sensitive to noise and outliers. However, tBD learning remains challenging due to its complicated formulation.
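
For concreteness, the total Bregman divergence of Liu et al. (2010) divides the ordinary Bregman divergence by sqrt(1 + ||∇f(y)||²), which makes the measure invariant to rotations of the coordinate system. A minimal sketch under that definition (the function names are ours):

```python
import numpy as np

def total_bregman(f, grad_f, x, y):
    """Total Bregman divergence: the ordinary Bregman divergence
    D_f(x, y) = f(x) - f(y) - <grad f(y), x - y>, normalized by
    sqrt(1 + ||grad f(y)||^2)."""
    x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
    g = grad_f(y)
    d = f(x) - f(y) - np.dot(g, x - y)
    return d / np.sqrt(1.0 + np.dot(g, g))

# Total squared loss: f(x) = ||x||^2, grad f(x) = 2x
f = lambda x: np.dot(x, x)
grad_f = lambda x: 2.0 * x

x = np.array([1.0, 2.0])
y = np.array([0.5, 1.5])
print(total_bregman(f, grad_f, x, y))  # ||x - y||^2 / sqrt(1 + 4||y||^2)
```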

In this workshop, we will explore methods for learning adaptive and robust divergences. Specifically, participants will:

  1. Understand the motivation and benefits of divergence learning through (invited) talks about state-of-the-art research and applications.

  2. Discuss whether divergence learning is necessary or not through panel discussions.

  3. Explore the pros and cons of various divergence learning techniques through open discussions.


References

  1. A. Banerjee, S. Merugu, I.S. Dhillon, and J. Ghosh, “Clustering with Bregman divergences,” J. Mach. Learn. Res., vol. 6, pp. 1705–1749, 2005.

  2. L. Wu, R. Jin, S.C.H. Hoi, J. Zhu, and N. Yu, “Learning Bregman Distance Functions and Its Application for Semi-Supervised Clustering,” Adv. in Neural Info. Process. Syst., vol. 22, 2009.

  3. B.A. Frigyik, S. Srivastava, and M.R. Gupta, “Functional Bregman Divergence and Bayesian Estimation of Distributions,” IEEE Trans. Information Theory, vol. 54, pp. 5130–5139, 2008.

  4. E. Xing, A. Ng, M. Jordan, and S. Russell, “Distance metric learning, with application to clustering with side-information,” Adv. in Neural Info. Process. Syst., 2002.

  5. A. Bar-Hillel, T. Hertz, N. Shental, A. Weinshall, “Learning a Mahalanobis metric from equivalence constraints”. J. Mach. Learn. Res., 2005.

  6. J. Davis, B. Kulis, P. Jain, S. Sra, and I.S. Dhillon, “Information-theoretic metric learning,” Int. Conf. Machine Learn., 2007.

  7. M. Liu, B.C. Vemuri, S.-I. Amari, and F. Nielsen, “Total Bregman divergence and its applications to shape retrieval,” IEEE Conf. Comp. Vis. Pattern Recogn., pp. 3463–3468, 2010.

Call for Papers

Important Dates

  1. First Call for Papers: February 20, 2013

  2. Submission Deadline (NEW): April 6, 2013 (Midnight EST)

  3. Notification of Acceptance: April 15, 2013

  4. Camera Ready Version: April 30, 2013

  5. Workshop Day: June 20, 2013

Call for Paper Submissions

We seek full paper submissions developing new, or applying existing, divergence and divergence learning methods to computer vision, machine learning, medical imaging, and other applications. Topics of interest include, but are not limited to:

  1. Exploring new classes of divergences;

  2. Introducing new methods of learning divergences;

  3. Integrating existing divergences or divergence learning methods into fundamental applications;

  4. Improving current divergence learning methods;


Author Guidelines

Full papers submitted to the workshop should be at most eight pages long, following the ICML paper format, with the author names and addresses included on the submission. Extended abstracts should be 1–2 pages long in the same format. Full-paper submissions may be accepted for either an oral or a poster presentation; extended abstracts may be accepted for poster presentation only.

The submissions should be emailed to

The submission of papers and the management of the paper reviewing process will be entirely electronic. The formatting and page limit requirements are the same as for the main conference and can be found on the conference web site.

Electronic submission: submitted papers can be up to eight pages long, not including references, and up to nine pages when references are included. Any paper exceeding this length will automatically be rejected. Authors may optionally submit a supplementary file containing further details of their work; it is up to the reviewers whether they wish to consult this additional material.

All submissions must be anonymized and must closely follow the formatting guidelines in the templates; otherwise they will automatically be rejected.

Main Organizers

  1. Meizhu Liu, Siemens Corporation, Corporate Technology.

  2. Rong Jin, Computer Science and Engineering at Michigan State University.

  3. Chunhua Shen, School of Computer Science at the University of Adelaide, Australia.

  4. Jieping Ye, Computer Science and Engineering Department at Arizona State University.

  5. Zhi-Hua Zhou, LAMDA Group, National Key Lab for Novel Software Technology at Nanjing University, China.


Meizhu Liu

Chunhua Shen