Accepted Papers

Authors can find instructions for the presentation of their paper here.

Deep Ordinal Reinforcement Learning (17)
Alexander Zap (TU Darmstadt), Tobias Joppen (TU Darmstadt), Johannes Fürnkranz (TU Darmstadt)

Reinforcement learning usually makes use of numerical rewards, which have nice properties but also come with drawbacks and difficulties. Using rewards on an ordinal scale (ordinal rewards) is an alternative to numerical rewards that has received more attention in recent years. In this paper, a general approach to adapting reinforcement learning problems to the use of ordinal rewards is presented and motivated. We show how to convert common reinforcement learning algorithms to an ordinal variation by the example of Q-learning and introduce Ordinal Deep Q-Networks, which adapt deep reinforcement learning to ordinal rewards. Additionally, we run evaluations on problems provided by the OpenAI Gym framework, showing that our ordinal variants exhibit a performance that is comparable to the numerical variations for a number of problems. We also give first evidence that our ordinal variant is able to produce better results for problems with less engineered and simpler-to-design reward signals.
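As a rough illustration of how the ordinal view changes the bookkeeping, the sketch below replaces scalar Q-values with distributions over K ordinal reward ranks and compares actions by a Borda-style winning probability. It is a simplified reading of the abstract rather than the authors' Ordinal Q-learning or Ordinal DQN; the sizes, hyper-parameters, and the exact update rule are assumptions.

```python
import numpy as np

# Minimal, illustrative sketch (not the authors' exact algorithm): each (state, action)
# keeps a distribution over K ordinal reward ranks instead of a scalar Q-value, and
# actions are compared by a Borda-like probability of yielding a higher rank.
n_states, n_actions, K = 10, 4, 5        # K ordinal reward levels (assumed)
alpha, gamma_mix = 0.1, 0.9              # learning rate and bootstrap mixing weight
D = np.full((n_states, n_actions, K), 1.0 / K)   # rank distributions per (s, a)

def superiority_scores(dist_sa):
    """Borda-style score: how often each action's rank beats the other actions' ranks."""
    scores = np.zeros(len(dist_sa))
    for a, p in enumerate(dist_sa):
        for b, q in enumerate(dist_sa):
            if a == b:
                continue
            win = sum(p[i] * q[j] for i in range(K) for j in range(K) if i > j)
            tie = sum(p[i] * q[i] for i in range(K))
            scores[a] += win + 0.5 * tie
    return scores

def update(s, a, rank, s_next):
    """Q-learning-style update applied to distributions instead of scalars."""
    onehot = np.eye(K)[rank]
    a_next = np.argmax(superiority_scores(D[s_next]))       # greedy next action
    target = (1 - gamma_mix) * onehot + gamma_mix * D[s_next, a_next]
    D[s, a] = (1 - alpha) * D[s, a] + alpha * target

update(0, 1, rank=3, s_next=2)
print(D[0, 1])
```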
View full article (Reproducible Research)
Importance Weighted Generative Networks (21)
Maurice Diesendruck (The University of Texas at Austin), Ethan R. Elenberg (ASAPP, Inc.), Rajat Sen (Amazon, Inc.), Guy W. Cole (The University of Texas at Austin), Sanjay Shakkottai (The University of Texas at Austin), Sinead A. Williamson (The University of Texas at Austin; CognitiveScale)

While deep generative networks can simulate from complex data distributions, their utility can be hindered by limitations on the data available for training. Specifically, the training data distribution may differ from the target sampling distribution due to sample selection bias, or because the training data comes from a different but related distribution. We present methods to accommodate this difference via importance weighting, which allow us to estimate a loss function with respect to a target distribution even if we cannot access that distribution directly. These estimators, which differentially weight the contribution of data to the loss function, offer theoretical guarantees that heuristic approaches lack, while giving impressive empirical performance in a variety of settings.
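The core importance-weighting idea can be stated in a few lines. The toy sketch below reweights per-example losses by the density ratio between a target and a training distribution; unlike the paper, it assumes both densities are known 1-D Gaussians and does not address how the weights would be estimated in practice.

```python
import numpy as np
from scipy.stats import norm

# Toy sketch of the importance-weighting idea (not the paper's estimators):
# weight each example's loss by w(x) = p_target(x) / p_train(x) so that the
# empirical loss estimates the loss under the target distribution.
p_train = norm(loc=0.0, scale=1.0)       # distribution the data was drawn from
p_target = norm(loc=1.0, scale=1.0)      # distribution we actually care about

def weighted_loss(losses, x):
    w = p_target.pdf(x) / p_train.pdf(x)  # importance weights (assumed known here)
    return np.mean(w * losses)            # estimate of the target-distribution loss

x = p_train.rvs(size=1000, random_state=0)
losses = (x - 1.0) ** 2                   # any per-example loss
print(weighted_loss(losses, x))
```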
View full article (Reproducible Research)
User-Guided Clustering in Heterogeneous Information Networks via Motif-Based Comprehensive Transcription (40)
Yu Shi (University of Illinois at Urbana-Champaign), Xinwei He (University of Illinois at Urbana-Champaign), Naijing Zhang (University of Illinois at Urbana-Champaign), Carl Yang (University of Illinois at Urbana-Champaign), Jiawei Han (University of Illinois at Urbana-Champaign)

Heterogeneous information networks (HINs) with rich semantics are ubiquitous in real-world applications. For a given HIN, many reasonable clustering results with distinct semantic meaning can simultaneously exist. User-guided clustering is hence of great practical value for HINs where users provide labels to a small portion of nodes. To cater to a broad spectrum of user guidance evidenced by different expected clustering results, carefully exploiting the signals residing in the data is potentially useful. Meanwhile, as one type of complex networks, HINs often encapsulate higher-order interactions that reflect the interlocked nature among nodes and edges. Network motifs, sometimes referred to as meta-graphs, have been used as tools to capture such higher-order interactions and reveal the many different semantics. We therefore approach the problem of user-guided clustering in HINs with network motifs. In this process, we identify the utility and importance of directly modeling higher-order interactions without collapsing them to pairwise interactions. To achieve this, we comprehensively transcribe the higher-order interaction signals to a series of tensors via motifs and propose the MoCHIN model based on joint non-negative tensor factorization. This approach applies to arbitrarily many, arbitrary forms of HIN motifs. An inference algorithm with speed-up methods is also proposed to tackle the challenge that tensor size grows exponentially as the number of nodes in a motif increases. We validate the effectiveness of the proposed method on two real-world datasets and three tasks, and MoCHIN outperforms all baselines in three evaluation tasks under three different metrics. Additional experiments demonstrated the utility of motifs and the benefit of directly modeling higher-order information especially when user guidance is limited.
View full article (Reproducible Research)
Tue, 14:40 - 15:00 @ 1.011 (Poster@Wed) Ranking
A Ranking Model Motivated by Nonnegative Matrix Factorization with Applications to Tennis Tournaments (44)
Rui Xia (National University of Singapore), Vincent Y. F. Tan (National University of Singapore), Louis Filstroff (IRIT, Université de Toulouse), Cédric Févotte (IRIT, Université de Toulouse)

We propose a novel ranking model that combines the Bradley-Terry-Luce probability model with a nonnegative matrix factorization framework to model and uncover the presence of latent variables that influence the performance of top tennis players. We derive an efficient, provably convergent, and numerically stable majorization-minimization-based algorithm to maximize the likelihood of datasets under the proposed statistical model. The model is tested on datasets involving the outcomes of matches between 20 top male and female tennis players over 14 major tournaments for men (including the Grand Slams and the ATP Masters 1000) and 16 major tournaments for women over the past 10 years. Our model automatically infers that the surface of the court (e.g., clay or hard court) is a key determinant of the performances of male players, but less so for females. Top players on various surfaces over this longitudinal period are also identified in an objective manner.
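A minimal sketch of the probabilistic core, assuming the usual Bradley-Terry-Luce form with skills factorized as a nonnegative product W H; here W and H are random placeholders rather than parameters fitted by the paper's majorization-minimization algorithm.

```python
import numpy as np

# With a nonnegative skill matrix Lambda = W @ H (players x tournaments), the BTL
# probability that player i beats player j in tournament t is
# Lambda[i, t] / (Lambda[i, t] + Lambda[j, t]).
rng = np.random.default_rng(0)
n_players, n_tournaments, n_latent = 20, 14, 3
W = rng.random((n_players, n_latent))        # player loadings on latent factors
H = rng.random((n_latent, n_tournaments))    # factor activity per tournament
Lam = W @ H

def p_win(i, j, t):
    return Lam[i, t] / (Lam[i, t] + Lam[j, t])

print(p_win(0, 1, 5))   # e.g. probability that player 0 beats player 1 in tournament 5
```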
View full article (Reproducible Research)
Sample-Efficient Model-Free Reinforcement Learning with Off-Policy Critics (48)
Denis Steckelmacher (Vrije Universiteit Brussel), Hélène Plisnier (Vrije Universiteit Brussel), Diederik M. Roijers (VU Amsterdam), Ann Nowé (Vrije Universiteit Brussel)

Value-based reinforcement-learning algorithms provide state-of-the-art results in model-free discrete-action settings, and tend to outperform actor-critic algorithms. We argue that actor-critic algorithms are limited by their need for an on-policy critic. We propose Bootstrapped Dual Policy Iteration (BDPI), a novel model-free reinforcement-learning algorithm for continuous states and discrete actions, with an actor and several off-policy critics. Off-policy critics are compatible with experience replay, ensuring high sample-efficiency, without the need for off-policy corrections. The actor, by slowly imitating the average greedy policy of the critics, leads to high-quality and state-specific exploration, which we compare to Thompson sampling. Because the actor and critics are fully decoupled, BDPI is remarkably stable, and unusually robust to its hyper-parameters. BDPI is significantly more sample-efficient than Bootstrapped DQN, PPO, and ACKTR, on discrete, continuous and pixel-based tasks.
View full article (Reproducible Research)
Distributed Learning of Non-Convex Linear Models with One Round of Communication (55)
Mike Izbicki (Claremont McKenna College), Christian R. Shelton (UC Riverside)

We present the optimal weighted average (OWA) distributed learning algorithm for linear models. OWA achieves statistically optimal learning rates, uses only one round of communication, works on non-convex problems, and supports a fast cross validation procedure. The OWA algorithm first trains local models on each of the compute nodes; then a master machine merges the models using a second round of optimization. This second optimization uses only a small fraction of the data, and so has negligible computational cost. Compared with similar distributed estimators that merge locally trained models, OWA either has stronger statistical guarantees, is applicable to more models, or has a more computationally efficient merging procedure.
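A toy sketch of the two-stage structure described above, assuming plain least-squares models: each node fits a local solution, and the master then optimizes a weighted combination of those solutions on a small data subset. The actual OWA estimator and its guarantees are in the paper.

```python
import numpy as np

# Rough sketch of the merge step (details differ from the paper's OWA estimator).
rng = np.random.default_rng(1)
d, n_nodes, n_local, n_merge = 5, 4, 200, 50
w_true = rng.normal(size=d)

def local_fit(X, y):                          # ordinary least squares on one node
    return np.linalg.lstsq(X, y, rcond=None)[0]

local_solutions = []
for _ in range(n_nodes):
    X = rng.normal(size=(n_local, d))
    y = X @ w_true + 0.1 * rng.normal(size=n_local)
    local_solutions.append(local_fit(X, y))
V = np.column_stack(local_solutions)          # d x n_nodes matrix of local models

# Second-stage optimization over weighted combinations, using only a small subset.
Xm = rng.normal(size=(n_merge, d))
ym = Xm @ w_true + 0.1 * rng.normal(size=n_merge)
a = np.linalg.lstsq(Xm @ V, ym, rcond=None)[0]   # combination weights
w_merged = V @ a
print(np.linalg.norm(w_merged - w_true))
```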
View full article
DEvIANT: Discovering Significant Exceptional (Dis-)Agreement Within Groups (58)
Adnene Belfodil (INSA Lyon), Wouter Duivesteijn (Technische Universiteit Eindhoven), Marc Plantevit (Univ Lyon), Sylvie Cazalens (INSA Lyon), Philippe Lamarre (INSA Lyon)

We strive to find contexts (i.e., subgroups of entities) under which exceptional (dis-)agreement occurs among a group of individuals, in any type of data featuring individuals (e.g., parliamentarians, customers) performing observable actions (e.g., votes, ratings) on entities (e.g., legislative procedures, movies). To this end, we introduce the problem of discovering statistically significant exceptional contextual intra-group agreement patterns. To handle the sparsity inherent to voting and rating data, we use Krippendorff's Alpha measure for assessing the agreement among individuals. We devise a branch-and-bound algorithm, named DEvIANT, to discover such patterns. DEvIANT exploits both closure operators and tight optimistic estimates. We derive analytic approximations for the confidence intervals (CIs) associated with patterns for a computationally efficient significance assessment. We prove that these approximate CIs are nested along specialization of patterns. This allows pruning properties to be incorporated in DEvIANT to quickly discard non-significant patterns. An empirical study on several datasets demonstrates the efficiency and the usefulness of DEvIANT.
View full article (Reproducible Research)
A Framework for Deep Constrained Clustering - Algorithms and Advances (62)
Hongjing Zhang (University of California, Davis), Sugato Basu (Google Research), Ian Davidson (University of California, Davis)

The area of constrained clustering has been extensively explored by researchers and used by practitioners. Constrained clustering formulations exist for popular algorithms such as k-means, mixture models, and spectral clustering, but have several limitations. A fundamental strength of deep learning is its flexibility, and here we explore a deep learning framework for constrained clustering and, in particular, explore how it can extend the field of constrained clustering. We show that our framework can not only handle standard together/apart constraints (without the well-documented negative effects reported earlier) generated from labeled side information but also more complex constraints generated from new types of side information such as continuous values and high-level domain knowledge.
View full article (Reproducible Research)
Exploiting the Earth's Spherical Geometry to Geolocate Images (63)
Mike Izbicki (Claremont McKenna College), Evangelos E. Papalexakis (University of California Riverside), Vassilis J. Tsotras (University of California Riverside)

Existing methods for geolocating images use standard classification or image retrieval techniques. These methods have poor theoretical properties because they do not take advantage of the earth's spherical geometry. In some cases, they require training data sets that grow exponentially with the number of feature dimensions. This paper introduces the first image geolocation method that exploits the earth's spherical geometry. Our method is based on the Mixture of von Mises-Fisher (MvMF) distribution, which is a spherical analogue of the popular Gaussian mixture model. We prove that this method requires only a dataset of size linear in the number of feature dimensions, and empirical results show that our method outperforms previous methods with orders of magnitude less training data and computation.
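To make the spherical geometry concrete, here is a hedged sketch of a mixture of von Mises-Fisher densities on the unit sphere; the component parameters are placeholders, and the paper's training procedure and image-feature pipeline are not shown.

```python
import numpy as np

# Mixture of von Mises-Fisher (vMF) components on the 3-D unit sphere.
# Locations are (lat, lon) pairs converted to unit vectors.
def to_unit_vector(lat_deg, lon_deg):
    lat, lon = np.radians(lat_deg), np.radians(lon_deg)
    return np.array([np.cos(lat) * np.cos(lon), np.cos(lat) * np.sin(lon), np.sin(lat)])

def vmf_logpdf(x, mu, kappa):
    # log normaliser for dimension 3: log(kappa) - log(4*pi) - log(sinh(kappa)),
    # with log(sinh k) = k + log(1 - exp(-2k)) - log 2 for numerical stability.
    log_sinh = kappa + np.log1p(-np.exp(-2 * kappa)) - np.log(2.0)
    return np.log(kappa) - np.log(4 * np.pi) - log_sinh + kappa * (mu @ x)

def mvmf_loglik(x, mus, kappas, weights):
    comps = [np.log(w) + vmf_logpdf(x, m, k) for m, k, w in zip(mus, kappas, weights)]
    m = max(comps)
    return m + np.log(sum(np.exp(c - m) for c in comps))   # log-sum-exp

mus = [to_unit_vector(48.9, 2.3), to_unit_vector(40.7, -74.0)]   # placeholder centres
print(mvmf_loglik(to_unit_vector(49.0, 2.0), mus, kappas=[50.0, 50.0], weights=[0.5, 0.5]))
```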
View full article (Reproducible Research)
Unsupervised Sentence Embedding Using Document Structure-based Context (66)
Taesung Lee (IBM T.J. Watson Research Center), Youngja Park (IBM T.J. Watson Research Center)

We present a new unsupervised method for learning general-purpose sentence embeddings. Unlike existing methods which rely on local contexts, such as words inside the sentence or immediately neighboring sentences, our method selects, for each target sentence, influential sentences from the entire document based on the document structure. We identify a dependency structure of sentences using metadata and text styles. Additionally, we propose an out-of-vocabulary word handling technique for the neural network outputs to model many domain-specific terms which were mostly discarded by existing sentence embedding training methods. We empirically show that the model relies on the proposed dependencies more than the sequential dependency in many cases. We also validate our model on several NLP tasks showing 23
View full article
String Sanitization: A Combinatorial Approach (73)
Giulia Bernardini (University of Milano-Bicocca), Huiping Chen (King's College London), Alessio Conte (University of Pisa), Roberto Grossi (University of Pisa; INRIA, Lyon), Grigorios Loukides (King's College London), Nadia Pisanti (University of Pisa; INRIA, Lyon), Solon P. Pissis (INRIA, Lyon; CWI, Amsterdam), Giovanna Rosone (University of Pisa)

String data are often disseminated to support applications such as location-based service provision or DNA sequence analysis. This dissemination, however, may expose sensitive patterns that model confidential knowledge (e.g., trips to mental health clinics from a string representing a user's location history). In this paper, we consider the problem of sanitizing a string by concealing the occurrences of sensitive patterns, while maintaining data utility. First, we propose a time-optimal algorithm, TFS-ALGO, to construct the shortest string preserving the order of appearance and the frequency of all non-sensitive patterns. Such a string allows accurately performing tasks based on the sequential nature and pattern frequencies of the string. Second, we propose a time-optimal algorithm, PFS-ALGO, which preserves a partial order of appearance of non-sensitive patterns but produces a much shorter string that can be analyzed more efficiently. The strings produced by either of these algorithms may reveal the location of sensitive patterns. In response, we propose a heuristic, MCSR-ALGO, which replaces letters in these strings with carefully selected letters, so that sensitive patterns are not reinstated and occurrences of spurious patterns are prevented. We implemented our sanitization approach that applies TFS-ALGO, PFS-ALGO and then MCSR-ALGO and experimentally show that it is effective and efficient.
View full article (Reproducible Research)
A Differentially Private Kernel Two-Sample Test (85)
Anant Raj (Max Planck Institute for Intelligent Systems, Tübingen), Ho Chung Leon Law (University of Oxford), Dino Sejdinovic (University of Oxford), Mijung Park (Max Planck Institute for Intelligent Systems, Tübingen)

Kernel two-sample testing is a useful statistical tool in determining whether data samples arise from different distributions without imposing any parametric assumptions on those distributions. However, raw data samples can expose sensitive information about individuals who participate in scientific studies, which makes the current tests vulnerable to privacy breaches. Hence, we design a new framework for kernel two-sample testing conforming to differential privacy constraints, in order to guarantee the privacy of subjects in the data. Unlike existing differentially private parametric tests that simply add noise to data, kernel-based testing imposes a challenge due to a complex dependence of test statistics on the raw data, as these statistics correspond to estimators of distances between representations of probability measures in Hilbert spaces. Our approach considers finite dimensional approximations to those representations. As a result, a simple chi-squared test is obtained, where a test statistic depends on a mean and covariance of empirical differences between the samples, which we perturb for a privacy guarantee. We investigate the utility of our framework in two realistic settings and conclude that our method requires only a relatively modest increase in sample size to achieve a similar level of power to the non-private tests in both settings.
View full article (Reproducible Research)
Linearly Constrained Weights: Reducing Activation Shift for Faster Training of Neural Networks (98)
Takuro Kutsuna (Toyota Central R&D Labs. Inc.)

In this paper, we first identify activation shift, a simple but remarkable phenomenon in a neural network in which the preactivation value of a neuron has non-zero mean that depends on the angle between the weight vector of the neuron and the mean of the activation vector in the previous layer. We then propose linearly constrained weights (LCW) to reduce the activation shift in both fully connected and convolutional layers. The impact of reducing the activation shift in a neural network is studied from the perspective of how the variance of variables in the network changes through layer operations in both forward and backward chains. We also discuss its relationship to the vanishing gradient problem. Experimental results show that LCW enables a deep feedforward network with sigmoid activation functions to be trained efficiently by resolving the vanishing gradient problem. Moreover, combined with batch normalization, LCW improves generalization performance of both feedforward and convolutional networks.
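The abstract does not spell out the constraint, so the sketch below assumes one natural choice: each neuron's incoming weight vector is kept zero-sum, which removes the dependence of the pre-activation mean on a constant shift of the previous layer's activations. Whether this matches the paper's exact linear constraint should be checked against the full text.

```python
import numpy as np

# Assumed illustration only: enforce a zero-sum constraint on every neuron's
# incoming weights by projecting after each update.
def project_to_zero_sum(W):
    """Remove the per-row mean so that every weight vector sums to zero."""
    return W - W.mean(axis=1, keepdims=True)

rng = np.random.default_rng(0)
W = project_to_zero_sum(rng.normal(size=(4, 6)))   # 4 neurons, 6 inputs
a_prev = np.full(6, 0.7)                            # constant previous-layer activation
print(W.sum(axis=1))    # ~0 for every neuron
print(W @ a_prev)       # exactly zero: the constant shift no longer reaches the pre-activations
```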
View full article
Joint Multi-Source Reduction (102)
Lei Zhang (Institute of Information Engineering, Chinese Academy of Sciences), Shupeng Wang (Institute of Information Engineering, Chinese Academy of Sciences), Xin Jin (National Computer Network Emergency Response Technical Team/Coordination Center of China), Siyu Jia (Institute of Information Engineering, Chinese Academy of Sciences)

The redundant sources problem in multi-source learning always exists in various real-world applications such as multimedia analysis, information retrieval, and medical diagnosis, in which the heterogeneous representations from different sources always have three-way redundancies. More seriously, the redundancies will cost a lot of storage space, cause high computational time, and degrade the performance of the learner. This paper is an attempt to jointly reduce redundant sources. Specifically, a novel Heterogeneous Manifold Smoothness Learning (HMSL) model is proposed to linearly map multi-source data to a low-dimensional feature-isomorphic space, in which the information-correlated representations are close along the manifold while the semantic-complementary instances are close in Euclidean distance. Furthermore, to eliminate three-way redundancies, we present a new Correlation-based Multi-source Redundancy Reduction (CMRR) method with ℓ2,1-norm equation and generalized elementary transformation constraints to reduce redundant sources in the learned feature-isomorphic space. Comprehensive empirical investigations are presented that confirm the promise of our proposed framework.
View full article
Scalable Large Margin Gaussian Process Classification (116)
Martin Wistuba (IBM Research), Ambrish Rawat (IBM Research)

We introduce a new Large Margin Gaussian Process (LMGP) model by formulating a pseudo-likelihood for a generalised multi-class hinge loss. We derive a highly scalable training objective for the proposed model using variational inference and inducing point approximation. Additionally, we consider the joint learning of LMGP-DNN, which combines the proposed model with traditional Deep Learning methods to enable learning for unstructured data. We demonstrate the effectiveness of the Large Margin GP with respect to both training time and accuracy in an extensive classification experiment consisting of 68 structured and two unstructured data sets. Finally, we highlight the key capability and usefulness of our model in yielding prediction uncertainty for classification by demonstrating its effectiveness in the tasks of large-scale active learning and detection of adversarial images.
View full article
Adjustment Criteria for Recovering Causal Effects from Missing Data (122)
Mojdeh Saadati (Iowa State University), Jin Tian (Iowa State University)

Confounding bias, missing data, and selection bias are three common obstacles to valid causal inference in the data sciences. Covariate adjustment is the most pervasive technique for recovering causal effects from confounding bias. In this paper we introduce a covariate adjustment formulation for controlling confounding bias in the presence of missing-not-at-random data and develop a necessary and sufficient condition for recovering causal effects using the adjustment. We also introduce an adjustment formulation for controlling both confounding and selection biases in the presence of missing data and develop a necessary and sufficient condition for valid adjustment. Furthermore, we present an algorithm that lists all valid adjustment sets and an algorithm that finds a valid adjustment set containing the minimum number of variables, which are useful for researchers interested in selecting adjustment sets with desired properties.
View full article
Copy Mechanism and Tailored Training for Character-based Data-to-text Generation (145)
Marco Roberti (University of Turin), Giovanni Bonetta (University of Turin), Rossella Cancelliere (University of Turin), Patrick Gallinari (Sorbonne Université; Criteo AI Lab)

In the last few years, many different methods have been focusing on using deep recurrent neural networks for natural language generation. The most widely used sequence-to-sequence neural methods are word-based: as such, they need a pre-processing step called delexicalization (conversely, relexicalization) to deal with uncommon or unknown words. These forms of processing, however, give rise to models that depend on the vocabulary used and are not completely neural. In this work, we present an end-to-end sequence-to-sequence model with attention mechanism which reads and generates at a character level, no longer requiring delexicalization, tokenization, nor even lowercasing. Moreover, since characters constitute the common "building blocks" of every text, it also allows a more general approach to text generation, enabling the possibility to exploit transfer learning for training. These skills are obtained thanks to two major features: (*) the possibility to alternate between the standard generation mechanism and a copy one, which allows input facts to be copied directly to produce outputs, and (*) the use of an original training pipeline that further improves the quality of the generated texts. We also introduce a new dataset called E2E+, designed to highlight the copying capabilities of character-based models, that is a modified version of the well-known E2E dataset used in the E2E Challenge. We tested our model according to five broadly accepted metrics (including the widely used BLEU), showing that it yields competitive performance with respect to both character-based and word-based approaches.
View full article (Reproducible Research)
Thu, 15:00 - 15:20 @ 0.001 (Poster@Thu) Clustering
A Framework for Parallelizing Hierarchical Clustering Methods (149)
Silvio Lattanzi (Google Zürich), Thomas Lavastida (Carnegie Mellon University), Kefu Lu (Washington and Lee University), Benjamin Moseley (Carnegie Mellon University)

Hierarchical clustering is a fundamental tool in data mining, machine learning and statistics. Popular hierarchical clustering algorithms include top-down divisive approaches such as bisecting k-means, k-median, and k-center and bottom-up agglomerative approaches such as single-linkage, average-linkage, and centroid-linkage. Unfortunately, only a few scalable hierarchical clustering algorithms are known, mostly based on the single-linkage algorithm. So, as datasets increase in size every day, there is a pressing need to scale other popular methods. We introduce efficient distributed algorithms for bisecting k-means, k-median, and k-center as well as centroid-linkage. In particular, we first formalize a notion of closeness for a hierarchical clustering algorithm, and then we use this notion to design new scalable distributed methods with strong worst case bounds on the running time and the quality of the solutions. Finally, we show experimentally that the introduced algorithms are efficient and close to their sequential variants in practice.
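For readers unfamiliar with the divisive side, here is a plain sequential sketch of bisecting k-means; the paper's contribution is the distributed versions and their guarantees, which are not reproduced here.

```python
import numpy as np
from sklearn.cluster import KMeans

# Sequential bisecting k-means: repeatedly pick the largest cluster and split it
# in two with 2-means until k clusters remain.
def bisecting_kmeans(X, k, seed=0):
    clusters = [np.arange(len(X))]
    while len(clusters) < k:
        idx = max(range(len(clusters)), key=lambda i: len(clusters[i]))  # largest cluster
        members = clusters.pop(idx)
        labels = KMeans(n_clusters=2, n_init=10, random_state=seed).fit_predict(X[members])
        clusters.append(members[labels == 0])
        clusters.append(members[labels == 1])
    return clusters

X = np.random.default_rng(0).normal(size=(300, 2))
print([len(c) for c in bisecting_kmeans(X, k=4)])
```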
View full article
Continual Rare-Class Recognition with Emerging Novel Subclasses (152)
Hung Nguyen (Carnegie Mellon University), Xuejian Wang (Carnegie Mellon University), Leman Akoglu (Carnegie Mellon University)

Given a labeled dataset that contains a rare (or minority) class of of-interest instances, as well as a large class of instances that are not of interest, how can we learn to recognize future of-interest instances over a continuous stream? We introduce RaRecognize, which (i) estimates a general decision boundary between the rare and the majority class, (ii) learns to recognize individual rare subclasses that exist within the training data, as well as (iii) flags instances from previously unseen rare subclasses as newly emerging. The learner in (i) is general in the sense that by construction it is dissimilar to the specialized learners in (ii), thus distinguishes minority from the majority without overly tuning to what is seen in the training data. Thanks to this generality, RaRecognize ignores all future instances that it labels as majority and recognizes the recurrent as well as emerging rare subclasses only. This saves effort at test time as well as ensures that the model size grows moderately over time as it only maintains specialized minority learners. Through extensive experiments, we show that RaRecognize outperforms state-of-the-art baselines on three real-world datasets that contain corporate-risk and disaster documents as rare classes.
View full article (Reproducible Research)
Unsupervised and Active Learning using Maximin-based Anomaly Detection (161)
Zahra Ghafoori (University of Melbourne), James C. Bezdek (University of Melbourne), Christopher Leckie (University of Melbourne), Shanika Karunasekera (University of Melbourne)

Unsupervised anomaly detection is commonly performed using a distance or density based technique, such as K-Nearest neighbours, Local Outlier Factor or One-class Support Vector Machines. One-class Support Vector Machines reduce the computational cost of testing new data by providing sparse solutions. However, all these techniques have relatively high computational requirements for training. Moreover, identifying anomalies based solely on density or distance is not sufficient when both point (isolated) and cluster anomalies exist in an unlabelled training set. Finally, these unsupervised anomaly detection techniques are not readily adapted for active learning, where the training algorithm should identify examples for which labelling would make a significant impact on the accuracy of the learned model. In this paper, we propose a novel technique called Maximin-based Anomaly Detection that addresses these challenges by selecting a representative subset of data in combination with a kernel-based model construction. We show that the proposed technique (a) provides a statistically significant improvement in the accuracy as well as the computation time required for training and testing compared to several benchmark unsupervised anomaly detection techniques, and (b) effectively uses active learning with a limited budget.
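A small sketch of the maximin (farthest-point) selection step suggested by the title; the kernel-based model built on the selected representatives, and the active-learning loop, are not shown, and the details here are assumptions.

```python
import numpy as np

# Greedy maximin selection: pick the point whose distance to the already-selected
# representatives is largest (farthest-point traversal).
def maximin_select(X, m):
    selected = [0]                                   # start from an arbitrary point
    dist = np.linalg.norm(X - X[0], axis=1)          # distance to the selected set
    for _ in range(m - 1):
        nxt = int(np.argmax(dist))                   # maximin choice
        selected.append(nxt)
        dist = np.minimum(dist, np.linalg.norm(X - X[nxt], axis=1))
    return selected

X = np.random.default_rng(0).normal(size=(500, 3))
print(maximin_select(X, m=10))
```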
View full article (Reproducible Research)
Integrating Learning and Reasoning with Deep Logic Models (182)
Giuseppe Marra (University of Florence; University of Siena), Francesco Giannini (University of Siena), Michelangelo Diligenti (University of Siena), Marco Gori (University of Siena)

Deep learning is very effective at jointly learning feature representations and classification models, especially when dealing with high dimensional input patterns. Probabilistic logic reasoning, on the other hand, is capable of taking consistent and robust decisions in complex environments. The integration of deep learning and logic reasoning is still an open research problem, and it is considered to be the key for the development of real intelligent agents. This paper presents Deep Logic Models, which are deep graphical models integrating deep learning and logic reasoning both for learning and inference. Deep Logic Models create an end-to-end differentiable architecture, where deep learners are embedded into a network implementing a continuous relaxation of the logic knowledge. The learning process jointly learns the weights of the deep learners and the meta-parameters controlling the high-level reasoning. The experimental results show that the proposed methodology overcomes the limitations of the other approaches that have been proposed to bridge deep learning and reasoning.
View full article
LYRICS: a General Interface Layer to Integrate Logic Inference and Deep Learning (183)
Giuseppe Marra (University of Florence; University of Siena), Francesco Giannini (University of Siena), Michelangelo Diligenti (University of Siena), Marco Gori (University of Siena)

In spite of the amazing results obtained by deep learning in many applications, a real intelligent behavior of an agent acting in a complex environment is likely to require some kind of higher-level symbolic inference. Therefore, there is a clear need for the definition of a general and tight integration between low-level tasks, processing sensorial data that can be effectively elaborated using deep learning techniques, and the logic reasoning that allows humans to take decisions in complex environments. This paper presents LYRICS, a generic interface layer for AI, which is implemented in TensorFlow (TF). LYRICS provides an input language that allows arbitrary First Order Logic (FOL) background knowledge to be defined. The predicates and functions of the FOL knowledge can be bound to any TF computational graph, and the formulas are converted into a set of real-valued constraints, which participate in the overall optimization problem. This makes it possible to learn the weights of the learners under the constraints imposed by the prior knowledge. The framework is extremely general as it imposes no restrictions in terms of which models or knowledge can be integrated. In this paper, we show the generality of the approach through several use cases of the presented language, including model checking, supervised learning and collective classification.
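To illustrate what "formulas converted into real-valued constraints" can look like, the sketch below relaxes a single rule ∀x: A(x) → B(x) with the Reichenbach implication 1 - a + a·b and turns its violation into a penalty term. LYRICS supports other relaxations and far more general formulas, so treat this as an assumption-laden toy rather than the library's API.

```python
import numpy as np

# Toy relaxation of the rule "forall x: A(x) -> B(x)" with predicate outputs in [0, 1].
def implies(a, b):
    return 1.0 - a + a * b            # Reichenbach-style fuzzy implication

def rule_constraint_loss(a_scores, b_scores):
    truth = implies(a_scores, b_scores)        # per-example truth degree of the rule
    return np.mean(1.0 - truth)                # penalise violations; add to the task loss

a = np.array([0.9, 0.2, 0.8])                  # e.g. outputs of predicate A on a batch
b = np.array([0.95, 0.1, 0.3])                 # outputs of predicate B
print(rule_constraint_loss(a, b))              # the third example violates the rule most
```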
View full article
Online Linear Models for Edge Computing (185)
Hadar Sivan (Technion), Moshe Gabel (University of Toronto), Assaf Schuster (Technion)

Maintaining an accurate trained model on an infinite data stream is challenging due to concept drifts that render a learned model inaccurate. Updating the model periodically can be expensive, and so traditional approaches for computationally limited devices involve a variation of online or incremental learning, which tend to be less robust. The advent of heterogeneous architectures and Internet-connected devices gives rise to a new opportunity. A weak processor can call upon a stronger processor or a cloud server to perform a complete batch training pass once a concept drift is detected - trading power or network bandwidth for increased accuracy. We capitalize on this opportunity in two steps. We first develop a computationally efficient bound for changes in any linear model with convex, differentiable loss. We then propose a sliding window-based algorithm that uses a small number of batch model computations to maintain an accurate model of the data stream. It uses the bound to continuously evaluate the difference between the parameters of the existing model and a hypothetical optimal model, triggering computation only as needed. Empirical evaluation on real and synthetic datasets shows that our proposed algorithm adapts well to concept drifts and provides a better tradeoff between the number of model computations and model accuracy than classic concept drift detectors. When predicting changes in electricity prices, for example, we achieve 6
View full article
Novel Dense Subgraph Discovery Primitives: Risk Aversion and Exclusion Queries (192)
Charalampos E. Tsourakakis (Boston University), Tianyi Chen (Boston University), Naonori Kakimura (Keio University), Jakub Pachocki (OpenAI)

In the densest subgraph problem, given an undirected graph G(V,E,w) with non-negative edge weights we are asked to find a set of nodes S⊆ V that maximizes the degree density w(S)/|S|, where w(S) is the sum of the weights of the edges in the graph induced by S. This problem is solvable in polynomial time, and in general is well studied. But what happens when the edge weights are negative? Is the problem still solvable in polynomial time? Also, why should we care about the densest subgraph problem in the presence of negative weights? In this work we answer the aforementioned questions. Specifically, we provide two novel graph mining primitives that are applicable to a wide variety of applications. Our primitives can be used to answer questions such as "how can we find a dense subgraph in Twitter with lots of replies and mentions but no follows?", "how do we extract a dense subgraph with high expected reward and low risk from an uncertain graph"? We formulate both problems mathematically as special instances of dense subgraph discovery in graphs with negative weights. We study the hardness of the problem, and we prove that the problem in general is NP-hard, but we also provide sufficient conditions under which it is poly-time solvable. We design an efficient approximation algorithm that works best in the presence of small negative weights, and an effective heuristic for the more general case. Finally, we perform experiments on various real-world datasets that verify the value of the proposed primitives, and the effectiveness of our proposed algorithms.
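For context, here is the classic greedy-peeling heuristic for the degree density w(S)/|S| with non-negative weights (Charikar-style). The paper's point is precisely that negative weights break such simple approaches, so this is background rather than the authors' algorithm.

```python
# Greedy peeling baseline for degree density w(S)/|S| on non-negative edge weights.
def greedy_peel(weights):
    """weights: dict {(u, v): w}.  Returns the best node set found by peeling."""
    nodes = {u for e in weights for u in e}
    adj = {u: {} for u in nodes}
    for (u, v), w in weights.items():
        adj[u][v] = adj[v][u] = w
    deg = {u: sum(adj[u].values()) for u in nodes}        # weighted degrees
    total = sum(weights.values())
    best, best_density = set(nodes), total / len(nodes)
    remaining = set(nodes)
    while len(remaining) > 1:
        u = min(remaining, key=deg.get)                   # peel the lowest weighted degree
        remaining.discard(u)
        total -= deg[u]
        for v, w in adj[u].items():
            if v in remaining:
                deg[v] -= w
        density = total / len(remaining)
        if density > best_density:
            best, best_density = set(remaining), density
    return best, best_density

edges = {(0, 1): 1.0, (1, 2): 1.0, (0, 2): 1.0, (2, 3): 0.1}
print(greedy_peel(edges))      # the triangle {0, 1, 2} is the densest subgraph here
```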
View full article (Reproducible Research)
Interpretable Discriminative Dimensionality Reduction and Feature Selection on the Manifold (210)
Babak Hosseini (Bielefeld University), Barbara Hammer (Bielefeld University)

Dimensionality reduction (DR) on the manifold includes effective methods which project the data from an implicit relational space onto a vectorial space. Regardless of the achievements in this area, these algorithms suffer from the lack of interpretation of the projection dimensions. Therefore, it is often difficult to explain the physical meaning behind the embedding dimensions. In this research, we propose the interpretable kernel DR algorithm (I-KDR) as a new algorithm which maps the data from the feature space to a lower dimensional space where the classes are more condensed with less overlapping. Besides, the algorithm creates the dimensions upon local contributions of the data samples, which makes it easier to interpret them by class labels. Additionally, we efficiently fuse the DR with the feature selection task to select the most relevant features of the original space to the discriminative objective. Based on the empirical evidence, I-KDR provides better interpretations for embedding dimensions as well as higher discriminative performance in the embedded space compared to the state-of-the-art and popular DR algorithms.
View full article (Reproducible Research)
On the Stability of Feature Selection in the Presence of Feature Correlations (211)
Konstantinos Sechidis (University of Manchester), Konstantinos Papangelou (University of Manchester), Sarah Nogueira (Criteo, Paris), James Weatherall (Advanced Analytics Centre, Global Medicines Development, AstraZeneca, Cambridge), Gavin Brown (University of Manchester)

Feature selection is central to modern data science. The 'stability' of a feature selection algorithm refers to the sensitivity of its choices to small changes in training data. This is, in effect, the robustness of the chosen features. This paper considers the estimation of stability when we expect strong pairwise correlations, otherwise known as feature redundancy. We demonstrate that existing measures are inappropriate here, as they systematically underestimate the true stability, giving an overly pessimistic view of a feature set. We propose a new statistical measure which overcomes this issue, and generalises previous work.
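A sketch of the kind of frequency-based stability estimate the paper argues is too pessimistic under feature redundancy; the formula follows the commonly used Nogueira-style measure and is an assumption here, not the paper's new correlation-aware measure.

```python
import numpy as np

# Z is an M x d binary matrix: Z[i, f] = 1 if run i selected feature f.
def stability(Z):
    Z = np.asarray(Z, dtype=float)
    M, d = Z.shape
    p = Z.mean(axis=0)                               # selection frequency per feature
    var_f = M / (M - 1) * p * (1 - p)                # unbiased per-feature variance
    k_bar = Z.sum(axis=1).mean()                     # average selected-subset size
    return 1.0 - var_f.mean() / ((k_bar / d) * (1 - k_bar / d))

Z = [[1, 1, 0, 0, 0],
     [1, 1, 0, 0, 0],
     [1, 0, 1, 0, 0]]
print(stability(Z))      # values near 1 indicate stable selection across runs
```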
View full article (Reproducible Research)
The Elliptical Basis Function Data Descriptor (EBFDD) Network - A One-Class Classification Approach to Anomaly Detection (212)
Mehran H. Z. Bazargani (The Insight Centre for Data Analytics, School of Computer Science, University College Dublin), Brian Mac Namee (The Insight Centre for Data Analytics, School of Computer Science, University College Dublin)

This paper introduces the Elliptical Basis Function Data Descriptor (EBFDD) network, a one-class classification approach to anomaly detection based on Radial Basis Function (RBF) neural networks. The EBFDD network uses elliptical basis functions, which allows it to learn sophisticated decision boundaries while retaining the advantages of a shallow network. We have proposed a novel cost function, whose minimisation results in a trained anomaly detector that only requires examples of the normal class at training time. The paper includes a large benchmark experiment that evaluates the performance of the EBFDD network and compares it to state-of-the-art one-class classification algorithms including the One-Class Support Vector Machine and the Isolation Forest. The experiments show that, overall, the EBFDD network outperforms the state-of-the-art approaches.
View full article (Reproducible Research)
Learning 3D Navigation Protocols on Touch Interfaces with Cooperative Multi-Agent Reinforcement Learning (213)
Quentin Debard (Itekube, LIRIS), Jilles Steeve Dibangoye (Inria, CITI-Lab, INSA-Lyon), Stéphane Canu (LITIS, INSA-Rouen), Christian Wolf (LIRIS, Inria, CITI-Lab, INSA-Lyon)

Using touch devices to navigate in virtual 3D environments such as computer assisted design (CAD) models or geographical information systems (GIS) is inherently difficult for humans, as the 3D operations have to be performed by the user on a 2D touch surface. This ill-posed problem is classically solved with a fixed and handcrafted interaction protocol, which must be learned by the user. We propose to automatically learn a new interaction protocol allowing a 2D user input to be mapped to 3D actions in virtual environments using reinforcement learning (RL). A fundamental problem of RL methods is the vast amount of interactions often required, which are difficult to come by when humans are involved. To overcome this limitation, we make use of two collaborative agents. The first agent models the human by learning to perform the 2D finger trajectories. The second agent acts as the interaction protocol, interpreting the 2D finger trajectories from the first agent and translating them into 3D operations. We restrict the learned 2D trajectories to be similar to a training set of collected human gestures by first performing state representation learning, prior to reinforcement learning. This state representation learning is addressed by projecting the gestures into a latent space learned by a variational autoencoder (VAE).
View full article
Unjustified Classification Regions and Counterfactual Explanations In Machine Learning (226)
Thibault Laugel (Sorbonne Université), Marie-Jeanne Lesot (Sorbonne Université), Christophe Marsala (Sorbonne Université), Xavier Renard (AXA, Paris), Marcin Detyniecki (Sorbonne Université; AXA, Paris; Polish Academy of Science)

Post-hoc interpretability approaches, although powerful tools to generate explanations for predictions made by a trained black-box model, have been shown to be vulnerable to issues caused by lack of robustness of the classifier. In particular, this paper focuses on the notion of explanation justification, defined as connectedness to ground-truth data, in the context of counterfactuals. In this work, we explore the extent of the risk of generating unjustified explanations. We propose an empirical study to assess the vulnerability of classifiers and show that the chosen learning algorithm heavily impacts the vulnerability of the model. Additionally, we show that state-of-the-art post-hoc counterfactual approaches can minimize the impact of this risk by generating less local explanations.
View full article (Reproducible Research)
Deep Eyedentification: Biometric Identification using Micro-Movements of the Eye (231)
Lena A. Jäger (University of Potsdam), Silvia Makowski (University of Potsdam), Paul Prasse (University of Potsdam), Sascha Liehr (Independent researcher), Maximilian Seidler (University of Potsdam), Tobias Scheffer (University of Potsdam)

We study involuntary micro-movements of the eye for biometric identification. While prior studies extract lower-frequency macro-movements from the output of video-based eye-tracking systems and engineer explicit features of these macro-movements, we develop a deep convolutional architecture that processes the raw eye-tracking signal. Compared to prior work, the network attains a lower error rate by one order of magnitude and is faster by two orders of magnitude: it identifies users accurately within seconds.
View full article (Reproducible Research)
SLSGD: Secure and Efficient Distributed On-device Machine Learning (239)
Cong Xie (University of Illinois at Urbana-Champaign), Oluwasanmi Koyejo (University of Illinois at Urbana-Champaign), Indranil Gupta (University of Illinois at Urbana-Champaign)

We consider distributed on-device learning with limited communication and security requirements. We propose a new robust distributed optimization algorithm with efficient communication and attack tolerance. The proposed algorithm has provable convergence and robustness under non-IID settings. Empirical results show that the proposed algorithm stabilizes the convergence and tolerates data poisoning on a small number of workers.
View full article (Reproducible Research)
Neural Control Variates for Monte Carlo Variance Reduction (265)
Ruosi Wan (Peking University), Mingjun Zhong (University of Lincoln), Haoyi Xiong (Baidu Inc.), Zhanxing Zhu (Peking University; Beijing Institute of Big Data Research)

In statistics and machine learning, approximation of an intractable integration is often achieved by using the unbiased Monte Carlo estimator, but the variances of the estimation are generally high in many applications. Control variates approaches are well-known to reduce the variance of the estimation. These control variates are typically constructed by employing predefined parametric functions or polynomials, determined by using those samples drawn from the relevant distributions. Instead, we propose to construct those control variates by learning neural networks to handle the cases when test functions are complex. In many applications, obtaining a large number of samples for Monte Carlo estimation is expensive; in such cases, the adoption of the original loss function may result in severe overfitting when training a neural network. This issue was not reported in the literature on control variates with neural networks. We thus further introduce constrained control variates with neural networks to alleviate the overfitting issue. We apply the proposed control variates to both toy and real data problems, including a synthetic data problem, Bayesian model evidence evaluation and Bayesian neural networks. Experimental results demonstrate that our method can achieve significant variance reduction compared to other methods.
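As background for the variance-reduction mechanism, the sketch below shows a classical control variate with a hand-picked function of known mean; the paper's contribution is to learn the control function with a neural network (and to constrain it against overfitting), which is not shown here.

```python
import numpy as np

# Classical control variate: to estimate E[f(X)], subtract c * (h(X) - E[h(X)])
# for a function h with known mean, choosing c to minimise variance.
rng = np.random.default_rng(0)
x = rng.normal(size=100_000)
f = np.exp(x)                       # target: E[exp(X)] = exp(0.5) for X ~ N(0, 1)
h = x                               # control variate with known mean E[X] = 0

c = np.cov(f, h)[0, 1] / np.var(h)  # variance-optimal coefficient
plain = f.mean()
controlled = (f - c * (h - 0.0)).mean()
print(plain, controlled, np.exp(0.5))
print(np.var(f), np.var(f - c * h))   # the controlled estimator has lower variance
```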
View full article
Node Representation Learning for Directed Graphs (275)
Megha Khosla (L3S Research Center, Hannover), Jurek Leonhardt (L3S Research Center, Hannover), Wolfgang Nejdl (L3S Research Center, Hannover), Avishek Anand (L3S Research Center, Hannover)

We propose a novel approach for learning node representations in directed graphs, which maintains separate views or embedding spaces for the two distinct node roles induced by the directionality of the edges. We argue that the previous approaches either fail to encode the edge directionality or their encodings cannot be generalized across tasks. With our simple alternating random walk strategy, we generate role-specific vertex neighborhoods and train node embeddings in their corresponding source/target roles while fully exploiting the semantics of directed graphs. We also unearth the limitations of evaluations on directed graphs in previous works and propose a clear strategy for evaluating link prediction and graph reconstruction in directed graphs. We conduct extensive experiments to showcase our effectiveness on several real-world datasets on link prediction, node classification and graph reconstruction tasks. We show that the embeddings from our approach are indeed robust, generalizable and well performing across multiple kinds of tasks and graphs. We show that we consistently outperform all baselines for the node classification task. In addition to providing a theoretical interpretation of our method, we also show that we are considerably more robust than the other directed graph approaches.
View full article (Reproducible Research)
Adversarial Invariant Feature Learning with Accuracy Constraint for Domain Generalization (278)
Kei Akuzawa (University of Tokyo), Yusuke Iwasawa (University of Tokyo), Yutaka Matsuo (University of Tokyo)

Learning domain-invariant representation is a dominant approach for domain generalization (DG), where we need to build a classifier that is robust toward domain shifts. However, previous domain-invariance-based methods overlooked the underlying dependency of classes on domains, which is responsible for the trade-off between classification accuracy and domain invariance. Because the primary purpose of DG is to classify unseen domains rather than the invariance itself, the improvement of the invariance can negatively affect DG performance under this trade-off. To overcome the problem, this study first expands the analysis of the trade-off by Xie et al., and provides the notion of accuracy-constrained domain invariance, which means the maximum domain invariance within a range that does not interfere with accuracy. We then propose a novel method, adversarial feature learning with accuracy constraint (AFLAC), which explicitly leads to that invariance on adversarial training. Empirical validations show that the performance of AFLAC is superior to that of domain-invariance-based methods on both synthetic and three real-world datasets, supporting the importance of considering the dependency and the efficacy of the proposed method.
View full article (Reproducible Research)
Data Association with Gaussian Processes (286)
Markus Kaiser (Siemens AG; Technical University of Munich), Clemens Otte (Siemens AG), Thomas A. Runkler (Siemens AG; Technical University of Munich), Carl Henrik Ek (University of Bristol)

The data association problem is concerned with separating data coming from different generating processes, for example when data comes from different data sources, contain significant noise, or exhibit multimodality. We present a fully Bayesian approach to this problem. Our model is capable of simultaneously solving the data association problem and the induced supervised learning problem. Underpinning our approach is the use of Gaussian process priors to encode the structure of both the data and the data associations. We present an efficient learning scheme based on doubly stochastic variational inference and discuss how it can be applied to deep Gaussian process priors.
View full article
PP-PLL: Probability Propagation for Partial Label Learning (296)
Kaiwei Sun (Chongqing University of Posts and Telecommunications), Zijian Min (Chongqing University of Posts and Telecommunications)

Partial label learning (PLL) is a weakly supervised learning framework which learns from the data where each example is associated with a set of candidate labels, among which only one is correct. Most existing approaches are based on the disambiguation strategy, which either identifies the valid label iteratively or treats each candidate label equally based on the averaging strategy. In both cases, the disambiguation strategy shares a common shortcoming that the ground-truth label may be overwhelmed by the false positive candidate labels, especially when the number of candidate labels becomes large. In this paper, a probability propagation method for partial label learning (PP-PLL) is proposed. Specifically, based on the manifold assumption, a biconvex regular function is proposed to model the linear mapping relationships between input features and output true labels. In PP-PLL, the topological relations among training samples are used as additional information to strengthen the mutual exclusiveness among candidate labels, which helps to prevent the ground-truth label from being overwhelmed by a large number of candidate labels. Experimental studies on both artificial and real-world data sets demonstrate that the proposed PP-PLL method can achieve superior or comparable performance against the state-of-the-art methods.
View full article (Reproducible Research)
Quantile Layers: Statistical Aggregation in Deep Neural Networks for Eye Movement Biometrics (321)
Ahmed Abdelwahab (Leibniz Institute of Agricultural Engineering and Bioeconomy e.V. (ATB), Potsdam), Niels Landwehr (Leibniz Institute of Agricultural Engineering and Bioeconomy e.V. (ATB), Potsdam)

Human eye gaze patterns are highly individually characteristic. Gaze patterns observed during the routine access of a user to a device or document can therefore be used to identify subjects unobtrusively, that is, without the need to perform an explicit verification such as entering a password. Existing approaches to biometric identification from gaze patterns segment raw gaze data into short, local patterns called saccades and fixations. Subjects are then identified by characterizing the distribution of these patterns or deriving hand-crafted features for them. In this paper, we follow a different approach by training deep neural networks directly on the raw gaze data. As the distribution of short, local patterns has been shown to be particularly informative for distinguishing subjects, we introduce a parameterized and end-to-end learnable statistical aggregation layer called the quantile layer that enables the network to explicitly fit the distribution of filter activations in preceding layers. We empirically show that deep neural networks with quantile layers outperform existing probabilistic and feature-based methods for identifying subjects based on eye movements by a large margin.
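A forward-only sketch of quantile-based aggregation over filter activations; the paper's quantile layer is parameterized and trained end-to-end, so this numpy version only conveys the flavour of the statistic being fitted.

```python
import numpy as np

# Summarise each filter's activation sequence by a fixed set of quantiles
# instead of mean/max pooling.
def quantile_pool(activations, qs=(0.1, 0.25, 0.5, 0.75, 0.9)):
    """activations: (time_steps, n_filters) -> (n_filters * len(qs),) feature vector."""
    return np.quantile(activations, qs, axis=0).T.reshape(-1)

acts = np.random.default_rng(0).normal(size=(1000, 32))  # e.g. conv outputs over a gaze sequence
features = quantile_pool(acts)
print(features.shape)      # (160,) = 32 filters x 5 quantiles
```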
View full article (Reproducible Research)
Thu, 15:20 - 15:40 @ 0.001 (Poster@Thu) Clustering
Heavy-tailed kernels reveal a finer cluster structure in t-SNE visualisations (327)
Dmitry Kobak (University of Tübingen), George Linderman (Yale University), Stefan Steinerberger (Yale University), Yuval Kluger (Yale University; Yale School of Medicine), Philipp Berens (University of Tübingen)

T-distributed stochastic neighbour embedding (t-SNE) is a widely used data visualisation technique. It differs from its predecessor SNE by the low-dimensional similarity kernel: the Gaussian kernel was replaced by the heavy-tailed Cauchy kernel, solving the 'crowding problem' of SNE. Here, we develop an efficient implementation of t-SNE for a t-distribution kernel with an arbitrary degree of freedom ν, with ν→∞ corresponding to SNE and ν=1 corresponding to the standard t-SNE. Using theoretical analysis and toy examples, we show that ν<1 can further reduce the crowding problem and reveal finer cluster structure that is invisible in standard t-SNE. We further demonstrate the striking effect of heavier-tailed kernels on large real-life data sets such as MNIST, single-cell RNA-sequencing data, and the HathiTrust library. We use domain knowledge to confirm that the revealed clusters are meaningful. Overall, we argue that modifying the tail heaviness of the t-SNE kernel can yield additional insight into the cluster structure of the data.
View full article (Reproducible Research)
Thu, 14:40 - 15:00 @ 0.001 (Poster@Thu) Clustering
Uncovering Hidden Block Structure for Clustering (337)
Luce le Gorrec (Université de Toulouse), Sandrine Mouysset (Université de Toulouse), Philip A. Knight (University of Strathclyde), Iain S. Duff (STFC Rutherford Appleton Laboratory; CERFACS, Toulouse), Daniel Ruiz (Université de Toulouse)

We present a multistage procedure to cluster directed and undirected weighted graphs by finding the block structure of their adjacency matrices. A central part of the process is to scale the adjacency matrix into a doubly-stochastic form, which permits detection of the whole matrix block structure with minimal spectral information (theoretically a single pair of singular vectors suffices). We present the different stages of our method, namely the impact of the doubly-stochastic scaling on singular vectors, detection of the block structure by means of these vectors, and details such as cluster refinement and a stopping criterion. Then we test the algorithm's effectiveness by using it on two unsupervised classification tasks: community detection in networks and shape detection in clouds of points in two dimensions. By comparing results of our approach with those of widely used algorithms designed for specific purposes, we observe that our method is competitive (for community detection) if not superior (for shape detection) in comparison with existing methods.
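A compact sketch of the doubly-stochastic scaling plus singular-vector step on a synthetic two-block matrix; the refinement stages and stopping criterion described in the abstract are omitted, and the planted example is an assumption for illustration.

```python
import numpy as np

# Scale a positive matrix towards doubly-stochastic form with Sinkhorn-Knopp
# iterations, then inspect a leading pair of singular vectors for block structure.
def sinkhorn(A, n_iter=500):
    A = A.astype(float).copy()
    for _ in range(n_iter):
        A /= A.sum(axis=1, keepdims=True)       # normalise rows
        A /= A.sum(axis=0, keepdims=True)       # normalise columns
    return A

rng = np.random.default_rng(0)
A = rng.random((40, 40)) * 0.05
A[:20, :20] += 1.0
A[20:, 20:] += 1.0                              # planted 2-block structure
S = sinkhorn(A)
U, s, Vt = np.linalg.svd(S)
print(np.sign(U[:, 1]))    # the sign pattern of the second singular vector separates the blocks
```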
View full article (Reproducible Research)
Safe Policy Improvement with Soft Baseline Bootstrapping (339)
Kimia Nadjahi (Télécom Paris), Romain Laroche (Microsoft Research Montréal), Rémi Tachet des Combes (Microsoft Research Montréal)

Batch Reinforcement Learning (Batch RL) consists in training a policy using trajectories collected with another policy, called the behavioural policy. Safe policy improvement (SPI) provides guarantees with high probability that the trained policy performs better than the behavioural policy, also called the baseline in this setting. Previous work shows that the SPI objective improves mean performance as compared to using the basic RL objective, which boils down to solving the MDP with maximum likelihood. Here, we build on that work and improve more precisely the SPI with Baseline Bootstrapping algorithm (SPIBB) by allowing the policy search over a wider set of policies. Instead of binarily classifying the state-action pairs into two sets (the uncertain and the safe-to-train-on ones), we adopt a softer strategy that controls the error in the value estimates by constraining the policy change according to the local model uncertainty. The method can take more risks on uncertain actions all the while remaining provably safe, and is therefore less conservative than the state-of-the-art methods. We propose two algorithms (one optimal and one approximate) to solve this constrained optimization problem and empirically show a significant improvement over existing SPI algorithms both on finite MDPs and on infinite MDPs with a neural network function approximation.
View full article (Reproducible Research)
NSEEN: Neural Semantic Embedding for Entity Normalization (383)
Shobeir Fakhraei (University of Southern California), Joel Mathew (University of Southern California), José Luis Ambite (University of Southern California)

Much of human knowledge is encoded in text, available in scientific publications, books, and the web. Given the rapid growth of these resources, we need automated methods to extract such knowledge into machine-processable structures, such as knowledge graphs. An important task in this process is entity normalization, which consists of mapping noisy entity mentions in text to canonical entities in well-known reference sets. However, entity normalization is a challenging problem; there often are many textual forms for a canonical entity that may not be captured in the reference set, and entities mentioned in text may include many syntactic variations, or errors. The problem is particularly acute in scientific domains, such as biology. To address this problem, we have developed a general, scalable solution based on a deep Siamese neural network model to embed the semantic information about the entities, as well as their syntactic variations. We use these embeddings for fast mapping of new entities to large reference sets, and empirically show the effectiveness of our framework in challenging bio-entity normalization datasets.
picture_as_pdf View full article Reproducible Research
Tue, 15:00 - 15:20 @ 1.011 (Poster@Wed) Ranking
A Reduction of Label Ranking to Multiclass Classification (385)
Klaus Brinker (Hamm-Lippstadt University of Applied Sciences), Eyke Hüllermeier (Paderborn University)

Label ranking considers the problem of learning a mapping from instances to strict total orders over a predefined set of labels. In this paper, we present a framework for label ranking using a decomposition into a set of multiclass problems. Conceptually, our approach can be seen as a generalization of pairwise preference learning. In contrast to the latter, it allows for controlling the granularity of the decomposition, varying between binary preferences and complete rankings as extreme cases. It is specifically motivated by limitations of pairwise learning with regard to the minimization of certain loss functions. We discuss theoretical properties of the proposed method in terms of accuracy, error correction, and computational complexity. Experimental results are promising and indicate that improvements upon the special case of pairwise preference decomposition are indeed possible.
picture_as_pdf View full article
Tue, 14:20 - 14:40 @ 1.011 (Poster@Wed) Ranking
Learning to Calibrate and Rerank Multi-label Predictions (391)
Cheng Li (Northeastern University), Virgil Pavlu (Northeastern University), Javed Aslam (Northeastern University), Bingyu Wang (Northeastern University), Kechen Qin (Northeastern University)

A multi-label classifier assigns a set of labels to each data object. A natural requirement in many end-use applications is that the classifier also provides a well-calibrated confidence (probability) to indicate the likelihood of the predicted set being correct; for example, an application may automate high-confidence predictions while manually verifying low-confidence predictions. The simplest multi-label classifier, called Binary Relevance (BR), applies one binary classifier to each label independently and takes the product of the individual label probabilities as the overall label-set probability (confidence). Despite its many known drawbacks, such as generating suboptimal predictions and poorly calibrated confidence scores, BR is widely used in practice due to its speed and simplicity. We seek in this work to improve both BR's confidence estimation and prediction through a post calibration and reranking procedure. We take the BR predicted set of labels and its product score as features, extract more features from the prediction itself to capture label constraints, and apply Gradient Boosted Trees (GB) as a calibrator to map these features into a calibrated confidence score. GB not only produces well-calibrated scores (aligned with accuracy and sharp), but also models label interactions, correcting a critical flaw in BR. We further show that reranking label sets by the new calibrated confidence makes accurate set predictions on par with state-of-the-art multi-label classifiers - yet calibrated, simpler, and faster.
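A rough sketch of the pipeline described above, assuming scikit-learn and a synthetic dataset: Binary Relevance with a product-of-probabilities set confidence, followed by a gradient boosted calibrator trained on a held-out split. The calibration features here are illustrative placeholders rather than the feature set used in the paper.

```python
import numpy as np
from sklearn.datasets import make_multilabel_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, Y = make_multilabel_classification(n_samples=2000, n_classes=5, n_labels=3, random_state=0)
X_tr, X_cal, Y_tr, Y_cal = train_test_split(X, Y, test_size=0.5, random_state=0)

# Binary Relevance: one independent binary classifier per label.
br = [LogisticRegression(max_iter=1000).fit(X_tr, Y_tr[:, j]) for j in range(Y.shape[1])]

def br_predict(X):
    P = np.column_stack([clf.predict_proba(X)[:, 1] for clf in br])
    sets = (P >= 0.5).astype(int)                           # BR label-set prediction
    conf = np.prod(np.where(sets == 1, P, 1 - P), axis=1)   # product confidence
    return sets, conf, P

sets, conf, P = br_predict(X_cal)
# Illustrative calibration features: product score, set size, min/mean label probability.
feats = np.column_stack([conf, sets.sum(axis=1), P.min(axis=1), P.mean(axis=1)])
correct = (sets == Y_cal).all(axis=1).astype(int)           # was the whole set right?
calibrator = GradientBoostingClassifier(random_state=0).fit(feats, correct)
calibrated_conf = calibrator.predict_proba(feats)[:, 1]     # calibrated set confidence
```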
picture_as_pdf View full article Reproducible Research
Tue, 14:00 - 14:20 @ 1.011 (Poster@Wed) Ranking
Pairwise Learning to Rank by Neural Networks Revisited: Reconstruction, Theoretical Analysis and Practical Performance (400)
Marius Köppel (Johannes Gutenberg-Universität Mainz), Alexander Segner (Johannes Gutenberg-Universität Mainz), Martin Wagener (Johannes Gutenberg-Universität Mainz), Lukas Pensel (Johannes Gutenberg-Universität Mainz), Andreas Karwath (University of Birmingham), Stefan Kramer (Johannes Gutenberg-Universität Mainz)

We present a pairwise learning to rank approach based on a neural net, called DirectRanker, that generalizes the RankNet architecture. We show mathematically that our model is reflexive, antisymmetric, and transitive, allowing for simplified training and improved performance. Experimental results on the LETOR MSLR-WEB10K, MQ2007 and MQ2008 datasets show that our model outperforms numerous state-of-the-art methods, while being inherently simpler in structure and using a pairwise approach only.
picture_as_pdf View full article Reproducible Research
Practical Open-Loop Optimistic Planning (407)
Edouard Leurent (SequeL team, INRIA Lille - Nord Europe; Renault Group), Odalric-Ambrym Maillard (SequeL team, INRIA Lille - Nord Europe)

We consider the problem of online planning in a Markov Decision Process when given only access to a generative model, restricted to open-loop policies - i.e. sequences of actions - and under budget constraint. In this setting, the Open-Loop Optimistic Planning (OLOP) algorithm enjoys good theoretical guarantees but is overly conservative in practice, as we show in numerical experiments. We propose a modified version of the algorithm with tighter upper-confidence bounds, KL-OLOP, that leads to better practical performances while retaining the sample complexity bound. Finally, we propose an efficient implementation that significantly improves the time complexity of both algorithms.
picture_as_pdf View full article Reproducible Research
Link Prediction via Higher-Order Motif Features (408)
Ghadeer Abuoda (College of Science & Engineering, HBKU, Doha), Gianmarco De Francisci Morales (ISI Foundation, Turin), Ashraf Aboulnaga (Qatar Computing Research Institute, Doha)

Link prediction requires predicting which new links are likely to appear in a graph. In this paper, we present an approach for link prediction that relies on higher-order analysis of the graph topology, well beyond the typical approach which relies on common neighbors. We treat the link prediction problem as a supervised classification problem, and we propose a set of features that depend on the patterns or motifs that a pair of nodes occurs in. By using motifs of sizes 3, 4, and 5, our approach captures a high level of detail about the graph topology. In addition, we propose two optimizations to construct the classification dataset from the graph. First, we propose adding negative examples to the graph as an alternative to the common approach of removing positive ones. Second, we show that it is important to control for the shortest-path distance when sampling pairs of nodes to form negative examples, since the difficulty of prediction varies with the distance. We experimentally demonstrate that using our proposed motif features in off-the-shelf classifiers results in up to a 10 percentage point increase in accuracy over prior topology-based and feature-learning methods.
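A small sketch of the dataset-construction ideas mentioned above, using networkx and the karate club graph as a stand-in: a tiny motif-flavoured feature vector for a node pair (the paper uses motif counts up to size 5, far richer than this), and negative pairs sampled at a fixed shortest-path distance.

```python
import random
import networkx as nx

G = nx.karate_club_graph()

def pair_motif_features(G, u, v):
    """Tiny illustrative feature set for a node pair: the number of common
    neighbours (size-3 motifs that would close if (u, v) appeared) plus the
    two node degrees."""
    common = len(set(G[u]) & set(G[v]))
    return [common, G.degree(u), G.degree(v)]

def sample_negative_pairs(G, distance=2, k=20, seed=0):
    """Negative examples controlled for shortest-path distance."""
    rng = random.Random(seed)
    nodes = list(G.nodes())
    pairs = []
    while len(pairs) < k:
        u = rng.choice(nodes)
        lengths = nx.single_source_shortest_path_length(G, u, cutoff=distance)
        candidates = [v for v, d in lengths.items() if d == distance]
        if candidates:
            pairs.append((u, rng.choice(candidates)))
    return pairs

X_pos = [pair_motif_features(G, u, v) for u, v in G.edges()]
X_neg = [pair_motif_features(G, u, v) for u, v in sample_negative_pairs(G)]
```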
picture_as_pdf View full article Reproducible Research
Multitask Hopfield Networks (418)
Marco Frasca (Università degli Studi di Milano), Giuliano Grossi (Università degli Studi di Milano), Giorgio Valentini (Università degli Studi di Milano)

Multitask algorithms typically use task similarity information as a bias to speed up and improve the performance of learning processes. Tasks are learned jointly, sharing information across them, in order to construct models more accurate than those learned separately over single tasks. In this contribution, we present the first multitask model, to our knowledge, based on Hopfield Networks (HNs), named HoMTask. We show that by appropriately building a unique HN embedding all tasks, a more robust and effective classification model can be learned. HoMTask is a transductive semi-supervised parametric HN that minimizes an energy function extended to all nodes and to all tasks under study. We provide theoretical evidence that the optimal parameters automatically estimated by HoMTask make the model consistent with the prior knowledge (connection weights and node labels). The convergence properties of HNs are preserved, and the fixed point reached by the network dynamics gives rise to the prediction of unlabeled nodes. The proposed model improves the classification abilities of single-task HNs on a preliminary benchmark comparison, and achieves competitive performance with state-of-the-art semi-supervised graph-based algorithms.
picture_as_pdf View full article
An Engineered Empirical Bernstein Bound (435)
Mark A. Burgess (Australian National University), Archie C. Chapman (University of Sydney), Paul Scott (Australian National University)

We derive a tightened empirical Bernstein bound (EBB) on the variation of the sample mean from the population mean, and show that it improves the performance of upper confidence bound (UCB) methods in multi-armed bandit problems. Like other EBBs, our EBB is a concentration inequality for the variation of the sample mean in terms of the sample variance. Its derivation uses a combination of probability unions and Chernoff bounds for the mean of samples and mean of sample squares. Analysis reveals that our approach can tighten the best existing EBBs by about a third, and thereby halves the distance to a bound constructed with perfect variance information. We illustrate the practical usefulness of our novel EBB by applying it to a multi-armed bandit problem as a component of a UCB method. Our method outperforms existing approaches by producing lower expected regret than variants of UCB employing several other bounds, including state-of-the-art EBBs.
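For concreteness, the sketch below plugs the well-known Maurer-Pontil empirical Bernstein bound (one of the existing EBBs the paper tightens, not the new bound itself) into a UCB-style bandit loop on assumed Bernoulli arms.

```python
import numpy as np

def empirical_bernstein_ucb(samples, delta=0.05):
    """Upper confidence bound on the mean of samples in [0, 1], using the
    Maurer-Pontil empirical Bernstein bound (not the tightened bound
    derived in the paper)."""
    n = len(samples)
    mean = np.mean(samples)
    var = np.var(samples, ddof=1)
    slack = np.sqrt(2 * var * np.log(2 / delta) / n) + 7 * np.log(2 / delta) / (3 * (n - 1))
    return mean + slack

rng = np.random.default_rng(0)
true_means = [0.3, 0.5, 0.6]
# Two warm-up pulls per arm so the sample variance is defined.
pulls = [[float(rng.random() < m) for _ in range(2)] for m in true_means]

for t in range(1000):
    ucbs = [empirical_bernstein_ucb(np.array(p)) for p in pulls]
    arm = int(np.argmax(ucbs))
    pulls[arm].append(float(rng.random() < true_means[arm]))

print([len(p) for p in pulls])   # most pulls should go to the best arm
```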
picture_as_pdf View full article Reproducible Research
Neural Message Passing for Multi-Label Classification (438)
Jack Lanchantin (University of Virginia), Arshdeep Sekhon (University of Virginia), Yanjun Qi (University of Virginia)

Multi-label classification (MLC) is the task of assigning a set of target labels for a given sample. Modeling the combinatorial label interactions in MLC has been a long-haul challenge. We propose Label Message Passing (LaMP) Neural Networks to efficiently model the joint prediction of multiple labels. LaMP treats labels as nodes on a label-interaction graph and computes the hidden representation of each label node conditioned on the input using attention-based neural message passing. Attention enables LaMP to assign different importances to neighbor nodes per label, learning how labels interact (implicitly). The proposed models are simple, accurate, interpretable, structure-agnostic, and applicable for predicting dense labels since LaMP is incredibly parallelizable. We validate the benefits of LaMP on seven real-world MLC datasets, covering a broad spectrum of input/output types and outperforming the state-of-the-art results. Notably, LaMP enables intuitive interpretation of how classifying each label depends on the elements of a sample while at the same time relying on its interactions with other labels.
picture_as_pdf View full article Reproducible Research
CatchCore: Catching Hierarchical Dense Subtensor (451)
Wenjie Feng (CAS Key Laboratory of Network Data Science & Technology, Institute of Computing Technology, University of Chinese Academy of Sciences), Shenghua Liu (CAS Key Laboratory of Network Data Science & Technology, Institute of Computing Technology, University of Chinese Academy of Sciences), Huawei Shen (CAS Key Laboratory of Network Data Science & Technology, Institute of Computing Technology, University of Chinese Academy of Sciences), Xueqi Cheng (CAS Key Laboratory of Network Data Science & Technology, Institute of Computing Technology, University of Chinese Academy of Sciences)

Dense subtensor detection has achieved remarkable success in spotting anomalies and fraudulent behaviors in multi-aspect data (i.e., tensors), like in social media and event streams. Existing methods detect the densest subtensors flatly and separately, with an underlying assumption that those subtensors are exclusive. However, many real-world tensors usually present hierarchical properties, e.g., the core-periphery structure or dynamic communities in networks. In this paper, we propose CatchCore, a novel framework to effectively find hierarchical dense subtensors. We first design a unified metric for dense subtensor detection, which can be optimized with gradient-based methods. With the proposed metric, CatchCore detects hierarchical dense subtensors through hierarchy-wise alternating optimization. Finally, we utilize the minimum description length principle to measure the quality of the detection result and select the optimal hierarchical dense subtensors. Extensive experiments on synthetic and real-world datasets demonstrate that CatchCore outperforms the top competitors in accuracy for detecting dense subtensors and anomaly patterns. Additionally, CatchCore successfully identified a hierarchical researcher co-authorship group with intense interactions in the DBLP dataset. Meanwhile, CatchCore also scales linearly with all aspects of the tensors.
picture_as_pdf View full article Reproducible Research
Fast and Parallelizable Ranking with Outliers from Pairwise Comparisons (468)
Sungjin Im (University of California), Mahshid Montazer Qaem (University of California)

In this paper, we initiate the study of the problem of ordering objects from their pairwise comparison results when allowed to discard up to a certain number of objects as outliers. More specifically, we seek to find an ordering under the popular Kendall tau distance measure, i.e., minimizing the number of pairwise comparison results that are inconsistent with the ordering, with some outliers removed. The presence of outliers challenges the assumption that a global consistent ordering exists and obscures the measure. This problem does not admit a polynomial time algorithm unless NP ⊆ BPP, and therefore, we develop approximation algorithms with provable guarantees for all inputs. Our algorithms have running time and memory usage that are almost linear in the input size. Further, they are readily adaptable to run on massively parallel platforms such as MapReduce or Spark.
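The objective described above is easy to state in code: given a candidate ordering and a set of discarded outliers, count the pairwise comparison results that disagree with the ordering. The toy input below is assumed; the approximation algorithms from the paper, which search over orderings and outlier sets, are not reproduced.

```python
def kendall_tau_with_outliers(ordering, comparisons, outliers):
    """Number of pairwise comparison results inconsistent with `ordering`,
    ignoring every comparison that involves a discarded outlier.

    ordering    -- list of objects, best first
    comparisons -- iterable of (winner, loser) pairs
    outliers    -- set of objects removed from consideration
    """
    rank = {obj: i for i, obj in enumerate(ordering)}
    return sum(1 for winner, loser in comparisons
               if winner not in outliers and loser not in outliers
               and rank[winner] > rank[loser])

# Object 'd' behaves inconsistently; discarding it removes all disagreement
# with the ordering a > b > c.
comparisons = [('a', 'b'), ('b', 'c'), ('a', 'c'), ('d', 'a'), ('c', 'd'), ('d', 'b')]
print(kendall_tau_with_outliers(['a', 'b', 'c', 'd'], comparisons, outliers=set()))  # 2
print(kendall_tau_with_outliers(['a', 'b', 'c'], comparisons, outliers={'d'}))       # 0
```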
picture_as_pdf View full article Reproducible Research
SoRecGAT: Leveraging Graph Attention Mechanism for Top-N Social Recommendation (475)
Vijaikumar M (Indian Institute of Science), Shirish Shevade (Indian Institute of Science), M. N. Murty (Indian Institute of Science)

Social recommendation systems typically combine extra information like a social network with the user-item interaction network in order to alleviate data sparsity issues. This also helps in making more accurate and personalized recommendations. However, most of the existing systems work under the assumption that all socially connected users have equal influence on each other in a social network, which is not true in practice. Further, estimating the quantum of influence that exists among entities in a user-item interaction network is essential when only implicit ratings are available. This has been ignored even in many recent state-of-the-art models such as SAMN (Social Attentional Memory Network) and DeepSoR (Deep neural network model on Social Relations). Many a time, capturing a complex relationship between the entities (users/items) is essential to boost the performance of a recommendation system. We address these limitations by proposing a novel neural network model, SoRecGAT, which employs multi-head and multi-layer graph attention mechanism. The attention mechanism helps the model learn the influence of entities on each other more accurately. The proposed model also takes care of heterogeneity among the entities seamlessly. SoRecGAT is a general approach and we also validate its suitability when information in the form of a network of co-purchased items is available. Empirical results on eight real-world datasets demonstrate that the proposed model outperforms state-of-the-art models.
picture_as_pdf View full article Reproducible Research
Shift Happens: Adjusting Classifiers (479)
Theodore James Thibault Heiser (University of Tartu), Mari-Liis Allikivi (University of Tartu), Meelis Kull (University of Tartu)

Minimizing expected loss measured by a proper scoring rule, such as Brier score or log-loss (cross-entropy), is a common objective while training a probabilistic classifier. If the data have experienced dataset shift where the class distributions change post-training, then often the model's performance will decrease, over-estimating the probabilities of some classes while under-estimating the others on average. We propose unbounded and bounded general adjustment (UGA and BGA) methods that transform all predictions to (re-)equalize the average prediction and the class distribution. These methods act differently depending on which proper scoring rule is to be minimized, and we have a theoretical guarantee of reducing loss on test data, if the exact class distribution is known. We also demonstrate experimentally that, when in practice the class distribution is known only approximately, there is often still a reduction in loss depending on the amount of shift and the precision to which the class distribution is known.
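As a point of reference for what "re-equalizing the average prediction and the class distribution" means, the sketch below applies the classic prior-correction adjustment for label shift; it is not the UGA/BGA methods of the paper, which additionally guarantee a loss reduction for a chosen proper scoring rule.

```python
import numpy as np

def prior_shift_adjust(probs, train_priors, test_priors):
    """Textbook prior correction: reweight each predicted class probability by
    the ratio of new to old class priors, then renormalize per instance."""
    w = np.asarray(test_priors, dtype=float) / np.asarray(train_priors, dtype=float)
    adjusted = np.asarray(probs, dtype=float) * w
    return adjusted / adjusted.sum(axis=1, keepdims=True)

probs = np.array([[0.7, 0.2, 0.1],
                  [0.3, 0.4, 0.3]])
print(prior_shift_adjust(probs, train_priors=[0.5, 0.3, 0.2], test_priors=[0.2, 0.3, 0.5]))
```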
picture_as_pdf View full article Reproducible Research
Stochastic Activation Actor Critic Methods (483)
Wenling Shang (University of Amsterdam-Bosch-Deltalab), Herke van Hoof (University of Amsterdam-Bosch-Deltalab), Max Welling (University of Amsterdam-Bosch-Deltalab)

Stochastic elements in reinforcement learning (RL) have shown promise to improve exploration and handling of uncertainty, such as the utilization of stochastic weights in NoisyNets and stochastic policies in the maximum entropy RL frameworks. Yet effective and general approaches to include such elements in actor-critic models are still lacking. Inspired by the aforementioned techniques, we propose an effective way to inject randomness into actor-critic models to improve general exploratory behavior and reflect environment uncertainty. Specifically, randomness is added at the level of intermediate activations that feed into both policy and value functions to achieve better correlated and more complex perturbations. The proposed framework also features flexibility and simplicity, which allows straightforward adaptation to a variety of tasks. We test several actor-critic models enhanced with stochastic activations and demonstrate their effectiveness in a wide range of Atari 2600 games, a continuous control problem and a car racing task. Lastly, in a qualitative analysis, we present evidence of the proposed model adapting the noise in the policy and value functions to reflect uncertainty and ambiguity in the environment.
picture_as_pdf View full article Reproducible Research
An Algorithm for Reducing the Number of Distinct Branching Conditions in a Decision Forest (486)
Atsuyoshi Nakamura (Hokkaido University), Kento Sakurada (Hokkaido University)

Given a decision forest, we study a problem of reducing the number of its distinct branching conditions without changing each tree's structure while keeping classification performance. A decision forest with a smaller number of distinct branching conditions can not only have a smaller description length but also be implemented by hardware more efficiently. To force the modified decision forest to keep classification performance, we consider a condition that the decision paths at each branching node do not change for 100σ% of the given feature vectors passing through the node for a given 0 ≤ σ < 1. Under this condition, we propose an algorithm that minimizes the number of distinct branching conditions by sharing the same condition among multiple branching nodes. According to our experimental results using 13 datasets in the UCI machine learning repository, our algorithm succeeded more than 90
picture_as_pdf View full article
Beyond Bag-of-Concepts: Vectors of Locally Aggregated Concepts (489)
Maarten Grootendorst (Jheronimus Academy of Data Science), Joaquin Vanschoren (Eindhoven University of Technology)

Bag-of-Concepts, a model that counts the frequency of clustered word embeddings (i.e., concepts) in a document, has demonstrated the feasibility of leveraging clustered word embeddings to create features for document representation. However, information is lost as the word embeddings themselves are not used in the resulting feature vector. This paper presents a novel text representation method, Vectors of Locally Aggregated Concepts (VLAC). Like Bag-of-Concepts, it clusters word embeddings for its feature generation. However, instead of counting the frequency of clustered word embeddings, VLAC takes each cluster's sum of residuals with respect to its centroid and concatenates those to create a feature vector. The resulting feature vectors contain more discriminative information than Bag-of-Concepts due to the additional inclusion of these first-order statistics. The proposed method is tested on four different data sets for single-label classification and compared with several baselines, including TF-IDF and Bag-of-Concepts. Results indicate that combining the features of VLAC with TF-IDF yields significant improvements in performance, regardless of which word embeddings are used.
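A minimal sketch of the feature construction described above, assuming random vectors as stand-ins for pre-trained word embeddings: cluster the vocabulary, then for each document sum the residuals of its word vectors to their cluster centroids and concatenate the per-cluster sums (normalization steps commonly used with such encodings are omitted).

```python
import numpy as np
from sklearn.cluster import KMeans

def vlac_features(doc_word_vectors, kmeans):
    """For each concept (cluster), sum the residuals of the document's word
    embeddings with respect to the cluster centroid, then concatenate."""
    dim = kmeans.cluster_centers_.shape[1]
    out = np.zeros((kmeans.n_clusters, dim))
    for vec, c in zip(doc_word_vectors, kmeans.predict(doc_word_vectors)):
        out[c] += vec - kmeans.cluster_centers_[c]
    return out.ravel()                      # length = n_clusters * embedding_dim

rng = np.random.default_rng(0)
vocab_vectors = rng.normal(size=(500, 50))              # stand-in word embeddings
kmeans = KMeans(n_clusters=10, n_init=10, random_state=0).fit(vocab_vectors)

doc = vocab_vectors[rng.integers(0, 500, size=40)]      # a 40-token document
features = vlac_features(doc, kmeans)                   # 10 * 50 = 500 dimensions
```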
picture_as_pdf View full article Reproducible Research
Graph Signal Processing for Directed Graphs based on the Hermitian Laplacian (499)
Satoshi Furutani (NTT Secure Platform Laboratories, Tokyo), Toshiki Shibahara (NTT Secure Platform Laboratories, Tokyo), Mitsuaki Akiyama (NTT Secure Platform Laboratories, Tokyo), Kunio Hato (NTT Secure Platform Laboratories, Tokyo), Masaki Aida (Tokyo Metropolitan University)

Graph signal processing is a useful tool for representing, analyzing, and processing the signal lying on a graph, and has attracted attention in several fields including data mining and machine learning. A key to constructing graph signal processing is the graph Fourier transform, which is defined by using eigenvectors of the graph Laplacian of an undirected graph. The orthonormality of eigenvectors gives the graph Fourier transform algebraically desirable properties, and thus graph signal processing for undirected graphs has been well developed. However, since eigenvectors of the graph Laplacian of a directed graph are generally not orthonormal, it is difficult to simply extend graph signal processing to directed graphs. In this paper, we present a general framework for extending graph signal processing to directed graphs. To this end, we introduce the Hermitian Laplacian, which is a complex matrix obtained from an extension of the graph Laplacian. The Hermitian Laplacian is defined so as to preserve the edge directionality and the Hermitian property, and enables graph signal processing to be straightforwardly extended to directed graphs. Furthermore, the Hermitian Laplacian guarantees some desirable properties, such as non-negative real eigenvalues and the unitarity of the Fourier transform. Finally, experimental results for representation learning and signal denoising of/on directed graphs show the effectiveness of our framework.
picture_as_pdf View full article
Incorporating Dependencies in Spectral Kernels for Gaussian Processes (510)
Kai Chen (Shenzhen Institutes of Advanced Technology, Chinese Academy of Sciences; Shenzhen College of Advanced Technology, University of Chinese Academy of Sciences; Radboud University; Shenzhen Engineering Laboratory of Ocean Environmental Big Data Analysis and Application), Twan van Laarhoven (Radboud University; Open University of The Netherlands), Jinsong Chen (Shenzhen Institutes of Advanced Technology, Chinese Academy of Sciences; Shenzhen College of Advanced Technology, University of Chinese Academy of Sciences; Shenzhen Engineering Laboratory of Ocean Environmental Big Data Analysis and Application), Elena Marchiori (Radboud University)

Gaussian processes (GPs) are an elegant Bayesian approach to model an unknown function. The choice of the kernel characterizes one's assumption on how the unknown function autocovaries. It is a core aspect of a GP design, since the posterior distribution can significantly vary for different kernels. The spectral mixture (SM) kernel is derived by modelling a spectral density - the Fourier transform of a kernel - with a linear mixture of Gaussian components. As such, the SM kernel cannot model dependencies between components. In this paper we use cross convolution to model dependencies between components and derive a new kernel called Generalized Convolution Spectral Mixture (GCSM). Experimental analysis of GCSM on synthetic and real-life datasets indicates the benefit of modeling dependencies between components for reducing uncertainty and for improving performance in extrapolation tasks.
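For reference, the sketch below evaluates the standard one-dimensional spectral mixture kernel of Wilson and Adams, the starting point that GCSM generalizes; the cross-convolution terms that model dependencies between components are not reproduced, and the parameter values are arbitrary.

```python
import numpy as np

def spectral_mixture_kernel(tau, weights, means, variances):
    """Standard 1-D spectral mixture kernel: a weighted sum of
    Gaussian-times-cosine terms, one per independent spectral component."""
    tau = np.asarray(tau, dtype=float)
    k = np.zeros_like(tau)
    for w, mu, v in zip(weights, means, variances):
        k += w * np.exp(-2.0 * np.pi**2 * tau**2 * v) * np.cos(2.0 * np.pi * mu * tau)
    return k

tau = np.linspace(0.0, 5.0, 200)
k = spectral_mixture_kernel(tau, weights=[1.0, 0.5], means=[0.5, 1.5], variances=[0.05, 0.1])
print(k[:3])
```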
picture_as_pdf View full article Reproducible Research
Learning Aligned-Spatial Graph Convolutional Networks for Graph Classification (542)
Lu Bai (Central University of Finance and Economics, Beijing, China), Yuhang Jiao (Central University of Finance and Economics, Beijing, China), Lixin Cui (Central University of Finance and Economics, Beijing, China), Edwin R. Hancock (Central University of Finance and Economics, Beijing, China)

In this paper, we develop a novel Aligned-Spatial Graph Convolutional Network (ASGCN) model to learn effective features for graph classification. Our idea is to transform arbitrary-sized graphs into fixed-sized aligned grid structures, and define a new spatial graph convolution operation associated with the grid structures. We show that the proposed ASGCN model not only reduces the problems of information loss and imprecise information representation arising in existing spatially-based Graph Convolutional Network (GCN) models, but also bridges the theoretical gap between traditional Convolutional Neural Network (CNN) models and spatially-based GCN models. Moreover, the proposed ASGCN model can adaptively discriminate the importance between specified vertices during the process of spatial graph convolution, explaining the effectiveness of the proposed model. Experiments on standard graph datasets demonstrate the effectiveness of the proposed model.
picture_as_pdf View full article
Policy Prediction Network: Model-Free Behavior Policy with Model-Based Learning in Continuous Action Space (543)
Zac Wellmer (Hong Kong University of Science and Technology), James T. Kwok (Hong Kong University of Science and Technology)

This paper proposes a novel deep reinforcement learning architecture that was inspired by previous tree-structured architectures which were only usable in discrete action spaces. Policy Prediction Network offers a way to improve sample complexity and performance on continuous control problems in exchange for extra computation at training time, but at no cost in computation at rollout time. Our approach integrates a mix of model-free and model-based reinforcement learning. Policy Prediction Network is the first to introduce implicit model-based learning to Policy Gradient algorithms for continuous action spaces and is made possible via an empirically justified clipping scheme. Our experiments are focused on the MuJoCo environments so that they can be compared with similar work done in this area.
picture_as_pdf View full article
Beyond the Selected Completely At Random Assumption for Learning from Positive and Unlabeled Data (544)
Jessa Bekker (KU Leuven), Pieter Robberechts (KU Leuven), Jesse Davis (KU Leuven)

Most positive and unlabeled data is subject to selection biases. The labeled examples can, for example, be selected from the positive set because they are easier to obtain or more obviously positive. This paper investigates how learning can be enabled in this setting. We propose and theoretically analyze an empirical-risk-based method for incorporating the labeling mechanism. Additionally, we investigate under which assumptions learning is possible when the labeling mechanism is not fully understood and propose a practical method to enable this. Our empirical analysis supports the theoretical results and shows that taking into account the possibility of a selection bias, even when the labeling mechanism is unknown, improves the trained classifiers.
picture_as_pdf View full article Reproducible Research
Fast Gradient Boosting Decision Trees with Bit-Level Data Structures (557)
Laurens Devos (KU Leuven), Wannes Meert (KU Leuven), Jesse Davis (KU Leuven)

A gradient boosting decision tree model is a powerful machine learning method that iteratively constructs decision trees to form an additive ensemble model. The method uses the gradient of the loss function to improve the model at each iteration step. Inspired by the database literature, we exploit bitset and bitslice data structures in order to improve the runtime efficiency of learning the trees. We can use these structures in two ways. First, they can represent the input data itself. Second, they can store the discretized gradient values used by the learning algorithm to construct the trees in the boosting model. Using these bit-level data structures reduces the problem of finding the best split, which involves counting instances and summing gradient values, to counting one-bits in bit strings. Modern CPUs can efficiently count one-bits using AVX2 SIMD instructions. Empirically, our proposed improvements can result in speed-ups of 2 to 10 times on datasets with a large number of categorical features without sacrificing predictive performance.
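The "counting one-bits" reduction can be conveyed with a small pure-Python toy (the paper's implementation is C++ with AVX2 SIMD popcounts, and it also sums discretized gradients via bitslices, which is omitted here): pack boolean instance masks into integers, AND them, and count the set bits.

```python
def to_bitset(flags):
    """Pack a boolean vector into one Python integer (bit i = instance i)."""
    bits = 0
    for i, flag in enumerate(flags):
        if flag:
            bits |= 1 << i
    return bits

# Which instances reach the current tree node, and which instances fall into
# the candidate feature bin being evaluated for a split.
in_node = to_bitset([1, 1, 0, 1, 1, 0, 1, 1])
in_bin  = to_bitset([1, 0, 0, 1, 0, 0, 1, 0])

# Counting the instances that would go left for this split reduces to a
# bitwise AND followed by a population count.
left_count = bin(in_node & in_bin).count("1")
print(left_count)   # 3
```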
picture_as_pdf View full article Reproducible Research
Maximal Closed Set and Half-Space Separations in Finite Closure Systems (559)
Florian Seiffarth (University of Bonn), Tamás Horváth (University of Bonn; Fraunhofer IAIS, Sankt Augustin; Fraunhofer Center for Machine Learning, Sankt Augustin), Stefan Wrobel (University of Bonn; Fraunhofer IAIS, Sankt Augustin; Fraunhofer Center for Machine Learning, Sankt Augustin)

Motivated by various binary classification problems in structured data (e.g., graphs or other relational and algebraic structures), we investigate some algorithmic properties of closed set and half-space separation in abstract closure systems. Assuming that the underlying closure system is finite and given by the corresponding closure operator, we formulate some negative and positive complexity results for these two separation problems. In particular, we prove that deciding half-space separability in abstract closure systems is NP-complete in general. On the other hand, for the relaxed problem of maximal closed set separation we propose a simple greedy algorithm and show that it is efficient and has the best possible lower bound on the number of closure operator calls. As a second direction to overcome the negative result above, we consider Kakutani closure systems and show first that our greedy algorithm provides an algorithmic characterization of this kind of set systems. As one of the major potential application fields, we then focus on Kakutani closure systems over graphs and generalize a fundamental characterization result based on the Pasch axiom to graph structure partitioning of finite sets. Though the primary focus of this work is on the generality of the results obtained, we experimentally demonstrate the practical usefulness of our approach on vertex classification in different graph datasets.
picture_as_pdf View full article
Assessing the multi-labelness of multi-label data (562)
Laurence A. F. Park (Western Sydney University), Yi Guo (Western Sydney University), Jesse Read (École Polytechnique)

Before constructing a classifier, we should examine the data to gain an understanding of the relationships between the variables, to assist with the design of the classifier. Using multi-label data requires us to examine the association between labels: its multi-labelness. We cannot directly measure association between two labels, since the labels' relationships are confounded with the set of observation variables. A better approach is to fit an analytical model to a label with respect to the observations and remaining labels, but this might present false relationships due to the problem of multicollinearity between the observations and labels. In this article, we examine the utility of regularised logistic regression and a new form of split logistic regression for assessing the multi-labelness of data. We find that a split analytical model using regularisation is able to provide fewer label relationships when no relationships exist, or if the labels can be partitioned. We also find that if label relationships do exist, logistic regression with l_1 regularisation provides the better measurement of multi-labelness.
picture_as_pdf View full article
A Semi-discriminative Approach for Sub-sentence Level Topic Classification on a Small Dataset (566)
Cornelia Ferner (Salzburg University of Applied Sciences), Stefan Wegenkittl (Salzburg University of Applied Sciences)

This paper aims at identifying sequences of words related to specific product components in online product reviews. A reliable baseline performance for this topic classification problem is given by a Max Entropy classifier which assumes independence over subsequent topics. However, the reviews exhibit an inherent structure on the document level, allowing the task to be framed as a sequence classification problem. Since more flexible models from the class of Conditional Random Fields were not competitive because of the limited amount of training data available, we propose using a Hidden Markov Model instead and decouple the training of transition and emission probabilities. The discriminating power of the Max Entropy approach is used for the latter. Besides outperforming both standalone methods as well as more generic models such as linear-chain Conditional Random Fields, the combined classifier is able to assign topics on the sub-sentence level although labeling in the training data is only available on the sentence level.
picture_as_pdf View full article Reproducible Research
Black Box Explanation by Learning Image Exemplars in the Latent Feature Space (572)
Riccardo Guidotti (ISTI-CNR, Pisa), Anna Monreale (University of Pisa), Stan Matwin (Dalhousie University; Polish Academy of Sciences), Dino Pedreschi (University of Pisa)

We present an approach to explain the decisions of black box models for image classification. While using the black box to label images, our explanation method exploits the latent feature space learned through an adversarial autoencoder. The proposed method first generates exemplar images in the latent feature space and learns a decision tree classifier. Then, it selects and decodes exemplars respecting local decision rules. Finally, it visualizes them in a manner that shows to the user how the exemplars can be modified to either stay within their class, or to become counter-factuals by "morphing" into another class. Since we focus on black box decision systems for image classification, the explanation obtained from the exemplars also provides a saliency map highlighting the areas of the image that contribute to its classification, and areas of the image that push it into another class. We present the results of an experimental evaluation on three datasets and two black box models. Besides providing the most useful and interpretable explanations, we show that the proposed method outperforms existing explainers in terms of fidelity, relevance, coherence, and stability.
picture_as_pdf View full article Reproducible Research
Cost Sensitive Evaluation of Instance Hardness in Machine Learning (574)
Ricardo B. C. Prudêncio (Universidade Federal de Pernambuco)

Measuring the hardness of individual instances in machine learning contributes to a deeper analysis of learning performance. This work proposes instance hardness measures for binary classification in cost-sensitive scenarios. Here, cost curves are generated for each instance, defined as the loss observed for a pool of learning models for that instance along the range of cost proportions. Instance hardness is defined as the area under the cost curves and can be seen as an expected loss along cost proportions. Different cost curves are proposed by considering common decision threshold choice methods in the literature, thus providing alternative views of instance hardness.
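A minimal sketch of a per-instance cost curve and its area, assuming the score-driven threshold choice (threshold equal to the cost proportion) as one concrete instantiation; the paper considers several threshold choice methods, each yielding a different hardness view.

```python
import numpy as np

def instance_hardness(scores, y_true, n_grid=101):
    """Area under a per-instance cost curve: for each cost proportion c, a
    false positive costs c and a false negative costs 1 - c, averaged over a
    pool of model scores; the area over c in [0, 1] is the hardness."""
    scores = np.asarray(scores, dtype=float)
    grid = np.linspace(0.0, 1.0, n_grid)
    losses = []
    for c in grid:
        preds = (scores >= c).astype(int)        # score-driven threshold: t = c
        loss = np.where(preds > y_true, c, np.where(preds < y_true, 1.0 - c, 0.0))
        losses.append(loss.mean())               # average over the model pool
    return float(np.mean(losses))                # uniform grid on [0, 1] ~ area

print(instance_hardness([0.9, 0.8, 0.7], y_true=1))   # an easy positive
print(instance_hardness([0.4, 0.5, 0.3], y_true=1))   # a harder positive
```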
picture_as_pdf View full article
Meta-Learning for Black-box Optimization (576)
Vishnu TV (TCS Research, New Delhi), Pankaj Malhotra (TCS Research, New Delhi), Jyoti Narwariya (TCS Research, New Delhi), Lovekesh Vig (TCS Research, New Delhi), Gautam Shroff (TCS Research, New Delhi)

Recently, neural networks trained as optimizers under the "learning to learn" or meta-learning framework have been shown to be effective for a broad range of optimization tasks including derivative-free black-box function optimization. Recurrent neural networks (RNNs) trained to optimize a diverse set of synthetic non-convex differentiable functions via gradient descent have been effective at optimizing derivative-free black-box functions. In this work, we propose RNN-Opt: an approach for learning RNN-based optimizers for optimizing real-parameter single-objective continuous functions under limited budget constraints. Existing approaches utilize an observed-improvement-based meta-learning loss function for training such models. We propose training RNN-Opt by using synthetic non-convex functions with known (approximate) optimal values by directly using discounted regret as our meta-learning loss function. We hypothesize that a regret-based loss function mimics typical testing scenarios, and would therefore lead to better optimizers compared to optimizers trained only to propose queries that improve over previous queries. Further, RNN-Opt incorporates simple yet effective enhancements during training and inference procedures to deal with the following practical challenges: (i) unknown range of possible values for the black-box function to be optimized, and (ii) practical and domain-knowledge based constraints on the input parameters. We demonstrate the efficacy of RNN-Opt in comparison to existing methods on several synthetic as well as standard benchmark black-box functions along with an anonymized industrial constrained optimization problem.
picture_as_pdf View full article
Robust Anomaly Detection in Images using Adversarial Autoencoders (581)
Laura Beggel (Bosch Center for Artificial Intelligence, Renningen; Ludwig-Maximilians-University Munich), Michael Pfeiffer (Bosch Center for Artificial Intelligence, Renningen), Bernd Bischl (Ludwig-Maximilians-University Munich)

Reliably detecting anomalies in a given set of images is a task of high practical relevance for visual quality inspection, surveillance, or medical image analysis. Autoencoder neural networks learn to reconstruct normal images, and hence can classify those images as anomalies where the reconstruction error exceeds some threshold. Here we analyze a fundamental problem of this approach when the training set is contaminated with a small fraction of outliers. We find that continued training of autoencoders inevitably reduces the reconstruction error of outliers, and hence degrades the anomaly detection performance. In order to counteract this effect, an adversarial autoencoder architecture is adapted, which imposes a prior distribution on the latent representation, typically placing anomalies into low-likelihood regions. Utilizing the likelihood model, potential anomalies can be identified and rejected already during training, which results in an anomaly detector that is significantly more robust to the presence of outliers during training.
picture_as_pdf View full article
Attentive Multi-Task Deep Reinforcement Learning (582)
Timo Bräm (ETH Zurich), Gino Brunner (ETH Zurich)

Sharing knowledge between tasks is vital for efficient learning in a multi-task setting. However, most research so far has focused on the easier case where knowledge transfer is not harmful, i.e., where knowledge from one task cannot negatively impact the performance on another task. In contrast, we present an approach to multi-task deep reinforcement learning based on attention that does not require any a-priori assumptions about the relationships between tasks. Our attention network automatically groups task knowledge into sub-networks on a state-level granularity. It thereby achieves positive knowledge transfer if possible, and avoids negative transfer in cases where tasks interfere. We test our algorithm against two state-of-the-art multi-task/transfer learning approaches and show comparable or superior performance while requiring fewer network parameters.
picture_as_pdf View full article Reproducible Research
Non-parametric Bayesian Isotonic Calibration: Fighting Over-confidence in Binary Classification (587)
Mari-Liis Allikivi (University of Tartu), Meelis Kull (University of Tartu)

Classifiers can often output a score or a probability indicating how sure they are about the predicted class. Classifier calibration methods can map these into calibrated class probabilities, supporting cost-optimal decision making. Isotonic calibration is the standard non-parametric calibration method for binary classifiers, and it can be shown to yield the most likely monotonic calibration map on the given data, where monotonicity means that instances with higher predicted scores are more likely to be positive. Another non-parametric method is ENIR (ensemble of near-isotonic regression models), which allows for some non-monotonicity but adds a penalty for it. We first demonstrate that these two methods tend to be over-confident and show that applying label smoothing improves calibration of both methods in more than 90% of studied cases. Unfortunately, label smoothing also reduces confidence on the under-confident predictions, and it does not reduce the raggedness of isotonic calibration. As the main contribution, we propose a non-parametric Bayesian isotonic calibration method which has the flexibility of isotonic calibration to fit maps of all monotonic shapes but adds smoothness and reduces over-confidence without requiring label smoothing. The method introduces a prior over piecewise linear monotonic calibration maps and uses a simple Monte Carlo sampling based approach to approximate the posterior mean calibration map. Our experiments demonstrate that on average the proposed method results in better calibrated probabilities than the state-of-the-art calibration methods, including isotonic calibration and ENIR.
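The two baseline ingredients discussed above, isotonic calibration and label smoothing, can be sketched with scikit-learn on an assumed toy calibration set; the proposed Bayesian method itself is not reproduced, and the smoothing strength below is an arbitrary choice.

```python
import numpy as np
from sklearn.isotonic import IsotonicRegression

rng = np.random.default_rng(0)
scores = rng.random(200)
labels = (rng.random(200) < scores**2).astype(float)   # scores tend to be over-confident

# Standard isotonic calibration on (score, label) pairs.
iso = IsotonicRegression(y_min=0.0, y_max=1.0, out_of_bounds="clip").fit(scores, labels)

# The same fit after label smoothing (epsilon is an assumed value).
eps = 0.05
smoothed = labels * (1.0 - 2.0 * eps) + eps
iso_smooth = IsotonicRegression(y_min=0.0, y_max=1.0, out_of_bounds="clip").fit(scores, smoothed)

print(iso.predict([0.95, 0.99]), iso_smooth.predict([0.95, 0.99]))
```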
picture_as_pdf View full article Reproducible Research
A Stochastic Quasi-Newton Method with Nesterov's Accelerated Gradient (594)
S. Indrapriyadarsini (Shizuoka University), Shahrzad Mahboubi (Shonan Institute of Technology), Hiroshi Ninomiya (Shonan Institute of Technology), Hideki Asai (Shizuoka University)

Incorporating second-order curvature information into gradient-based methods has been shown to improve convergence drastically despite its computational intensity. In this paper, we propose a stochastic (online) quasi-Newton method with Nesterov's accelerated gradient in both its full and limited memory forms for solving large-scale non-convex optimization problems in neural networks. The performance of the proposed algorithm is evaluated in Tensorflow on benchmark classification and regression problems. The results show improved performance compared to the classical second-order oBFGS and oLBFGS methods and popular first-order stochastic methods such as SGD and Adam. The performance with different momentum rates and batch sizes has also been illustrated.
picture_as_pdf View full article
Shrinkage Estimators for Uplift Regression (595)
Krzysztof Rudaś (Warsaw University of Technology; Institute of Computer Science, Polish Academy of Sciences), Szymon Jaroszewicz (Institute of Computer Science, Polish Academy of Sciences)

Uplift modeling is an approach to machine learning which allows for predicting the net effect of an action (with respect to not taking the action). To achieve this, the training population is divided into two parts: the treatment group, which is subjected to the action, and the control group, on which the action is not taken. Our task is to construct a model which will predict the difference between outcomes in the treatment and control groups conditional on individual objects' features. When the group assignment is random, the model admits a causal interpretation. When we assume linear responses in both groups, the simplest way of estimating the net effect of the action on an individual is to build two separate linear ordinary least squares (OLS) regressions on the treatment and control groups and compute the difference between their predictions. In classical linear models, improvements in accuracy can be achieved through the use of so-called shrinkage estimators such as the well-known James-Stein estimator, which has a provably lower mean squared error than the OLS estimator. In this paper we investigate the use of shrinkage estimators in the uplift modeling problem. Unfortunately, direct generalization of the James-Stein estimator does not lead to improved predictions, nor does shrinking treatment and control models separately. Therefore, we propose a new uplift shrinkage method where estimators in the treatment and control groups are shrunk jointly so as to minimize the error in the predicted net effect of the action. We prove that the proposed estimator does indeed improve on the double regression estimator.
picture_as_pdf View full article
Training Discrete-Valued Neural Networks with Sign Activations Using Weight Distributions (596)
Wolfgang Roth (Graz University of Technology), Günther Schindler (Ruprecht Karls University, Heidelberg), Holger Fröning (Ruprecht Karls University, Heidelberg), Franz Pernkopf (Graz University of Technology)

Since resource-constrained devices hardly benefit from the trend towards ever-increasing neural network (NN) structures, there is growing interest in designing more hardware-friendly NNs. In this paper, we consider the training of NNs with discrete-valued weights and sign activation functions that can be implemented more efficiently in terms of inference speed, memory requirements, and power consumption. We build on the framework of probabilistic forward propagations using the local reparameterization trick, where instead of training a single set of NN weights we rather train a distribution over these weights. Using this approach, we can perform gradient-based learning by optimizing the continuous distribution parameters over discrete weights while at the same time performing backpropagation through the sign activation. In our experiments, we investigate the influence of the number of weights on the classification performance on several benchmark datasets, and we show that our method achieves state-of-the-art performance.
picture_as_pdf View full article Reproducible Research
node2bits: Compact Time- and Attribute-aware Node Representations for User Stitching (597)
Di Jin (University of Michigan), Mark Heimann (University of Michigan), Ryan A. Rossi (Adobe Research), Danai Koutra (University of Michigan)

Identity stitching, the task of identifying and matching various online references (e.g., sessions over different devices and timespans) to the same user in real-world web services, is crucial for personalization and recommendations. However, traditional user stitching approaches, such as grouping or blocking, require quadratic pairwise comparisons between a massive number of user activities, thus posing both computational and storage challenges. Recent works, which are often application-specific, heuristically seek to reduce the amount of comparisons, but they suffer from low precision and recall. To solve the problem in an application-independent way, we take a heterogeneous network-based approach in which users (nodes) interact with content (e.g., sessions, websites), and may have attributes (e.g., location). We propose node2bits, an efficient framework that represents multi-dimensional features of node contexts with binary hashcodes. node2bits leverages feature-based temporal walks to encapsulate short- and long-term interactions between nodes in heterogeneous web networks, and adopts SimHash to obtain compact, binary representations and avoid the quadratic complexity for similarity search. Extensive experiments on large-scale real networks show that node2bits outperforms traditional techniques and existing works that generate real-valued embeddings by up to 5.16
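A generic SimHash sketch (on arbitrary real-valued feature vectors, not the temporal-walk features of node2bits) shows how the binary codes keep similar inputs at small Hamming distance while allowing constant-size storage.

```python
import numpy as np

def simhash(feature_vector, hyperplanes):
    """Project a real-valued feature vector onto random hyperplanes and keep
    only the signs, giving a compact binary code whose Hamming distance
    approximates the angular distance between inputs."""
    return (hyperplanes @ feature_vector > 0).astype(np.uint8)

rng = np.random.default_rng(0)
dim, n_bits = 64, 16
hyperplanes = rng.normal(size=(n_bits, dim))

u = rng.normal(size=dim)
v = u + 0.1 * rng.normal(size=dim)      # a near-duplicate of u
w = rng.normal(size=dim)                # an unrelated vector

print(np.sum(simhash(u, hyperplanes) != simhash(v, hyperplanes)))   # small Hamming distance
print(np.sum(simhash(u, hyperplanes) != simhash(w, hyperplanes)))   # typically much larger
```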
picture_as_pdf View full article Reproducible Research
A Soft Affiliation Graph Model for Scalable Overlapping Community Detection (618)
Nishma Laitonjam (Insight Centre for Data Analytics, University College Dublin), Wěipéng Huáng (Insight Centre for Data Analytics, University College Dublin), Neil J. Hurley (Insight Centre for Data Analytics, University College Dublin)

We propose an overlapping community model based on the Affiliation Graph Model (AGM), which exhibits the pluralistic homophily property that the probability of a link between nodes increases with the number of shared communities. We take inspiration from the Mixed Membership Stochastic Blockmodel (MMSB) in proposing an edgewise community affiliation. This allows decoupling of community affiliations between nodes, opening the way to scalable inference. We show that our model corresponds to an AGM with soft community affiliations and develop a scalable algorithm based on a Stochastic Gradient Riemannian Langevin Dynamics (SGRLD) sampler. Empirical results show that the model can scale to network sizes that are beyond the capabilities of MCMC samplers of the standard AGM. We achieve comparable performance in terms of accuracy and run-time efficiency to scalable MMSB samplers.
picture_as_pdf View full article Reproducible Research
Synthetic Oversampling of Multi-Label Data based on Local Label Distribution (624)
Bin Liu (Aristotle University of Thessaloniki), Grigorios Tsoumakas (Aristotle University of Thessaloniki)

Class-imbalance is an inherent characteristic of multi-label data which affects the prediction accuracy of most multi-label learning methods. One efficient strategy to deal with this problem is to employ resampling techniques before training the classifier. Existing multi-label sampling methods alleviate the (global) imbalance of multi-label datasets. However, performance degradation is mainly due to rare sub-concepts and overlapping of classes that could be analysed by looking at the local characteristics of the minority examples, rather than the imbalance of the whole dataset. We propose a new method for synthetic oversampling of multi-label data that focuses on local label distribution to generate more diverse and better labeled instances. Experimental results on 13 multi-label datasets demonstrate the effectiveness of the proposed approach in a variety of evaluation measures, particularly in the case of an ensemble of classifiers trained on repeated samples of the original data.
picture_as_pdf View full article Reproducible Research
Trade-offs in Large-Scale Distributed Tuplewise Estimation and Learning (628)
Robin Vogel (Telecom Paris, LTCI, Institut Polytechnique de Paris; IDEMIA), Aurélien Bellet (INRIA), Stephan Clémençon (Telecom Paris, LTCI, Institut Polytechnique de Paris), Ons Jelassi (Telecom Paris, LTCI, Institut Polytechnique de Paris), Guillaume Papa (Telecom Paris, LTCI, Institut Polytechnique de Paris)

The development of cluster computing frameworks has allowed practitioners to scale out various statistical estimation and machine learning algorithms with minimal programming effort. This is especially true for machine learning problems whose objective function is nicely separable across individual data points, such as classification and regression. In contrast, statistical learning tasks involving pairs (or more generally tuples) of data points - such as metric learning, clustering or ranking - do not lend themselves as easily to data-parallelism and in-memory computing. In this paper, we investigate how to balance between statistical performance and computational efficiency in such distributed tuplewise statistical problems. We first propose a simple strategy based on occasionally repartitioning data across workers between parallel computation stages, where the number of repartitioning steps rules the trade-off between accuracy and runtime. We then present some theoretical results highlighting the benefits brought by the proposed method in terms of variance reduction, and extend our results to design distributed stochastic gradient descent algorithms for tuplewise empirical risk minimization. Our results are supported by numerical experiments in pairwise statistical estimation and learning on synthetic and real-world datasets.
picture_as_pdf View full article Reproducible Research
Fast likelihood-based change point detection (634)
Nikolaj Tatti (University of Helsinki)

Change point detection plays a fundamental role in many real-world applications, where the goal is to analyze and monitor the behaviour of a data stream. In this paper, we study change detection in binary streams. To this end, we use a likelihood ratio between two models as a measure for indicating change. The first model is a single Bernoulli variable, while the second model divides the stored data into two segments and models each segment with its own Bernoulli variable. Finding the optimal split can be done in O(n) time, where n is the number of entries since the last change point. This is too expensive for large n. To combat this, we propose an approximation scheme that yields a (1 - ϵ) approximation in O(ϵ^-1 log^2 n) time. The speed-up consists of several steps: First we reduce the number of possible candidates by adopting a known result from segmentation problems. We then show that for fixed Bernoulli parameters we can find the optimal change point in logarithmic time. Finally, we show how to construct a candidate list of size O(ϵ^-1 log n) for the model parameters. We demonstrate empirically the approximation quality and the running time of our algorithm, showing that we can gain a significant speed-up with a minimal average loss in optimality.
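The exact quantity that the approximation scheme accelerates can be sketched directly: an O(n) scan over split points comparing a two-segment Bernoulli model against a single Bernoulli model via their log-likelihood ratio (the toy stream below is assumed).

```python
import numpy as np

def bernoulli_loglik(ones, n):
    """Maximum log-likelihood of n Bernoulli observations with `ones` successes."""
    if ones == 0 or ones == n:
        return 0.0
    p = ones / n
    return ones * np.log(p) + (n - ones) * np.log(1.0 - p)

def best_split_llr(stream):
    """Exact O(n) scan: log-likelihood ratio of the best two-segment model
    against the single-segment model over the stored binary entries."""
    x = np.asarray(stream, dtype=int)
    n, total = len(x), int(x.sum())
    single = bernoulli_loglik(total, n)
    prefix = np.cumsum(x)
    best = -np.inf
    for i in range(1, n):                          # split between entries i - 1 and i
        left = bernoulli_loglik(int(prefix[i - 1]), i)
        right = bernoulli_loglik(total - int(prefix[i - 1]), n - i)
        best = max(best, left + right - single)
    return best

stream = [0, 0, 1, 0, 0, 0, 1, 1, 1, 1, 1, 0, 1]   # the rate changes roughly halfway
print(best_split_llr(stream))
```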
picture_as_pdf View full article Reproducible Research
Hyper-Parameter-Free Generative Modelling with Deep Boltzmann Trees (637)
Nico Piatkowski (TU Dortmund)

Deep neural networks achieve state-of-the-art results in various classification and synthetic data generation tasks. However, little is known about why depth improves a model. We investigate the structure of stochastic deep neural networks, also known as Deep Boltzmann Machines, to shed some light on this issue. While the best known results postulate an exponential dependence between the number of visible units and the depth of the model, we show that the required depth is upper bounded by the longest path in the underlying junction tree, which is at most linear in the number of visible units. Moreover, we show that the conditional independence structure of any categorical Deep Boltzmann Machine contains a sub-tree that allows the consistent estimation of the full joint probability mass function of all visible units. We connect our results to l_1-regularized maximum-likelihood estimation and Chow-Liu trees. Based on our theoretical findings, we present a new tractable version of Deep Boltzmann Machines, namely the Deep Boltzmann Tree (DBT). We provide a hyper-parameter-free algorithm for learning the DBT from data, and propose a new initialization method to enforce convergence to good solutions. Our findings provide some theoretical evidence for why a deep model might be beneficial. Experimental results on benchmark data show that the DBT is a theoretically sound alternative to likelihood-free generative models.
picture_as_pdf View full article
Deep convolutional Gaussian processes (645)
Kenneth Blomqvist (Aalto University; Helsinki Institute for Information Technology HIIT), Samuel Kaski (Aalto University; Helsinki Institute for Information Technology HIIT), Markus Heinonen (Aalto University; Helsinki Institute for Information Technology HIIT)

We propose deep convolutional Gaussian processes, a deep Gaussian process architecture with convolutional structure. The model is a principled Bayesian framework for detecting hierarchical combinations of local features for image classification. We demonstrate greatly improved image classification performance compared to current convolutional Gaussian process approaches on the MNIST and CIFAR-10 datasets. In particular, we improve state-of-the-art CIFAR-10 accuracy by over 10 percentage points.
picture_as_pdf View full article Reproducible Research
Stochastic One-Sided Full-Information Bandit (646)
Haoyu Zhao (Tsinghua University), Wei Chen (Microsoft Research, Beijing)

In this paper, we study the stochastic version of the one-sided full-information bandit problem, where we have K arms [K] = {1, 2, ..., K}, and playing arm i gains a reward drawn from an unknown distribution for arm i while obtaining reward feedback for all arms j > i. The one-sided full-information bandit can model online repeated second-price auctions, where the auctioneer selects the reserve price in each round and the bidders only reveal their bids when their bids are higher than the reserve price. In this paper, we present an elimination-based algorithm to solve the problem. Our elimination-based algorithm achieves the distribution-independent regret upper bound O(√(T·log(TK))), and the distribution-dependent bound O((log T + K)f(Δ)), where T is the time horizon, Δ is a vector of gaps between the mean reward of each arm and the mean reward of the best arm, and f(Δ) is a formula depending on the gap vector that we specify in detail. Our algorithm has the best theoretical regret upper bound so far. We also validate our algorithm empirically against other possible alternatives.
picture_as_pdf View full article
Sets of Robust Rules, and How to Find Them (650)
Jonas Fischer (Max Planck Institute for Informatics; Saarland University), Jilles Vreeken (CISPA Helmholtz Center for Information Security)

Association rules are among the most important concepts in data mining. Rules of the form X -> Y are simple to understand and simple to act upon, yet can model important local dependencies in data. The problem, however, is that there are so many of them. Both traditional and state-of-the-art frameworks typically yield millions of rules, rather than identifying a small set of rules that capture the most important dependencies of the data. In this paper, we define the problem of association rule mining in terms of the Minimum Description Length principle. That is, we identify the best set of rules as the one that most succinctly describes the data. We show that the resulting optimization problem does not lend itself to exact search, and hence propose Grab, a greedy heuristic to efficiently discover good sets of noise-resistant rules directly from data. Through extensive experiments we show that, unlike the state of the art, Grab reliably recovers the ground truth. On real-world data we show that it finds reasonable numbers of rules that, upon close inspection, give clear insight into the local distribution of the data.
picture_as_pdf View full article Reproducible Research
A Drift-based Dynamic Ensemble Members Selection using Clustering for Time Series Forecasting (662)
Amal Saadallah (TU Dortmund), Florian Priebe (TU Dortmund), Katharina Morik (TU Dortmund)

The complex and evolving nature of time series structure makes forecasting one of the most important and challenging tasks in time series analysis. Typical methods for forecasting are designed to model time-evolving dependencies between data observations. However, it is generally accepted that none of them is universally valid for every application. Therefore, methods for learning heterogeneous ensembles by combining a diverse set of forecasts appear as a promising solution to tackle this task. Hitherto, in the classical ML literature, ensemble techniques such as stacking, cascading and voting are mostly restricted to operate in a static manner. To deal with changes in the relative performance of models as well as changes in the data distribution, we propose a drift-aware meta-learning approach for adaptively selecting and combining forecasting models. Our assumption is that different forecasting models have different areas of expertise and a varying relative performance. Our method ensures dynamic selection of initial ensemble base model candidates through a performance drift detection mechanism. Since diversity is a fundamental component in ensemble methods, we propose a second-stage selection with clustering that is computed after each drift detection. Predictions of the final selected models are combined into a single prediction. An exhaustive empirical evaluation of the method was performed, assessing both the generalization error and the scalability of the approach using time series from several real-world domains. Empirical results show the competitiveness of the method in comparison to state-of-the-art approaches for combining forecasters.
picture_as_pdf View full article Reproducible Research
Node Classification for Signed Social Networks Using Diffuse Interface Methods (665)
Pedro Mercado (University of Tübingen), Jessica Bosch (University of British Columbia), Martin Stoll (Technische Universität Chemnitz)

Signed networks contain both positive and negative kinds of interactions like friendship and enmity. The task of node classification in non-signed graphs has proven to be beneficial in many real world applications, yet extensions to signed networks remain largely unexplored. In this paper we introduce the first analysis of node classification in signed social networks via diffuse interface methods based on the Ginzburg-Landau functional together with different extensions of the graph Laplacian to signed networks. We show that blending the information from both positive and negative interactions leads to performance improvement in real signed social networks, consistently outperforming the current state of the art.
picture_as_pdf View full article Reproducible Research
Thu, 14:00 - 14:20 @ 0.001 (Poster@Thu) Clustering
Holistic Assessment of Structure Discovery Capabilities of Clustering Algorithms (678)
Frank Höppner (Ostfalia University of Applied Sciences), Maximilian Jahnke (Ostfalia University of Applied Sciences)

Existing cluster validity indices often possess a similar bias as the clustering algorithm they were introduced for, e.g. to determine the optimal number of clusters. We suggest an efficient and holistic assessment of the structure discovery capabilities of clustering algorithms based on three criteria. We determine the robustness or stability of cluster assignments and interpret it as the confidence of the clustering algorithm in its result. This information is then used to label the data and evaluate the consistency of the stability-assessment with the notion of a cluster as an area of dense and separated data. The resulting criteria of stability, structure and consistency provide interpretable means to judge the capabilities of clustering algorithms without the typical biases of prominent indices, including the judgment of a clustering tendency.
picture_as_pdf View full article Reproducible Research
Pattern-Based Anomaly Detection in Mixed-Type Time Series (681)
Len Feremans (University of Antwerp), Vincent Vercruyssen (KU Leuven), Boris Cule (University of Antwerp), Wannes Meert (KU Leuven), Bart Goethals (University of Antwerp; Monash University)

The present-day accessibility of technology enables easy logging of both sensor values and event logs over extended periods. In this context, detecting abnormal segments in time series data has become an important data mining task. Existing work on anomaly detection focuses either on continuous time series or discrete event logs and not on the combination. However, in many practical applications, the patterns extracted from the event log can reveal contextual and operational conditions of a device that must be taken into account when predicting anomalies in the continuous time series. This paper proposes an anomaly detection method that can handle mixed-type time series. The method leverages frequent pattern mining techniques to construct an embedding of mixed-type time series on which an isolation forest is trained. Experiments on several real-world univariate and multivariate time series, as well as a synthetic mixed-type time series, show that our anomaly detection algorithm outperforms state-of-the-art anomaly detection techniques such as MatrixProfile, Pav, Mifpod and Fpof.
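The abstract does not give implementation details, but the general recipe it describes (embed windows of the mixed-type series via pattern occurrences, then train an isolation forest) can be sketched as follows; the window length, the toy pattern set and the feature choices are illustrative assumptions, not the paper's method.

```python
import numpy as np
from sklearn.ensemble import IsolationForest

def embed_windows(values, events, patterns, win=50):
    """Embed each window with simple statistics of the continuous signal
    plus occurrence counts of (hypothetical) event patterns."""
    feats = []
    for start in range(0, len(values) - win + 1, win):
        v = values[start:start + win]
        e = events[start:start + win]
        counts = [sum(1 for t in range(win) if list(e[t:t + len(p)]) == list(p)) for p in patterns]
        feats.append([v.mean(), v.std(), v.min(), v.max()] + counts)
    return np.array(feats)

rng = np.random.default_rng(0)
values = rng.normal(size=5000)                       # continuous sensor channel
events = rng.choice(["a", "b", "c"], size=5000)      # discrete event channel
patterns = [("a", "b"), ("c", "c")]                  # toy stand-ins for mined frequent patterns

X = embed_windows(values, events, patterns)
scores = IsolationForest(random_state=0).fit(X).decision_function(X)  # lower = more anomalous
print(scores[:5])
```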
picture_as_pdf View full article Reproducible Research
Sobolev Training with Approximated Derivatives for Black-Box Function Regression with Neural Networks (725)
Matthias Kissel (Technical University of Munich), Klaus Diepold (Technical University of Munich)

With Sobolev Training, neural networks are trained to fit target output values as well as target derivatives with respect to the inputs. This leads to better generalization and fewer required training examples for certain problems. In this paper, we present a training pipeline that enables Sobolev Training for regression problems where target derivatives are not directly available. Thus, we propose to use a least-squares estimate of the target derivatives based on function values of neighboring training samples. We show for a variety of black-box function regression tasks that our training pipeline achieves smaller test errors compared to the traditional training method. Since our method has no additional requirements on the data collection process, it has great potential to improve the results for various regression tasks.
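A minimal sketch of the kind of least-squares derivative estimate described above (my own toy construction, not the authors' pipeline): fit a local linear model to the k nearest training samples of a point and take its slope as the approximate gradient used for Sobolev-style training.

```python
import numpy as np

def approx_gradients(X, y, k=10):
    """Estimate dy/dx at every training point from its k nearest neighbours
    via a local least-squares linear fit."""
    grads = np.zeros_like(X)
    for i in range(len(X)):
        d = np.linalg.norm(X - X[i], axis=1)
        nbrs = np.argsort(d)[1:k + 1]                      # skip the point itself
        A = np.hstack([X[nbrs] - X[i], np.ones((k, 1))])   # columns: [dx, 1]
        coef, *_ = np.linalg.lstsq(A, y[nbrs] - y[i], rcond=None)
        grads[i] = coef[:-1]                               # slope part = gradient estimate
    return grads

# Toy check on f(x) = sin(x0) + x1^2, whose true gradient is (cos(x0), 2*x1).
rng = np.random.default_rng(0)
X = rng.uniform(-2, 2, size=(500, 2))
y = np.sin(X[:, 0]) + X[:, 1] ** 2
print(approx_gradients(X, y)[:3])
print(np.stack([np.cos(X[:3, 0]), 2 * X[:3, 1]], axis=1))  # true gradients for comparison
```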
picture_as_pdf View full article Reproducible Research
Tue, 10:00 - 10:30 @ 0.004 (AOK-HS) (Poster@Wed) Best ML Paper
Agnostic feature selection (744)
Guillaume Doquet (TAU CNRS - INRIA - LRI - Université Paris-Saclay), Michèle Sebag (TAU CNRS - INRIA - LRI - Université Paris-Saclay)

Unsupervised feature selection is mostly assessed in a supervised learning setting, depending on whether the selected features efficiently permit predicting the (unknown) target variable. Another setting is proposed in this paper: the selected features aim to efficiently recover the whole dataset. The proposed algorithm, called AgnoS, combines an AutoEncoder with structural regularizations to sidestep the combinatorial optimization problem at the core of feature selection. The extensive experimental validation of AgnoS on the scikit-feature benchmark suite demonstrates its ability to compete with the state of the art, both in terms of supervised learning and data compression.
picture_as_pdf View full article
BelMan: An Information-Geometric Approach to Stochastic Bandits (783)
Debabrota Basu (Chalmers University of Technology), Pierre Senellart (DI ENS, ENS, CNRS, PSL University; INRIA), Stéphane Bressan (National University of Singapore)

We propose a Bayesian information-geometric approach to the exploration-exploitation trade-off in stochastic multi-armed bandits. The uncertainty on reward generation and belief is represented using the manifold of joint distributions of rewards and beliefs. Accumulated information is summarised by the barycentre of joint distributions, the pseudobelief-reward. While the pseudobelief-reward facilitates information accumulation through exploration, another mechanism is needed to increase exploitation by gradually focusing on higher rewards, the pseudobelief-focal-reward. Our resulting algorithm, BelMan, alternates between projection of the pseudobelief-focal-reward onto belief-reward distributions to choose the arm to play, and projection of the updated belief-reward distributions onto the pseudobelief-focal-reward. We theoretically prove BelMan to be asymptotically optimal and to incur a sublinear regret growth. We instantiate BelMan to stochastic bandits with Bernoulli and exponential rewards, and to a real-life application of scheduling queueing bandits. Comparative evaluation with the state of the art shows that BelMan is not only competitive for Bernoulli bandits but in many cases also outperforms other approaches for exponential and queueing bandits.
picture_as_pdf View full article Reproducible Research
L_0-ARM: Network Sparsification via Stochastic Binary Optimization (785)
Yang Li (Georgia State University), Shihao Ji (Georgia State University)

We consider network sparsification as an L_0-norm regularized binary optimization problem, where each unit of a neural network (e.g., weight, neuron, or channel) is attached with a stochastic binary gate, whose parameters are jointly optimized with the original network parameters. The Augment-Reinforce-Merge (ARM) estimator, a recently proposed unbiased gradient estimator, is investigated for this binary optimization problem. Compared to the hard concrete gradient estimator of Louizos et al., ARM demonstrates superior performance in pruning network architectures while retaining almost the same accuracy as the baseline methods. Similar to the hard concrete estimator, ARM also enables conditional computation during model training, but with improved effectiveness due to the exact binary stochasticity. Thanks to the flexibility of ARM, many smooth or non-smooth parametric functions, such as the scaled sigmoid or hard sigmoid, can be used to parameterize this binary optimization problem while the unbiasedness of the ARM estimator is retained, whereas the hard concrete estimator has to rely on the hard sigmoid function to achieve conditional computation and thus accelerated training. Extensive experiments on multiple public datasets demonstrate state-of-the-art pruning rates with almost the same accuracy as the baseline methods. The resulting algorithm, L_0-ARM, sparsifies the Wide-ResNet models on CIFAR-10 and CIFAR-100 while the hard concrete estimator cannot.
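For readers unfamiliar with ARM, the following toy numpy check illustrates the Augment-Reinforce-Merge gradient estimator for a single Bernoulli gate, as it appears in the ARM literature (Yin and Zhou); the toy objective f is my own choice, and the snippet only verifies empirically that the Monte Carlo estimate matches the analytic gradient.

```python
import numpy as np

sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

def arm_gradient(f, phi, n_samples=200_000, seed=0):
    """ARM estimate of d/dphi E_{b ~ Bernoulli(sigmoid(phi))}[f(b)]."""
    u = np.random.default_rng(seed).uniform(size=n_samples)
    b1 = (u > sigmoid(-phi)).astype(float)   # "augmented" arm
    b2 = (u < sigmoid(phi)).astype(float)    # antithetic arm
    return np.mean((f(b1) - f(b2)) * (u - 0.5))

f = lambda b: (b - 0.3) ** 2                 # toy objective on the binary gate
phi = 0.7
p = sigmoid(phi)
analytic = p * (1 - p) * (f(1.0) - f(0.0))   # d/dphi [p*f(1) + (1-p)*f(0)]
print(arm_gradient(f, phi), analytic)        # the two values should be close
```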
picture_as_pdf View full article
Learning with Random Learning Rates (805)
Léonard Blier (TAU, LRI, Inria, Université Paris Sud; Facebook AI Research), Pierre Wolinski (TAU, LRI, Inria, Université Paris Sud), Yann Ollivier (Facebook AI Research)

In neural network optimization, the learning rate of the gradient descent strongly affects performance. This prevents reliable out-of-the-box training of a model on a new problem. We propose the All Learning Rates At Once (Alrao) algorithm for deep learning architectures: each neuron or unit in the network gets its own learning rate, randomly sampled at startup from a distribution spanning several orders of magnitude. The network becomes a mixture of slow and fast learning units. Surprisingly, Alrao performs close to SGD with an optimally tuned learning rate, for various tasks and network architectures. In our experiments, all Alrao runs were able to learn well without any tuning.
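As a rough sketch of the core idea (mine, not the authors' code), one can sample a per-unit learning rate log-uniformly over several orders of magnitude once at startup and reuse it in every SGD step; the range and the plain-numpy update below are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
n_units, n_inputs = 64, 32

# One learning rate per output unit, sampled log-uniformly in [1e-5, 1e1] at startup.
unit_lr = 10.0 ** rng.uniform(-5, 1, size=(n_units, 1))

W = rng.normal(scale=0.1, size=(n_units, n_inputs))

def sgd_step(W, grad_W):
    """Plain SGD update where each row (unit) of W uses its own fixed learning rate."""
    return W - unit_lr * grad_W

grad_W = rng.normal(size=W.shape)        # stand-in for a backprop gradient
W = sgd_step(W, grad_W)
print(unit_lr[:5].ravel())               # the sampled per-unit learning rates
```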
picture_as_pdf View full article Reproducible Research
Triangle Completion Time Prediction using Time-conserving Embedding (806)
Vachik S. Dave (Indiana University Purdue University Indianapolis), Mohammad Al Hasan (Indiana University Purdue University Indianapolis)

A triangle is an important building block of social networks, so the study of triangle formation in a network is critical for a better understanding of the dynamics of such networks. Existing works in this area mainly focus on triangle counting, or on generating synthetic networks by matching the prevalence of triangles in real-life networks. While these efforts increase our understanding of the triangle's role in a network, they have limited practical utility. In this work we undertake an interesting problem relating to triangle formation in a network, namely to predict the time by which the third link of a triangle appears in a network. Since the third link completes a triangle, we name this task Triangle Completion Time Prediction (TCTP). A solution to the TCTP problem is valuable for real-life link recommendation in social/e-commerce networks, and it also provides vital information for dynamic network analysis and community generation studies. An efficient and robust framework (GraNiTE) is proposed for solving the TCTP problem. GraNiTE uses a neural-network-based approach for learning a representation vector of a triangle-completing edge, which is a concatenation of two representation vectors: the first one is learnt from the graphlet-based local topology around that edge and the second one is learnt from a time-preserving embedding of the constituting vertices of that edge. A comparison of the proposed solution with several baseline methods shows that the mean absolute error (MAE) of the proposed method is at least one-fourth of that of the best baseline method.
picture_as_pdf View full article Reproducible Research
Thu, 14:20 - 14:40 @ 0.001 (Poster@Thu) Clustering
k is the Magic Number - Inferring the Number of Clusters Through Nonparametric Concentration Inequalities (811)
Sibylle Hess (Technische Universiteit Eindhoven), Wouter Duivesteijn (Technische Universiteit Eindhoven)

Most convex and nonconvex clustering algorithms come with one crucial parameter: the k in k-means. To this day, there is not one generally accepted way to accurately determine this parameter. Popular methods are simple yet theoretically unfounded, such as searching for an elbow in the curve of a given cost measure. In contrast, statistically founded methods often make strict assumptions over the data distribution or come with their own optimization scheme for the clustering objective. This limits either the set of applicable datasets or clustering algorithms. In this paper, we strive to determine the number of clusters by answering a simple question: given two clusters, is it likely that they jointly stem from a single distribution? To this end, we propose a bound on the probability that two clusters originate from the distribution of the unified cluster, specified only by the sample mean and variance. Our method is applicable as a simple wrapper to the result of any clustering method minimizing the objective of k-means, which includes Gaussian mixtures and Spectral Clustering. We focus in our experimental evaluation on an application for nonconvex clustering and demonstrate the suitability of our theoretical results. Our SpecialK clustering algorithm automatically determines the appropriate value for k, without requiring any data transformation or projection, and without assumptions on the data distribution. Additionally, it is capable of deciding that the data consists of only a single cluster, which many existing algorithms cannot.
picture_as_pdf View full article Reproducible Research
Tue, 15:20 - 15:40 @ 1.011 (Poster@Wed) Ranking
Sequential Learning over Implicit Feedback for Robust Large-Scale Recommender Systems (820)
Aleksandra Burashnikova (Skolkovo Institute of Science and Technology), Yury Maximov (Theoretical Division T-5 and CNLS, Los Alamos National Laboratory), Massih-Reza Amini (Université Grenoble Alpes)

In this paper, we propose a theoretically founded sequential strategy for training large-scale Recommender Systems (RS) over implicit feedback, mainly in the form of clicks. The proposed approach consists in minimizing a pairwise ranking loss over blocks of consecutive items constituted by a sequence of non-clicked items followed by a clicked one for each user. Parameter updates are discarded if, for a given user, the number of sequential blocks is below or above some given thresholds estimated over the distribution of the number of blocks in the training set. This is to prevent updating the parameters for an abnormally high number of clicks over some targeted items, mainly due to bots, or for very few user interactions. Both scenarios affect the decision of the RS and imply a shift in the distribution of items that are shown to the users. We provide a proof of convergence of the algorithm to the minimizer of the ranking loss, in the case where the latter is convex. Furthermore, experimental results on five large-scale collections demonstrate the efficiency of the proposed algorithm with respect to state-of-the-art approaches, both regarding different ranking measures and computation time.
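To make the notion of "blocks" concrete, here is a small sketch (my own illustration, based only on the description above) that splits one user's interaction sequence into blocks of non-clicked items terminated by a clicked item, and keeps the user only if the number of blocks falls within given thresholds.

```python
def user_blocks(interactions, min_blocks=2, max_blocks=50):
    """interactions: list of (item_id, clicked) in chronological order.
    Returns the list of blocks (non-clicked items, clicked item), or None if the
    number of blocks is outside the allowed range and updates should be discarded."""
    blocks, negatives = [], []
    for item, clicked in interactions:
        if clicked:
            blocks.append((tuple(negatives), item))
            negatives = []
        else:
            negatives.append(item)
    if not (min_blocks <= len(blocks) <= max_blocks):
        return None
    return blocks

history = [(1, 0), (2, 0), (3, 1), (4, 0), (5, 1), (6, 0)]
print(user_blocks(history))   # [((1, 2), 3), ((4,), 5)]; trailing non-clicks without a click are dropped
```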
picture_as_pdf View full article Reproducible Research
Generating Black-Box Adversarial Examples for Text Classifiers Using a Deep Reinforced Model (852)
Prashanth Vijayaraghavan (MIT Media Lab), Deb Roy (MIT Media Lab)

Recently, generating adversarial examples has become an important means of measuring robustness of a deep learning model. Adversarial examples help us identify the susceptibilities of the model and further counter those vulnerabilities by applying adversarial training techniques. In natural language domain, small perturbations in the form of misspellings or paraphrases can drastically change the semantics of the text. We propose a reinforcement learning based approach towards generating adversarial examples in black-box settings. We demonstrate that our method is able to fool well-trained models for (a) IMDB sentiment classification task and (b) AG's news corpus news categorization task with significantly high success rates. We find that the adversarial examples generated are semantics-preserving perturbations to the original text.
picture_as_pdf View full article
Thu, 10:00 - 10:30 @ 0.004 (AOK-HS) (Poster@Thu) Best DM Paper
FastPoint: Scalable Deep Point Processes (861)
Ali Caner Türkmen (Boğaziçi University), Yuyang Wang (Amazon Research), Alexander J. Smola (Amazon Research)

We propose FastPoint, a novel multivariate point process that enables fast and accurate learning and inference. FastPoint uses deep recurrent neural networks to capture complex temporal dependency patterns among different marks, while self-excitation dynamics within each mark are modeled with Hawkes processes. This results in substantially more efficient learning and scales to millions of correlated marks with superior predictive accuracy. Our construction also allows for efficient and parallel sequential Monte Carlo sampling for fast predictive inference. FastPoint outperforms baseline methods in prediction tasks on synthetic and real-world high-dimensional event data at a small fraction of the computational cost.
picture_as_pdf View full article
From abstract items to latent spaces to observed data and back: Compositional Variational Auto-Encoder (869)
Victor Berger (TAU, CNRS - INRIA - LRI - Univ. Paris-Saclay), Michele Sebag (TAU, CNRS - INRIA - LRI - Univ. Paris-Saclay)

Conditional Generative Models are now acknowledged as an essential tool in Machine Learning. This paper focuses on their control. While many approaches aim at disentangling the data through the coordinate-wise control of their latent representations, another direction is explored in this paper. The proposed CompVAE handles data with a natural multi-ensemblist structure (i.e. data that can naturally be decomposed into elements). Derived from Bayesian variational principles, CompVAE learns a latent representation leveraging both observational and symbolic information. A first contribution of the approach is that this latent representation supports a compositional generative model, amenable to multi-ensemblist operations (addition or subtraction of elements in the composition). This compositional ability is enabled by the invariance and generality of the whole framework w.r.t., respectively, the order and the number of elements. The second contribution of the paper is a proof of concept on synthetic 1D and 2D problems, demonstrating the efficiency of the proposed approach.
picture_as_pdf View full article Reproducible Research
Single-Path NAS: Designing Hardware-Efficient ConvNets in less than 4 Hours (880)
Dimitrios Stamoulis (Carnegie Mellon University), Ruizhou Ding (Carnegie Mellon University), Di Wang (Microsoft), Dimitrios Lymberopoulos (Microsoft), Bodhi Priyantha (Microsoft), Jie Liu (Harbin Institute of Technology), Diana Marculescu (Carnegie Mellon University)

Can we automatically design a Convolutional Network (ConvNet) with the highest image classification accuracy under the latency constraint of a mobile device? Neural architecture search (NAS) has revolutionized the design of hardware-efficient ConvNets by automating this process. However, the NAS problem remains challenging due to the combinatorially large design space, causing a significant searching time (at least 200 GPU-hours). To alleviate this complexity, we propose Single-Path NAS, a novel differentiable NAS method for designing hardware-efficient ConvNets in less than 4 hours. Our contributions are as follows: 1. Single-path search space: Compared to previous differentiable NAS methods, Single-Path NAS uses one single-path over-parameterized ConvNet to encode all architectural decisions with shared convolutional kernel parameters, hence drastically decreasing the number of trainable parameters and the search cost down to a few epochs. 2. Hardware-efficient ImageNet classification: Single-Path NAS achieves 74.96% top-1 accuracy on ImageNet.
picture_as_pdf View full article Reproducible Research
Bayesian Generalized Horseshoe Estimation of Generalized Linear Models (902)
Daniel F. Schmidt (Monash University; University of Melbourne), Enes Makalic (University of Melbourne)

Bayesian global-local shrinkage estimation with the generalized horseshoe prior represents the state of the art for Gaussian regression models. The extension to non-Gaussian data, such as binary or Student-t regression, is usually done by exploiting a scale-mixture-of-normals approach. However, many standard distributions, such as the gamma and the Poisson, do not admit such a representation. We contribute two extensions to global-local shrinkage methodology. The first is an adaptation of recent auxiliary gradient-based sampling schemes to the global-local shrinkage framework, which yields simple algorithms for sampling from generalized linear models. We also introduce two new samplers for the hyperparameters in the generalized horseshoe model, one based on an inverse-gamma mixture of inverse-gamma distributions, and the second a rejection sampler. Results show that these new samplers are highly competitive with the no-U-turn sampler for small numbers of predictors, and potentially perform better for larger numbers of predictors. Results for hyperparameter sampling show that our new inverse-gamma inverse-gamma based sampling scheme outperforms the standard sampler based on a gamma mixture of gamma distributions.
picture_as_pdf View full article Reproducible Research
Learning to Signal in the Goldilocks Zone: Improving Adversary Compliance in Security Games (923)
Sarah Cooney (University of Southern California), Kai Wang (University of Southern California), Elizabeth Bondi (University of Southern California), Thanh Nguyen (University of Oregon), Phebe Vayanos (University of Southern California), Hailey Winetrobe (University of Southern California), Edward A. Cranford (Carnegie Mellon University), Cleotilde Gonzalez (Carnegie Mellon University), Christian Lebiere (Carnegie Mellon University), Milind Tambe (University of Southern California)

Many real-world security scenarios can be modeled via a game-theoretic framework known as a security game in which there is a defender trying to protect potential targets from an attacker. Recent work in security games has shown that deceptive signaling by the defender can convince an attacker to withdraw his attack. For instance, a warning message to commuters indicating speed enforcement is in progress ahead might lead to them driving more slowly, even if it turns out no enforcement is in progress. However, the results of this work are limited by the unrealistic assumption that the attackers will behave with perfect rationality, meaning they always choose an action that gives them the best expected reward. We address the problem of training boundedly rational (human) attackers to comply with signals via repeated interaction with signaling without incurring a loss to the defender, and offer the four following contributions: (i) We learn new decision tree and neural network-based models of attacker compliance with signaling. (ii) Based on these machine learning models of a boundedly rational attacker's response to signaling, we develop a theory of signaling in the Goldilocks zone, a balance of signaling and deception that increases attacker compliance and improves defender utility. (iii) We present game-theoretic algorithms to solve for signaling schemes based on the learned models of attacker compliance with signaling. (iv) We conduct extensive human subject experiments using an online game. The game simulates the scenario of an inside attacker trying to steal sensitive information from company computers, and results show that our algorithms based on learned models of attacker behavior lead to better attacker compliance and improved defender utility compared to the state-of-the-art algorithm for rational attackers with signaling.
picture_as_pdf View full article
Fine-Grained Explanations using Markov Logic (937)
Khan Mohammad Al Farabi (University of Memphis), Somdeb Sarkhel (Adobe Research), Sanorita Dey (University of Illinois at Urbana-Champaign), Deepak Venugopal (University of Memphis)

Explaining the results of machine learning algorithms is crucial given the rapid growth and potential applicability of these methods in critical domains including healthcare, defense, autonomous driving, etc. In this paper, we address this problem in the context of Markov Logic Networks (MLNs), which are highly expressive statistical relational models that combine first-order logic with probabilistic graphical models. MLNs in general are known to be interpretable models, i.e., MLNs can be understood more easily by humans compared to models learned by approaches such as deep learning. However, at the same time, it is not straightforward to obtain human-understandable explanations specific to an observed inference result (e.g. a marginal probability estimate). This is because the MLN provides a lifted interpretation, one that generalizes to all possible worlds/instantiations, which are not query/evidence specific. In this paper, we extract grounded explanations, i.e., explanations defined w.r.t. specific inference queries and observed evidence. We extract these explanations from importance weights defined over the MLN formulas that encode the contribution of formulas towards the final inference results. We validate our approach on real-world problems related to analyzing reviews from Yelp, and show through user studies that our explanations are richer than state-of-the-art non-relational explainers such as LIME.
picture_as_pdf View full article
Generative Adversarial Networks for Failure Prediction (23)
Shuai Zheng (Industrial AI Lab, Hitachi America Ltd), Ahmed Farahat (Industrial AI Lab, Hitachi America Ltd), Chetan Gupta (Industrial AI Lab, Hitachi America Ltd)

Prognostics and Health Management (PHM) is an emerging engineering discipline which is concerned with the analysis and prediction of equipment health and performance. One of the key challenges in PHM is to accurately predict impending failures in the equipment. In recent years, solutions for failure prediction have evolved from building complex physical models to the use of machine learning algorithms that leverage the data generated by the equipment. However, failure prediction problems pose a set of unique challenges that make direct application of traditional classification and prediction algorithms impractical. These challenges include the highly imbalanced training data, the extremely high cost of collecting more failure samples, and the complexity of the failure patterns. Traditional oversampling techniques will not be able to capture such complexity and accordingly result in overfitting the training data. This paper addresses these challenges by proposing a novel algorithm for failure prediction using Generative Adversarial Networks (GAN-FP). GAN-FP first utilizes two GAN networks to simultaneously generate training samples and build an inference network that can be used to predict failures for new samples. GAN-FP first adopts an infoGAN to generate realistic failure and non-failure samples, and initialize the weights of the first few layers of the inference network. The inference network is then tuned by optimizing a weighted loss objective using only real failure and non-failure samples. The inference network is further tuned using a second GAN whose purpose is to guarantee the consistency between the generated samples and corresponding labels. GAN-FP can be used for other imbalanced classification problems as well. Empirical evaluation on several benchmark datasets demonstrates that GAN-FP significantly outperforms existing approaches, including under-sampling, SMOTE, ADASYN, weighted loss, and infoGAN augmented training.
picture_as_pdf View full article
Scalable Bid Landscape Forecasting in Real-time Bidding (123)
Aritra Ghosh (University of Massachusetts, Amherst), Saayan Mitra (Adobe Research), Somdeb Sarkhel (Adobe Research), Jason Xie (Adobe Advertising Cloud), Gang Wu (Adobe Research), Viswanathan Swaminathan (Adobe Research)

In programmatic advertising, ad slots are usually sold using second-price (SP) auctions in real time. The highest bidding advertiser wins but pays only the second highest bid (known as the winning price). In SP, for a single item, the dominant strategy of each bidder is to bid the true value from the bidder's perspective. However, in a practical setting with budget constraints, bidding the true value is a sub-optimal strategy. Hence, to devise an optimal bidding strategy, it is of utmost importance to learn the winning price distribution accurately. Moreover, a demand-side platform (DSP), which bids on behalf of advertisers, observes the winning price only if it wins the auction. For losing auctions, a DSP can only treat its bidding price as a lower bound for the unknown winning price. In the literature, censored regression is typically used to model such partially observed data. A common assumption in censored regression is that the winning price is drawn from a fixed-variance (homoscedastic) uni-modal distribution (most often Gaussian). However, in reality, these assumptions are often violated. We relax these assumptions and propose a heteroscedastic fully parametric censored regression approach, as well as a mixture density censored network. Our approach not only generalizes censored regression but also provides the flexibility to model arbitrarily distributed real-world data. Experimental evaluation on the publicly available dataset for winning price estimation demonstrates the effectiveness of our method. Furthermore, we evaluate our algorithm on one of the largest demand-side platforms, and significant improvements have been achieved in comparison with the baseline solutions.
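For context, the standard censored-regression likelihood that the abstract builds on can be written as follows (a textbook formulation in notation of my own choosing: model density f_θ and CDF F_θ of the winning price given features x, observed winning price w for won auctions, and own bid b for lost auctions):

```latex
\mathcal{L}(\theta) \;=\; \sum_{i \in \text{won}} \log f_{\theta}\!\left(w_i \mid x_i\right)
\;+\; \sum_{j \in \text{lost}} \log \left(1 - F_{\theta}\!\left(b_j \mid x_j\right)\right)
```

The heteroscedastic and mixture-density variants described above then replace f_θ with, respectively, a Gaussian whose variance also depends on x, or a mixture of component densities.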
picture_as_pdf View full article
Interpreting atypical conditions in systems with deep conditional Autoencoders: the case of electrical consumption (175)
Antoine Marot (Reseau Transport Electricite R&D), Antoine Rosin (Reseau Transport Electricite R&D), Laure Crochepierre (Reseau Transport Electricite R&D; Universite de Lorraine), Benjamin Donnot (Reseau Transport Electricite R&D), Pierre Pinson (DTU Technical University of Denmark), Lydia Boudjeloud-Assala (Universite de Lorraine)

In this paper, we propose a new method to iteratively and interactively characterize new feature conditions for signals of daily French electrical consumption from our historical database, relying on Conditional Variational Autoencoders. An autoencoder first learns a compressed similarity-based representation of the signals in a latent space, in which one can select and extract well-represented expert features. Then, we successfully condition the model over the set of extracted features, as opposed to a simple target label as done previously, to learn conditionally independent new residual latent representations. Unknown, or previously unselected, factors such as atypical conditions now appear well-represented and can be detected and further interpreted by experts. By applying our method, we recover the appropriate known expert features and eventually discover, through adapted representations, atypical known and unknown conditions such as holidays, fuzzy non-working days and weather events, which were actually related to important events that influenced consumption.
picture_as_pdf View full article Reproducible Research
Manufacturing Dispatching using Reinforcement and Transfer Learning (225)
Shuai Zheng (Industrial AI Lab, Hitachi America Ltd), Chetan Gupta (Industrial AI Lab, Hitachi America Ltd), Susumu Serita (Industrial AI Lab, Hitachi America Ltd)

Efficient dispatching rules in the manufacturing industry are key to ensuring on-time product delivery and minimizing past-due and inventory costs. Manufacturing, especially in the developed world, is moving towards on-demand manufacturing, meaning a high-mix, low-volume product mix. This requires efficient dispatching that can work in dynamic and stochastic environments, meaning it allows for a quick response to newly received orders and can work over a disparate set of shop floor settings. In this paper we address this problem of dispatching in manufacturing. Using reinforcement learning (RL), we propose a new design to formulate the shop floor state as a 2-D matrix, incorporate job slack time into the state representation, and design lateness and tardiness reward functions for dispatching purposes. However, maintaining a separate RL model for each production line on a manufacturing shop floor is costly and often infeasible. To address this, we enhance our deep RL model with an approach for dispatching policy transfer. This increases policy generalization and saves time and cost for model training and data collection. Experiments show that: (1) our approach performs best in terms of total discounted reward and average lateness and tardiness, and (2) the proposed policy transfer approach reduces training time and increases policy generalization.
picture_as_pdf View full article
Automatic Recognition of Student Engagement using Deep Learning and Facial Expression (241)
Omid Mohamad Nezami (Macquarie University; CSIRO's Data61), Mark Dras (Macquarie University), Len Hamey (Macquarie University), Deborah Richards (Macquarie University), Stephen Wan (CSIRO's Data61), Cécile Paris (CSIRO's Data61)

Engagement is a key indicator of the quality of learning experience, and one that plays a major role in developing intelligent educational interfaces. Any such interface requires the ability to recognise the level of engagement in order to respond appropriately; however, there is very little existing data to learn from, and new data is expensive and difficult to acquire. This paper presents a deep learning model to improve engagement recognition from images that overcomes the data sparsity challenge by pre-training on readily available basic facial expression data, before training on specialised engagement data. In the first of two steps, a facial expression recognition model is trained to provide a rich face representation using deep learning. In the second step, we use the model's weights to initialize our deep learning based model to recognize engagement; we term this the engagement model. We train the model on our new engagement recognition dataset with 4627 engaged and disengaged samples. We find that the engagement model outperforms effective deep learning architectures that we apply for the first time to engagement recognition, as well as approaches using histogram of oriented gradients and support vector machines.
picture_as_pdf View full article Reproducible Research
Augmenting Semantic Representation of Depressive Language: from Forums to Microblogs (259)
Nawshad Farruque (University of Alberta), Osmar Zaiane (University of Alberta), Randy Goebel (University of Alberta)

We discuss and analyze the process of creating word embedding feature representations specifically designed for a learning task when annotated data is scarce, such as depressive language detection from Tweets. We start from a rich word embedding pre-trained on a general dataset, then enhance it with an embedding learned from a domain-specific but relatively much smaller dataset. Our strengthened representation better portrays the domain of depression we are interested in, as it combines the semantics learned from the specific domain and the word coverage of the general language. We present a comparative analysis of our word embedding representations with a simple bag-of-words model, a well-known sentiment lexicon, a psycholinguistic lexicon, and a general pre-trained word embedding, based on their efficacy in accurately identifying depressive Tweets. We show that our representation achieves a significantly better F1 score than the others when applied to a high-quality dataset.
picture_as_pdf View full article Reproducible Research
An aggregate learning approach for interpretable semi-supervised population prediction and disaggregation using ancillary data (284)
Guillaume Derval (ICTEAM, UCLouvain), Frédéric Docquier (IRES, UCLouvain), Pierre Schaus (ICTEAM, UCLouvain)

Census data provide detailed information about population characteristics at a coarse resolution. Nevertheless, fine-grained, high-resolution mappings of population counts are increasingly needed to characterize population dynamics and to assess the consequences of climate shocks, natural disasters, investments in infrastructure, development policies, etc. Disaggregating these censuses is a complex machine learning problem, and multiple solutions have been proposed in past research. We propose in this paper to view the problem in the context of the aggregate learning paradigm, where the output value is not known for individual training points but only for aggregates of points (i.e., in this context, for regions of pixels where a census is available). We demonstrate with a very simple and interpretable model that this method is on par with, and on some metrics even outperforms, the state of the art, despite its simplicity.
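A minimal formulation of the aggregate-learning objective described above (generic notation of my own choosing, not taken from the paper: pixels i with covariates x_i, regions r with census counts y_r, and a per-pixel model f_θ):

```latex
\min_{\theta} \;\sum_{r} \Big( \sum_{i \in r} f_{\theta}(x_i) \;-\; y_r \Big)^{2},
\qquad \text{with the disaggregated map given by } \hat{y}_i = f_{\theta}(x_i).
```

Supervision is only available at the region level, yet the fitted model produces a pixel-level prediction, which is exactly the disaggregation output the abstract describes.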
picture_as_pdf View full article Reproducible Research
Augmenting Physiological Time Series Data: A Case Study for Sleep Apnea Detection (293)
Konstantinos Nikolaidis (University of Oslo), Stein Kristiansen (University of Oslo), Vera Goebel (University of Oslo), Thomas Plagemann (University of Oslo), Knut Liestøl (University of Oslo), Mohan Kankanhalli (National University of Singapore)

Supervised machine learning applications in the health domain often face the problem of insufficient training datasets. The quantity of labelled data is small due to privacy concerns and the cost of data acquisition and labelling by a medical expert. Furthermore, it is quite common that collected data are unbalanced, and getting enough data to personalize models for individuals is very expensive or even infeasible. This paper addresses these problems by (1) designing a recurrent Generative Adversarial Network to generate realistic synthetic data and to augment the original dataset, (2) enabling the generation of balanced datasets based on a heavily unbalanced dataset, and (3) controlling the data generation in such a way that the generated data resembles data from specific individuals. We apply these solutions to sleep apnea detection and study in the evaluation the performance of four well-known techniques, i.e., K-Nearest Neighbour, Random Forest, Multi-Layer Perceptron, and Support Vector Machine. All classifiers exhibit in the experiments a consistent increase in sensitivity and a kappa statistic increase of between 0.72·10^-2 and 18.2·10^-2.
picture_as_pdf View full article
Marine Mammal Species Classification using Convolutional Neural Networks and a Novel Acoustic Representation (314)
Mark Thomas (Dalhousie University Faculty of Computer Science), Bruce Martin (JASCO Applied Sciences), Katie Kowarski (JASCO Applied Sciences), Briand Gaudet (JASCO Applied Sciences), Stan Matwin (Dalhousie University Faculty of Computer Science; Institute of Computer Science Polish Academy of Sciences)

Research into automated systems for detecting and classifying marine mammals in acoustic recordings is expanding internationally due to the necessity to analyze large collections of data for conservation purposes. In this work, we present a Convolutional Neural Network that is capable of classifying the vocalizations of three species of whales, non-biological sources of noise, and a fifth class pertaining to ambient noise. In this way, the classifier is capable of detecting the presence and absence of whale vocalizations in an acoustic recording. Through transfer learning, we show that the classifier is capable of learning high-level representations and can generalize to additional species. We also propose a novel representation of acoustic signals that builds upon the commonly used spectrogram representation by way of interpolating and stacking multiple spectrograms produced using different Short-time Fourier Transform (STFT) parameters. The proposed representation is particularly effective for the task of marine mammal species classification where the acoustic events we are attempting to classify are sensitive to the parameters of the STFT.
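The multi-spectrogram representation can be sketched roughly as follows (my own toy illustration, not the authors' code): compute several STFT magnitude spectrograms with different window lengths, bring them to a common shape, and stack them as channels of a single input.

```python
import numpy as np
from scipy.signal import stft
from scipy.ndimage import zoom

def crop_or_pad(a, shape):
    """Force a 2-D array to an exact target shape by cropping then edge-padding."""
    a = a[:shape[0], :shape[1]]
    return np.pad(a, [(0, shape[0] - a.shape[0]), (0, shape[1] - a.shape[1])], mode="edge")

def stacked_spectrograms(x, fs, npersegs=(256, 512, 1024), target=(128, 128)):
    """Magnitude spectrograms for several STFT window lengths, resized and stacked as channels."""
    channels = []
    for nperseg in npersegs:
        _, _, Z = stft(x, fs=fs, nperseg=nperseg)
        S = np.log1p(np.abs(Z))
        S = zoom(S, (target[0] / S.shape[0], target[1] / S.shape[1]), order=1)
        channels.append(crop_or_pad(S, target))
    return np.stack(channels, axis=-1)   # shape (freq, time, n_channels)

fs = 8000
t = np.arange(0, 2.0, 1 / fs)
x = np.sin(2 * np.pi * 440 * t) + 0.1 * np.random.default_rng(0).normal(size=t.size)
print(stacked_spectrograms(x, fs).shape)   # (128, 128, 3)
```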
picture_as_pdf View full article
LSTM encoder-predictor for short-term train load forecasting (352)
Kevin Pasini (Université Paris-Est; IRT SystemX), Mostepha Khouadjia (Université Paris-Est), Allou Samé (IRT SystemX), Fabrice Ganansia (SNCF- Innovation & Recherche), Latifa Oukhellou (IRT SystemX)

The increase in the amount of data collected in the transport domain can greatly benefit mobility studies and help to create high value-added mobility services for passengers as well as regulation tools for operators. The research detailed in this paper is related to the development of an advanced machine learning approach with the aim of forecasting the passenger load of trains in public transport. Predicting the crowding level on public transport can indeed be useful for enriching the information available to passengers, enabling them to better plan their daily trips. Moreover, operators will increasingly need to assess and predict network passenger load to improve train regulation processes and service quality levels. The main issues to address in this forecasting task are the variability in the train load series induced by the train schedule and the influence of several contextual factors, such as calendar information. We propose a neural network LSTM encoder-predictor combined with contextual representation learning to address this problem. Experiments are conducted on a real dataset provided by the French railway company SNCF and collected over a period of one and a half years. The prediction performance of the proposed model is compared to that of historical models and of traditional machine learning models. The obtained results demonstrate the potential of the proposed LSTM encoder-predictor to address both one-step-ahead and multi-step forecasting and to outperform other models by maintaining robustness in the quality of the forecasts throughout the time horizon.
picture_as_pdf View full article
Learning Disentangled Representations of Satellite Image Time Series (372)
Eduardo H. Sanchez (IRT Saint Exupéry, Toulouse; IRIT, Université Toulouse III - Paul Sabatier), Mathieu Serrurier (IRT Saint Exupéry, Toulouse; IRIT, Université Toulouse III - Paul Sabatier), Mathias Ortner (IRT Saint Exupéry, Toulouse)

In this paper, we investigate how to learn a suitable representation of satellite image time series in an unsupervised manner by leveraging large amounts of unlabeled data. Additionally, we aim to disentangle the representation of time series into two representations: a shared representation that captures the common information between the images of a time series and an exclusive representation that contains the specific information of each image of the time series. To address these issues, we propose a model that combines a novel component called cross-domain autoencoders with the variational autoencoder (VAE) and generative adversarial network (GAN) methods. In order to learn disentangled representations of time series, our model learns the multimodal image-to-image translation task. We train our model using satellite image time series provided by the Sentinel-2 mission. Several experiments are carried out to evaluate the obtained representations. We show that these disentangled representations can be very useful to perform multiple tasks such as image classification, image retrieval, image segmentation and change detection.
picture_as_pdf View full article
A Deep Multi-Task Approach for Residual Value Forecasting (410)
Ahmed Rashed (University of Hildesheim), Shayan Jawed (University of Hildesheim), Jens Rehberg (Volkswagen Financial Services AG, Braunschweig), Josif Grabocka (University of Hildesheim), Lars Schmidt-Thieme (University of Hildesheim), Andre Hintsches (Volkswagen Financial Services AG, Braunschweig)

Residual value forecasting plays an important role in many areas, e.g., for vehicles, to price leasing contracts. High forecasting accuracy is crucial, as any overestimation will lead to lost sales due to customer dissatisfaction, while underestimation will lead to a direct loss in revenue when reselling the car at the end of the leasing contract. Current forecasting models mainly rely on the trend analysis of historical sales records. However, these models require extensive manual steps to filter and preprocess those records, which in turn limits the frequency at which these models can be updated with new data. To automate, improve and speed up residual value forecasting, we propose a multi-task model that utilizes, besides the residual value itself as the main target, the actual mileage that has been driven as a co-target. While combining off-the-shelf regression models with careful feature engineering already yields useful models, we show that for residual values further quantities, such as the actual driven mileage, contain additional useful information. As this information is not available when contracts are made and the forecast is due, we model the problem as a multi-task model that uses the actual mileage as a training co-target. Experiments on three Volkswagen car models show that the proposed model significantly outperforms the straightforward modeling as a single-target regression problem.
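The co-target idea can be summarized by a joint training objective of the following form (generic notation of my own, not taken from the paper: shared representation g_θ, heads h_rv and h_mil, residual value y, driven mileage m, and a weighting factor λ); at prediction time only the residual-value head is used, since the mileage is unknown when the contract is made:

```latex
\mathcal{L}(\theta) \;=\; \frac{1}{N}\sum_{n=1}^{N}
\Big[ \big(h_{\mathrm{rv}}(g_{\theta}(x_n)) - y_n\big)^{2}
\;+\; \lambda \,\big(h_{\mathrm{mil}}(g_{\theta}(x_n)) - m_n\big)^{2} \Big]
```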
picture_as_pdf View full article
Characterization and Early Detection of Evergreen News Articles (419)
Yiming Liao (Pennsylvania State University), Shugang Wang (The Washington Post), Eui-Hong (Sam) Han (Marriott International), Jongwuk Lee (Sungkyunkwan University), Dongwon Lee (Pennsylvania State University)

Although the majority of news articles are only viewed for days or weeks, there is a small fraction of news articles that are read across years; these are named evergreen news articles. Because evergreen articles maintain a timeless quality and are consistently of interest to the public, understanding their characteristics better has huge implications for news outlets and platforms, yet few studies have explicitly investigated evergreen articles. Addressing this gap, in this paper, we first propose a flexible parameterized definition of evergreen articles to capture their long-term high-traffic patterns. Using a real dataset from The Washington Post, we then unearth several distinctive characteristics of evergreen articles and build an early prediction model with encouraging results. Although less than 1
picture_as_pdf View full article
Transfer Learning in Credit Risk (491)
Hendra Suryanto (Rich Data Corporation, Singapore), Charles Guan (Rich Data Corporation, Singapore), Andrew Voumard (Rich Data Corporation, Singapore), Ghassan Beydoun (University of Technology Sydney)

In the credit risk domain, lenders frequently face situations where there is no, or limited historical lending outcome data. This generally results in limited or unaffordable credit for some individuals and small businesses. Transfer learning can potentially reduce this limitation, by leveraging knowledge from related domains, with sufficient outcome data. We investigated the potential for applying transfer learning across various credit domains, for example, from the credit card lending and debt consolidation domain into the small business lending domain.
picture_as_pdf View full article Reproducible Research
Wearable-based Parkinson's Disease Severity Monitoring using Deep Learning (575)
Jann Goschenhofer (Dept. of Statistics, Ludwig-Maximilians University, Munich; ConnectedLife GmbH, Munich), Franz M. J. Pfister (Dept. of Statistics, Ludwig-Maximilians University, Munich; ConnectedLife GmbH, Munich), Kamer Ali Yuksel (ConnectedLife GmbH, Munich), Bernd Bischl (Dept. of Statistics, Ludwig-Maximilians University, Munich), Urban Fietzek (Dept. of Neurology, Ludwig-Maximilians University, Munich; Dept. of Neurology), Janek Thomas (Clinical Neurophysiology, Schoen Clinic Schwabing)

One major challenge in the medication of Parkinson's disease is that the severity of the disease, reflected in the patients' motor state, cannot be measured using accessible biomarkers. Therefore, we develop and examine a variety of statistical models to detect the motor state of such patients based on sensor data from a wearable device. We find that deep learning models consistently outperform a classical machine learning model applied on hand-crafted features in this time series classification task. Furthermore, our results suggest that treating this problem as a regression instead of an ordinal regression or a classification task is most appropriate. For consistent model evaluation and training, we adopt the leave-one-subject-out validation scheme to the training of deep learning models. We also employ a class-weighting scheme to successfully mitigate the problem of high multi-class imbalances in this domain. In addition, we propose a customized performance measure that reflects the requirements of the involved medical staff on the model. To solve the problem of limited availability of high quality training data, we propose a transfer learning technique which helps to improve model performance substantially. Our results suggest that deep learning techniques offer a high potential to autonomously detect motor states of patients with Parkinson's disease.
picture_as_pdf View full article
Optimizing Neural Networks for Patent Classification (619)
Louay Abdelgawad (Averbis GmbH, Freiburg), Peter Kluegl (Averbis GmbH, Freiburg), Erdan Genc (Averbis GmbH, Freiburg), Stefan Falkner (Albert-Ludwigs University of Freiburg), Frank Hutter (Albert-Ludwigs University of Freiburg)

A great number of patents are filed every day with patent offices worldwide. Each of these patents has to be labeled by domain experts with one or many of thousands of categories. This process is not only extremely expensive but also overwhelming for the experts, due to the considerable increase of filed patents over the years and the increasing complexity of the hierarchical categorization structure. Therefore, it is critical to automate the manual classification process using a classification model. In this paper, the automation of the task is carried out based on recent advances in deep learning for NLP and compared to customized approaches. Moreover, an extensive optimization analysis grants insights about hyperparameter importance. Our optimized convolutional neural network achieves a new state-of-the-art performance of 55.02%.
picture_as_pdf View full article Reproducible Research
Investigating Time Series Classification Techniques for Rapid Pathogen Identification with Single-Cell MALDI-TOF Mass Spectrum Data (620)
Christina Papagiannopoulou (Ghent University), René Parchen (BiosparQ B.V., Leiden), Willem Waegeman (Ghent University)

Matrix-assisted laser desorption/ionization time-of-flight mass spectrometry (MALDI-TOF-MS) is a well-known technology, widely used in species identification. Specifically, MALDI-TOF-MS is applied to samples that usually include bacterial cells, generating representative signals for the various bacterial species. However, for a reliable identification result, a significant amount of biomass is required. For most samples used for diagnostics of infectious diseases, the sample volume is too low to obtain the required amount of biomass. Therefore, amplification of the bacterial load is performed by a culturing phase. If the MALDI process could be applied to individual bacteria, it would be possible to circumvent the need for culturing and isolation, accelerating the whole process. In this paper, we briefly describe an implementation of a MALDI-TOF MS procedure in a setting of individual cells and we demonstrate the use of the produced data for the application of pathogen identification. The identification of pathogens (bacterial species) is performed by using machine learning algorithms on the generated single-cell signals. The high predictive performance of the machine learning models indicates that the produced bacterial signatures constitute an informative representation, helpful in distinguishing the different bacterial species. In addition, we reformulate the bacterial species identification problem as a time series classification task by considering the intensity sequences of a given spectrum as time series values. Experimental results show that algorithms originally introduced for time series analysis are beneficial in modelling observations of single-cell MALDI-TOF MS.
picture_as_pdf View full article
The Search for Equations - Learning to Identify Similarities between Mathematical Expressions (657)
Lukas Pfahler (TU Dortmund University), Jonathan Schill (TU Dortmund University), Katharina Morik (TU Dortmund University)

On your search for scientific articles relevant to your research question, you judge the relevance of a mathematical expression that you stumble upon using extensive background knowledge about the domain, its problems and its notations. We wonder if machine learning can support this process and work toward implementing a search engine for mathematical expressions in scientific publications. Thousands of scientific publications with millions of mathematical expressions or equations are accessible at arXiv.org. We want to use this data to learn about equations, their distribution and their relations in order to find similar equations. To this end we propose an embedding model based on convolutional neural networks that maps bitmap images of equations into a low-dimensional vector-space where similarity is evaluated via dot-product. However, no annotated similarity data is available to train this mapping. We mitigate this by proposing a number of different unsupervised proxy tasks that use available features as weak labels. We evaluate our system using a number of metrics, including results on a small hand-labeled subset of equations. In addition, we show and discuss a number of result-sets for some sample queries. The results show that we are able to automatically identify related mathematical expressions. Our dataset is published at https://whadup.github.io/EquationLearning/ and we invite the community to use it.
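A minimal sketch of the central building block - a convolutional encoder that maps an equation bitmap to a vector compared via dot products - is given below; the architecture, input size and embedding dimension are illustrative assumptions, not the authors' model.

import torch
import torch.nn as nn

class EquationEncoder(nn.Module):
    # maps a 1-channel equation bitmap to a low-dimensional embedding
    def __init__(self, dim=64):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 16, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(16, 32, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.AdaptiveAvgPool2d((4, 16)),
        )
        self.proj = nn.Linear(32 * 4 * 16, dim)

    def forward(self, x):
        return self.proj(self.features(x).flatten(1))

def similarity(encoder, queries, candidates):
    # dot-product similarity between two batches of equation bitmaps
    return (encoder(queries) * encoder(candidates)).sum(dim=1)

Training on the unsupervised proxy tasks mentioned in the abstract would then push related equations toward nearby embeddings.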
picture_as_pdf View full article Reproducible Research
Pushing the Limits of Exoplanet Discovery via Direct Imaging with Deep Learning (680)
Kai Hou Yip (University College London), Nikolaos Nikolaou (University College London), Piero Coronica (University of Cambridge), Angelos Tsiaras (University College London), Billy Edwards (University College London), Quentin Changeat (University College London), Mario Morvan (University College London), Beth Biller (University of Edinburgh), Sasha Hinkley (University of Exeter), Jeffrey Salmond (University of Cambridge), Matthew Archer (University of Cambridge), Paul Sumption (University of Cambridge), Elodie Choquet (Aix Marseille Univ), Remi Soummer (STScI), Laurent Pueyo (STScI), Ingo P. Waldmann (University College London)

Further advances in exoplanet detection and characterisation require sampling a diverse population of extrasolar planets. One technique to detect these distant worlds is through the direct detection of their thermal emission. The so-called direct imaging technique is suitable for observing young planets far from their star. These are very low signal-to-noise-ratio (SNR) measurements, and the limited ground truth hinders the use of supervised learning approaches. In this paper, we combine deep generative and discriminative models to bypass the issues arising when directly training on real data. We use a Generative Adversarial Network to obtain a suitable dataset for training Convolutional Neural Network classifiers to detect and locate planets across a wide range of SNRs. Tested on artificial data, our detectors exhibit good predictive performance and robustness across SNRs. To demonstrate the limits of the detectors, we provide maps of the precision and recall of the model per pixel of the input image. On real data, the models can re-confirm bright source detections.
picture_as_pdf View full article Reproducible Research
Cold-Start Recommendation for On-Demand Cinemas (689)
Beibei Li (State Key Laboratory of Computer Sciences, Institute of Software, ChineseAcademy of Sciences; University of Chinese Academy of Sciences), Beihong Jin (State Key Laboratory of Computer Sciences, Institute of Software, ChineseAcademy of Sciences; University of Chinese Academy of Sciences), Taofeng Xue (State Key Laboratory of Computer Sciences, Institute of Software, ChineseAcademy of Sciences; University of Chinese Academy of Sciences), Kunchi Liu (State Key Laboratory of Computer Sciences, Institute of Software, ChineseAcademy of Sciences; University of Chinese Academy of Sciences), Qi Zhang (Beijing iQIYI Cinema Management Co., Ltd.), Sihua Tian (Beijing iQIYI Cinema Management Co., Ltd.)

On-demand cinemas, which have emerged in recent years, provide offline entertainment venues for individuals and small groups. Because of limitations on network speed and storage space, it is necessary to recommend movies to cinemas, that is, to suggest which movies a cinema should download in advance. This is particularly true for new cinemas. For the new-cinema cold-start recommendation, we build a matrix factorization framework and then fuse location categories of cinemas and the co-popular relationship between movies into the framework. Specifically, location categories of cinemas are learned through LDA from the type information of POIs around the cinemas and used to approximate cinema latent representations. Moreover, an SPPMI matrix that reflects the co-popular relationship between movies is constructed and factorized collectively with the interaction matrix by sharing the same item latent representations. Extensive experiments on real-world data are conducted. The experimental results show that the proposed approach yields significant improvements over state-of-the-art cold-start recommenders in terms of precision, recall and NDCG.
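For readers unfamiliar with SPPMI, the sketch below shows one common way to build a shifted positive PMI matrix from movie co-occurrence counts; the exact co-popular statistic used in the paper may differ, and cooc is a placeholder for a symmetric count matrix.

import numpy as np

def sppmi(cooc, k=1.0):
    # cooc[i, j]: how often movies i and j are popular together
    total = cooc.sum()
    row = cooc.sum(axis=1, keepdims=True)
    col = cooc.sum(axis=0, keepdims=True)
    with np.errstate(divide="ignore", invalid="ignore"):
        pmi = np.log(cooc * total / (row * col))
    pmi[~np.isfinite(pmi)] = 0.0
    # shifted positive PMI: subtract log(k) and clip negative entries to zero
    return np.maximum(pmi - np.log(k), 0.0)

The resulting matrix would then be factorized jointly with the cinema-movie interaction matrix, sharing the item factors.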
picture_as_pdf View full article
Data-driven Policy on Feasibility Determination for the Train Shunting Problem (690)
Paulo Roberto de Oliveira da Costa (Eindhoven University of Technology), Jason Rhuggenaath (Eindhoven University of Technology), Yingqian Zhang (Eindhoven University of Technology), Alp Akcay (Eindhoven University of Technology), Wan-Jui Lee (Dutch Railways), Uzay Kaymak (Eindhoven University of Technology)

Parking, matching, scheduling, and routing are common problems in train maintenance. In particular, train units are commonly maintained and cleaned at dedicated shunting yards. The planning problem that results from such situations is referred to as the Train Unit Shunting Problem (TUSP). This problem involves matching arriving train units to service tasks and determining the schedule for departing trains. The TUSP is an important problem as it is used to determine the capacity of shunting yards and arises as a sub-problem of more general scheduling and planning problems. In this paper, we consider the case of the Dutch Railways (NS) TUSP. As the TUSP is complex, NS currently uses a local search (LS) heuristic to determine if an instance of the TUSP has a feasible solution. Given the number of shunting yards and the size of the planning problems, improving the evaluation speed of the LS brings significant computational gain. In this work, we use a machine learning approach that complements the LS and accelerates the search process. We use a Deep Graph Convolutional Neural Network (DGCNN) model to predict the feasibility of solutions obtained during the run of the LS heuristic. We use this model to decide whether to continue or abort the search process. In this way, the computation time is used more efficiently as it is spent on instances that are more likely to be feasible. Using simulations based on real-life instances of the TUSP, we show how our approach improves upon the previous method on prediction accuracy and leads to computational gains for the decision-making process.
picture_as_pdf View full article
Player Vectors: Characterizing Soccer Players' Playing Style from Match Event Streams (701)
Tom Decroos (KU Leuven), Jesse Davis (KU Leuven)

Transfer fees for soccer players are at an all-time high. To make the most of their budget, soccer clubs need to understand the type of players they have and the type of players that are on the market. Current insights into the playing style of players are mostly based on the opinions of human soccer experts such as trainers and scouts. Unfortunately, their opinions are inherently subjective and thus prone to faults. In this paper, we characterize the playing style of a player in a more rigorous, objective and data-driven manner. We characterize the playing style of a player using a so-called `player vector' that can be interpreted both by human experts and by machine learning systems. We demonstrate the validity of our approach by retrieving player identities from anonymized event stream data and present a number of use cases related to scouting and monitoring player development in top European competitions.
picture_as_pdf View full article
Shallow Self-Learning for Reject Inference in Credit Scoring (730)
Nikita Kozodoi (Humboldt University of Berlin; Kreditech, Hamburg), Panagiotis Katsas (Kreditech, Hamburg), Stefan Lessmann (Humboldt University of Berlin), Luis Moreira-Matias (Kreditech, Hamburg), Konstantinos Papakonstantinou (Kreditech, Hamburg)

Credit scoring models support loan approval decisions in the financial services industry. Lenders train these models on data from previously granted credit applications, where the borrowers' repayment behavior has been observed. This approach creates sample bias: the scoring model is trained on accepted cases only, and applying it to screen applications from the population of all borrowers degrades its performance. Reject inference comprises techniques to overcome sampling bias by assigning labels to rejected cases. This paper makes two contributions. First, we propose a self-learning framework for reject inference. The framework is geared toward real-world credit scoring requirements by considering distinct training regimes for labeling and model training. Second, we introduce a new measure to assess the effectiveness of reject inference strategies. Our measure leverages domain knowledge to avoid artificial labeling of rejected cases during evaluation. We demonstrate that this approach offers a robust and operational assessment of reject inference. Experiments on a real-world credit scoring data set confirm the superiority of the suggested self-learning framework over previous reject inference strategies. We also find strong evidence that the proposed evaluation measure assesses reject inference strategies more reliably, raising the performance of the eventual scoring model.
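The general shape of a self-learning loop for reject inference - train on accepted cases, pseudo-label the confidently scored rejects, retrain - can be sketched as below; the classifier, confidence threshold and stopping rule are placeholders and do not reflect the paper's specific training regimes.

import numpy as np
from sklearn.linear_model import LogisticRegression

def self_learning(X_acc, y_acc, X_rej, rounds=5, conf=0.9):
    # start from a scorecard fitted on accepted applications only
    model = LogisticRegression(max_iter=1000).fit(X_acc, y_acc)
    X_train, y_train = X_acc, y_acc
    for _ in range(rounds):
        proba = model.predict_proba(X_rej)[:, 1]
        confident = (proba >= conf) | (proba <= 1 - conf)
        if not confident.any():
            break
        # pseudo-label the confidently scored rejects and add them to the training set
        X_train = np.vstack([X_train, X_rej[confident]])
        y_train = np.concatenate([y_train, (proba[confident] >= conf).astype(int)])
        X_rej = X_rej[~confident]
        model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
    return model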
picture_as_pdf View full article
J3R: Joint Multi-task Learning of Ratings and Review Summaries for Explainable Recommendation (742)
Avinesh P.V.S. (Technische Universität Darmstadt), Yongli Ren (RMIT University), Christian M. Meyer (Technische Universität Darmstadt), Jeffrey Chan (RMIT University), Zhifeng Bao (RMIT University), Mark Sanderson (RMIT University)

We learn user preferences from ratings and reviews by using multi-task learning (MTL) of rating prediction and summarization of item reviews. Reviews of an item tend to describe detailed user preferences (e.g., the cast, genre, or screenplay of a movie). A summary of such a review or a rating describes an overall user experience of the item. Our objective is to learn latent vectors which are shared across rating prediction and review summary generation. Additionally, the learned latent vectors and the generated summary act as explanations for the recommendation. Our MTL-based approach J3R uses a multi-layer perceptron for rating prediction, combined with pointer-generator networks with an attention mechanism for the summarization component. We provide empirical evidence that the joint learning of rating prediction and summary generation is beneficial for recommendation by conducting experiments on the Yelp dataset and six domains of the Amazon 5-core dataset. Additionally, we provide two forms of explanation, visualizing (a) the user vectors on different topics of a domain, computed with our J3R approach, and (b) a ten-word summary of a review together with the attention highlights generated on the review based on the user-item vectors.
picture_as_pdf View full article
Automated Data Transformation with Inductive Programming and Dynamic Background Knowledge (743)
Lidia Contreras-Ochando (Universitat Politècnica de València), Cèsar Ferri (Universitat Politècnica de València), José Hernández-Orallo (Universitat Politècnica de València), Fernando Martínez-Plumed (Universitat Politècnica de València), María José Ramírez-Quintana (Universitat Politècnica de València), Susumu Katayama (University of Miyazaki)

Data quality is essential for database integration, machine learning and data science in general. Despite the increasing number of tools for data preparation, the most tedious tasks of data wrangling - and feature manipulation in particular - still resist automation, partly because the problem strongly depends on domain information. For instance, if the strings "17th of August of 2017" and "2017-08-17" are to be formatted into "08/17/2017" to be properly recognised by a data analytics tool, humans usually process this in two steps: (1) they recognise that this is about dates and (2) they apply conversions that are specific to the date domain. However, the mechanisms to manipulate dates are very different from those to manipulate addresses. This requires huge amounts of background knowledge, which usually becomes a bottleneck as the diversity of domains and formats increases. In this paper we help alleviate this problem by using inductive programming (IP) with dynamic background knowledge (BK), fuelled by a machine learning meta-model that selects the domain, the primitives, or both, from several descriptive features of the data wrangling problem. We illustrate these new alternatives on the automation of data format transformation and evaluate them on an integrated benchmark and code base for data wrangling, which we share publicly with the community.
picture_as_pdf View full article Reproducible Research
A Semi-Supervised and Online Learning Approach for Non-Intrusive Load Monitoring (844)
Hajer Salem (Institut Mines-Télécom Lille Douai; Manouba University), Moamar Sayed-Mouchaweh (Institut Mines-Télécom Lille Douai)

Non-Intrusive Load Monitoring (NILM) approaches aim at identifying the consumption of a single appliance from the total load provided by smart meters. Several research works based on Hidden Markov Models (HMM) have been developed for NILM where training is performed offline. However, these approaches suffer from several issues. First, they fail to generalize to unseen appliances with different configurations or brands than the ones used for training. Second, obtaining data about all active states of each appliance requires a long time, which is impractical for residents. Third, offline training requires storing a huge amount of data, forcing residents to share their consumption data with external servers and causing privacy issues. Therefore, in this paper, a new approach is proposed in order to tackle these issues. This approach is based on the use of an HMM conditioned on discriminant contextual features (e.g., time of usage, duration of usage). The conditional HMM (CHMM) is trained online using data related to a single appliance's consumption extracted from the aggregated load in order to adapt its parameters to the appliance's specificities (e.g., brand, configuration, etc.). Experiments are performed using real data from publicly available data sets, and comparative evaluations are performed on a publicly available NILM framework.
picture_as_pdf View full article
Compact Representation of a Multi-dimensional Combustion Manifold Using Deep Neural Networks (863)
Sushrut Bhalla (University of Waterloo), Matthew Yao (University of Waterloo), Jean-Pierre Hickey (University of Waterloo), Mark Crowley (University of Waterloo)

The computational challenges in turbulent combustion simulations stem from the physical complexities and multi-scale nature of the problem, which make it intractable to compute scale-resolving simulations. For most engineering applications, the large scale separation between the flame (typically sub-millimeter scale) and the characteristic turbulent flow (typically centimeter or meter scale) allows us to invoke simplifying assumptions - such as those underlying the flamelet model - to pre-compute all the chemical reactions and map them to a low-order manifold. The resulting manifold is then tabulated and looked up at run-time. As the physical complexity of combustion simulations increases (including radiation, soot formation, pressure variations, etc.), the dimensionality of the resulting manifold grows, which impedes efficient tabulation and look-up. In this paper we present a novel approach to model the multi-dimensional combustion manifold. We approximate the combustion manifold using a neural network function approximator and use it to predict the temperature and composition of the reaction. We present a novel training procedure which is developed to generate a smooth output curve for temperature over the course of a reaction. We then evaluate our work against the current approach of tabulation with linear interpolation in combustion simulations. We also provide an ablation study of our training procedure in the context of over-fitting in our model. The combustion dataset used for the modeling of combustion of H2 and O2 in this work is released alongside this paper.
picture_as_pdf View full article
CASTNet: Community-Attentive Spatio-Temporal Networks for Opioid Overdose Forecasting (900)
Ali Mert Ertugrul (University of Pittsburgh; Middle East Technical University, Ankara), Yu-Ru Lin (University of Pittsburgh), Tugba Taskaya-Temizel (Middle East Technical University, Ankara)

Opioid overdose is a growing public health crisis in the United States. This crisis, recognized as the "opioid epidemic," has widespread societal consequences, including the degradation of health and increases in crime rates and family problems. To improve overdose surveillance and to identify the areas in need of prevention efforts, in this work we focus on forecasting opioid overdose using real-time crime dynamics. Previous work identified various types of links between opioid use and criminal activities, such as financial motives and common causes. Motivated by these observations, we propose a novel spatio-temporal predictive model for opioid overdose forecasting that leverages the spatio-temporal patterns of crime incidents. Our proposed model incorporates multi-head attentional networks to learn different representation subspaces of features. This deep learning architecture, called "community-attentive" networks, allows the prediction for a given location to be optimized by a mixture of groups (i.e., communities) of regions. In addition, our proposed model allows for interpreting which features, from which communities, contribute most to predicting local incidents, as well as how these communities are captured through forecasting. Our results on two real-world overdose datasets indicate that our model achieves superior forecasting performance and provides meaningful interpretations in terms of the spatio-temporal relationships between the dynamics of crime and those of opioid overdose.
picture_as_pdf View full article Reproducible Research
Dynamic Principal Projection for Cost-Sensitive Online Multi-Label Classification (J30)
Hong-Min Chu, Kuan-Hao Huang, Hsuan-Tien Lin


link View full article
Aggregating Algorithm for Prediction of Packs (J02)
Dmitry Adamskiy, Anthony Bellotti, Raisa Dzhamtyrova, Yuri Kalnishkan


link View full article
Efficient Feature Selection Using Shrinkage Estimators (J31)
Konstantinos Sechidis, Laura Azzimonti, Adam Pocock, Giorgio Corani, James Weatherall, Gavin Brown


link View full article
Grouped Gaussian Processes for Solar Power Prediction (J26)
Astrid Dahl, Edwin V. Bonilla


link View full article
LSALSA: Accelerated Source Separation via Learned Sparse Coding (J28)
Benjamin Cowen, Apoorva Nandini Saridena, Anna Choromanska


link View full article
Data scarcity, robustness and extreme multi-label classification (J29)
Rohit Babbar, Bernhard Schölkopf


link View full article
Joint Detection of Malicious Domains and Infected Clients (J19)
Paul Prasse, René Knaebel, Lukáš Machlica, Tomáš Pevný, Tobias Scheffer


link View full article
A Flexible Probabilistic Framework for Large-Margin Mixture of Experts (J08)
Archit Sharma, Siddhartha Saxena, Piyush Rai


link View full article
Deep Collective Matrix Factorization for Augmented Multi-View Learning (J05)
Ragunathan Mariappan, Vaibhav Rajan


link View full article
Temporal Pattern Attention for Multivariate Time Series Forecasting (J32)
Shun-Yao Shih, Fan-Keng Sun, Hung-yi Lee


link View full article
Compatible Natural Gradient Policy Search (J24)
Joni Pajarinen, Hong Linh Thai, Riad Akrour, Jan Peters, Gerhard Neumann


link View full article
TD-Regularized Actor-Critic Methods (J09)
Simone Parisi, Voot Tangkaratt, Jan Peters, Mohammad Emtiyaz Khan


link View full article
On PAC-Bayesian Bounds for Random Forests (J17)
Stephan S. Lorenzen, Christian Igel, Yevgeny Seldin


link View full article
Nuclear Discrepancy for Single-Shot Batch Active Learning (J18)
Tom J. Viering, Jesse H. Krijthe, Marco Loog


link View full article
Efficient learning with robust gradient descent (J14)
Matthew J. Holland, Kazushi Ikeda


link View full article
CaDET: Interpretable Parametric Conditional Density Estimation with Decision Trees and Forests (J07)
Cyrus Cousins, Matteo Riondato


link View full article
On the analysis of adaptability in multi-source domain adaptation (J15)
Ievgen Redko, Amaury Habrard, Marc Sebban


link View full article
The Teaching Size: Computable Teachers and Learners for Universal Languages (J16)
Jan Arne Telle, José Hernández-Orallo, Cèsar Ferri


link View full article
Distribution-Free Uncertainty Quantification for Kernel Methods by Gradient Perturbations (J20)
Balázs Cs. Csáji, Krisztián B. Kis


link View full article
Improving latent variable descriptiveness by modelling rather than ad-hoc factors (J06)
Alex Mansbridge, Roberto Fierimonte, Ilya Feige, David Barber


link View full article
Classification with Label Noise: A Markov Chain Sampling Framework (J21)
Zijin Zhao, Lingyang Chu, Dacheng Tao, Jian Pei


link View full article
Finding lasting dense graphs (J10)
Konstantinos Semertzidis, Evaggelia Pitoura, Evimaria Terzi, Panayiotis Tsaparas


link View full article
Model-Free Inference of Diffusion Networks (J11)
Shoubo Hu, Bogdan Cautis, Zhitang Chen, Laiwan Chan, Yanhui Geng, Xiuqiang He


link View full article
Thu, 15:40 - 16:00 @ 0.001 (Poster@Thu) Clustering
Noise-free Latent Block Model for High Dimensional Data (J25)
Charlotte Laclau, Vincent Brault


link View full article
Deeply Supervised Model for Click-Through Rate Prediction in Sponsored Search (J03)
Jelena Gligorijevic, Djordje Gligorijevic, Ivan Stojkovic, Xiao Bai, Amit Goyal, Zoran Obradovic


link View full article
Robust active attacks on social graphs (J13)
Sjouke Mauw, Yunior Ramírez-Cruz, Rolando Trujillo-Rasua


link View full article
Temporal Density Extrapolation using a Dynamic Basis Approach (J01)
G. Krempl, D. Lang, V. Hofer


link View full article
Counts-of-Counts Similarity for prediction and search in relational data (J12)
Manfred Jaeger, Marco Lippi, Giovanni Pellegrini, Andrea Passerini


link View full article
Mining Skypatterns in Fuzzy Tensors (J23)
Nicolas Nadisic, Aurélien Coussat, Loïc Cerf


link View full article
Multi-location Visibility Query Processing Using Portion-based Transactional Modeling and Pattern Mining (J22)
Lakshmi Gangumalla, P. Krishna Reddy, Anirban Mondal


link View full article
Stochastic Gradient Hamiltonian Monte Carlo with Variance reduction for Bayesian inference (J27)
Zhize Li, Tianyi Zhang, Shuyu Cheng, Jun Zhu, Jian Li


link View full article
Tue, 15:40 - 16:00 @ 1.011 (Poster@Wed) Ranking
Rankboost+: An Improvement to Rankboost (J04)
Harold Connamacher (Case Western Reserve University), Nikil Pancha (Case Western Reserve University), Rui Liu (Case Western Reserve University), Soumya Ray (Case Western Reserve University)


link View full article
Tue, 14:40 - 15:00 @ 1.011 (Poster@Wed) Ranking
A Ranking Model Motivated by Nonnegative Matrix Factorization with Applications to Tennis Tournaments (44)
Rui Xia (National University of Singapore), Vincent Y. F. Tan (National University of Singapore), Louis Filstroff (IRIT, Université de Toulouse), Cédric Févotte (IRIT, Université de Toulouse)

We propose a novel ranking model that combines the Bradley-Terry-Luce probability model with a nonnegative matrix factorization framework to model and uncover the presence of latent variables that influence the performance of top tennis players. We derive an efficient, provably convergent, and numerically stable majorization-minimization-based algorithm to maximize the likelihood of datasets under the proposed statistical model. The model is tested on datasets involving the outcomes of matches between 20 top male and female tennis players over 14 major tournaments for men (including the Grand Slams and the ATP Masters 1000) and 16 major tournaments for women over the past 10 years. Our model automatically infers that the surface of the court (e.g., clay or hard court) is a key determinant of the performances of male players, but less so for females. Top players on various surfaces over this longitudinal period are also identified in an objective manner.
picture_as_pdf View full article Reproducible Research
Sample-Efficient Model-Free Reinforcement Learning with Off-Policy Critics (48)
Denis Steckelmacher (Vrije Universiteit Brussel), Hélène Plisnier (Vrije Universiteit Brussel), Diederik M. Roijers (VU Amsterdam), Ann Nowé (Vrije Universiteit Brussel)

Value-based reinforcement-learning algorithms provide state-of-the-art results in model-free discrete-action settings, and tend to outperform actor-critic algorithms. We argue that actor-critic algorithms are limited by their need for an on-policy critic. We propose Bootstrapped Dual Policy Iteration (BDPI), a novel model-free reinforcement-learning algorithm for continuous states and discrete actions, with an actor and several off-policy critics. Off-policy critics are compatible with experience replay, ensuring high sample-efficiency, without the need for off-policy corrections. The actor, by slowly imitating the average greedy policy of the critics, leads to high-quality and state-specific exploration, which we compare to Thompson sampling. Because the actor and critics are fully decoupled, BDPI is remarkably stable, and unusually robust to its hyper-parameters. BDPI is significantly more sample-efficient than Bootstrapped DQN, PPO, and ACKTR, on discrete, continuous and pixel-based tasks.
picture_as_pdf View full article Reproducible Research
Distributed Learning of Non-Convex Linear Models with One Round of Communication (55)
Mike Izbicki (Claremont McKenna College), Christian R. Shelton (UC Riverside)

We present the optimal weighted average (OWA) distributed learning algorithm for linear models. OWA achieves statistically optimal learning rates, uses only one round of communication, works on non-convex problems, and supports a fast cross validation procedure. The OWA algorithm first trains local models on each of the compute nodes; then a master machine merges the models using a second round of optimization. This second optimization uses only a small fraction of the data, and so has negligible computational cost. Compared with similar distributed estimators that merge locally trained models, OWA either has stronger statistical guarantees, is applicable to more models, or has a more computationally efficient merging procedure.
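A toy sketch of the two-round structure is shown below with squared loss for brevity; OWA itself targets general non-convex linear models, so this is only meant to convey how locally trained parameter vectors can be merged with a second, small optimization.

import numpy as np

def local_fit(X, y, reg=1e-3):
    # first round: each compute node solves a (regularized) least-squares problem locally
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + reg * np.eye(d), X.T @ y)

def owa_merge(local_params, X_small, y_small):
    # second round on the master: learn how to weight the local models,
    # using only a small fraction of the data
    W = np.stack(local_params)            # shape (num_nodes, d)
    Z = X_small @ W.T                     # each local model's prediction becomes a feature
    a = np.linalg.lstsq(Z, y_small, rcond=None)[0]
    return W.T @ a                        # merged parameter vector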
picture_as_pdf View full article
DEvIANT: Discovering Significant Exceptional (Dis-)Agreement Within Groups (58)
Adnene Belfodil (INSA Lyon), Wouter Duivesteijn (Technische Universiteit Eindhoven), Marc Plantevit (Univ Lyon), Sylvie Cazalens (INSA Lyon), Philippe Lamarre (INSA Lyon)

We strive to find contexts (i.e., subgroups of entities) under which exceptional (dis-)agreement occurs among a group of individuals, in any type of data featuring individuals (e.g., parliamentarians, customers) performing observable actions (e.g., votes, ratings) on entities (e.g., legislative procedures, movies). To this end, we introduce the problem of discovering statistically significant exceptional contextual intra-group agreement patterns. To handle the sparsity inherent to voting and rating data, we use Krippendorff's Alpha measure for assessing the agreement among individuals. We devise a branch-and-bound algorithm, named DEvIANT, to discover such patterns. DEvIANT exploits both closure operators and tight optimistic estimates. We derive analytic approximations for the confidence intervals (CIs) associated with patterns for a computationally efficient significance assessment. We prove that these approximate CIs are nested along specialization of patterns. This allows us to incorporate pruning properties into DEvIANT to quickly discard non-significant patterns. An empirical study on several datasets demonstrates the efficiency and the usefulness of DEvIANT.
picture_as_pdf View full article Reproducible Research
A Framework for Deep Constrained Clustering - Algorithms and Advances (62)
Hongjing Zhang (University of California, Davis), Sugato Basu (Google Research), Ian Davidson (University of California, Davis)

The area of constrained clustering has been extensively explored by researchers and used by practitioners. Constrained clustering formulations exist for popular algorithms such as k-means, mixture models, and spectral clustering but have several limitations. A fundamental strength of deep learning is its flexibility, and here we explore a deep learning framework for constrained clustering and in particular explore how it can extend the field of constrained clustering. We show that our framework can not only handle standard together/apart constraints (without the well-documented negative effects reported earlier) generated from labeled side information, but also more complex constraints generated from new types of side information such as continuous values and high-level domain knowledge.
picture_as_pdf View full article Reproducible Research
Exploiting the Earth's Spherical Geometry to Geolocate Images (63)
Mike Izbicki (Claremont McKenna College), Evangelos E. Papalexakis (University of California Riverside), Vassilis J. Tsotras (University of California Riverside)

Existing methods for geolocating images use standard classification or image retrieval techniques. These methods have poor theoretical properties because they do not take advantage of the earth's spherical geometry. In some cases, they require training data sets that grow exponentially with the number of feature dimensions. This paper introduces the first image geolocation method that exploits the earth's spherical geometry. Our method is based on the Mixture of von Mises-Fisher (MvMF) distribution, which is a spherical analogue of the popular Gaussian mixture model. We prove that this method requires only a dataset of size linear in the number of feature dimensions, and empirical results show that our method outperforms previous methods with orders of magnitude less training data and computation.
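The von Mises-Fisher component underlying the MvMF model has a simple closed form on the unit sphere; the sketch below evaluates its log-density for the three-dimensional case (the image features and mixture fitting handled in the paper are not shown).

import numpy as np

def to_unit_vector(lat_deg, lon_deg):
    # map latitude/longitude to a point on the unit sphere
    lat, lon = np.radians(lat_deg), np.radians(lon_deg)
    return np.array([np.cos(lat) * np.cos(lon),
                     np.cos(lat) * np.sin(lon),
                     np.sin(lat)])

def vmf_logpdf(x, mu, kappa):
    # von Mises-Fisher log-density on the 2-sphere (p = 3):
    #   f(x; mu, kappa) = kappa / (4 * pi * sinh(kappa)) * exp(kappa * mu . x)
    # for large kappa, replace log(sinh(kappa)) with kappa - log(2) to avoid overflow
    log_c = np.log(kappa) - np.log(4 * np.pi) - np.log(np.sinh(kappa))
    return log_c + kappa * np.dot(mu, x)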
picture_as_pdf View full article Reproducible Research
Unsupervised Sentence Embedding Using Document Structure-based Context (66)
Taesung Lee (IBM T.J. Watson Research Center), Youngja Park (IBM T.J. Watson Research Center)

We present a new unsupervised method for learning general-purpose sentence embeddings. Unlike existing methods which rely on local contexts, such as words inside the sentence or immediately neighboring sentences, our method selects, for each target sentence, influential sentences from the entire document based on the document structure. We identify a dependency structure of sentences using metadata and text styles. Additionally, we propose an out-of-vocabulary word handling technique for the neural network outputs to model many domain-specific terms which were mostly discarded by existing sentence embedding training methods. We empirically show that the model relies on the proposed dependencies more than the sequential dependency in many cases. We also validate our model on several NLP tasks showing 23
picture_as_pdf View full article
String Sanitization: A Combinatorial Approach (73)
Giulia Bernardini (University of Milano-Bicocca), Huiping Chen (King's College London), Alessio Conte (University of Pisa), Roberto Grossi (University of Pisa; INRIA, Lyon), Grigorios Loukides (King's College London), Nadia Pisanti (University of Pisa; INRIA, Lyon), Solon P. Pissis (INRIA, Lyon; CWI, Amsterdam), Giovanna Rosone (University of Pisa)

String data are often disseminated to support applications such as location-based service provision or DNA sequence analysis. This dissemination, however, may expose sensitive patterns that model confidential knowledge (e.g., trips to mental health clinics from a string representing a user's location history). In this paper, we consider the problem of sanitizing a string by concealing the occurrences of sensitive patterns, while maintaining data utility. First, we propose a time-optimal algorithm, TFS-ALGO, to construct the shortest string preserving the order of appearance and the frequency of all non-sensitive patterns. Such a string allows accurately performing tasks based on the sequential nature and pattern frequencies of the string. Second, we propose a time-optimal algorithm, PFS-ALGO, which preserves a partial order of appearance of non-sensitive patterns but produces a much shorter string that can be analyzed more efficiently. The strings produced by either of these algorithms may reveal the location of sensitive patterns. In response, we propose a heuristic, MCSR-ALGO, which replaces letters in these strings with carefully selected letters, so that sensitive patterns are not reinstated and occurrences of spurious patterns are prevented. We implemented our sanitization approach that applies TFS-ALGO, PFS-ALGO and then MCSR-ALGO and experimentally show that it is effective and efficient.
picture_as_pdf View full article Reproducible Research
A Differentially Private Kernel Two-Sample Test (85)
Anant Raj (Max Planck Institute for Intelligent Systems, Tübingen), Ho Chung Leon Law (University of Oxford), Dino Sejdinovic (University of Oxford), Mijung Park (Max Planck Institute for Intelligent Systems, Tübingen)

Kernel two-sample testing is a useful statistical tool in determining whether data samples arise from different distributions without imposing any parametric assumptions on those distributions. However, raw data samples can expose sensitive information about individuals who participate in scientific studies, which makes the current tests vulnerable to privacy breaches. Hence, we design a new framework for kernel two-sample testing conforming to differential privacy constraints, in order to guarantee the privacy of subjects in the data. Unlike existing differentially private parametric tests that simply add noise to data, kernel-based testing imposes a challenge due to a complex dependence of test statistics on the raw data, as these statistics correspond to estimators of distances between representations of probability measures in Hilbert spaces. Our approach considers finite dimensional approximations to those representations. As a result, a simple chi-squared test is obtained, where a test statistic depends on a mean and covariance of empirical differences between the samples, which we perturb for a privacy guarantee. We investigate the utility of our framework in two realistic settings and conclude that our method requires only a relatively modest increase in sample size to achieve a similar level of power to the non-private tests in both settings.
picture_as_pdf View full article Reproducible Research
Linearly Constrained Weights: Reducing Activation Shift for Faster Training of Neural Networks (98)
Takuro Kutsuna (Toyota Central R&D Labs. Inc.)

In this paper, we first identify activation shift, a simple but remarkable phenomenon in a neural network in which the preactivation value of a neuron has non-zero mean that depends on the angle between the weight vector of the neuron and the mean of the activation vector in the previous layer. We then propose linearly constrained weights (LCW) to reduce the activation shift in both fully connected and convolutional layers. The impact of reducing the activation shift in a neural network is studied from the perspective of how the variance of variables in the network changes through layer operations in both forward and backward chains. We also discuss its relationship to the vanishing gradient problem. Experimental results show that LCW enables a deep feedforward network with sigmoid activation functions to be trained efficiently by resolving the vanishing gradient problem. Moreover, combined with batch normalization, LCW improves generalization performance of both feedforward and convolutional networks.
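The activation shift described above is easy to observe directly: the mean preactivation of a neuron equals the dot product of its weight vector with the mean activation of the previous layer, and therefore tracks the cosine of the angle between them. The sketch below only measures this effect; it does not implement the LCW constraint itself.

import numpy as np

def activation_shift(W, activations):
    # W: (n_out, n_in) weight matrix; activations: (n_samples, n_in) previous-layer outputs
    mu = activations.mean(axis=0)                       # mean activation vector
    mean_preact = W @ mu                                # mean preactivation per neuron
    cos = mean_preact / (np.linalg.norm(W, axis=1) * np.linalg.norm(mu) + 1e-12)
    return mean_preact, cos                             # the shift tracks cos(angle)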
picture_as_pdf View full article
Joint Multi-Source Reduction (102)
Lei Zhang (Institute of Information Engineering, Chinese Academy of Sciences), Shupeng Wang (Institute of Information Engineering, Chinese Academy of Sciences), Xin Jin (National Computer Network Emergency Response Technical Team/Coordination Center of China), Siyu Jia (Institute of Information Engineering, Chinese Academy of Sciences)

The redundant sources problem in multi-source learning arises in various real-world applications such as multimedia analysis, information retrieval, and medical diagnosis, in which the heterogeneous representations from different sources always have three-way redundancies. More seriously, the redundancies cost a lot of storage space, cause high computational time, and degrade the performance of the learner. This paper is an attempt to jointly reduce redundant sources. Specifically, a novel Heterogeneous Manifold Smoothness Learning (HMSL) model is proposed to linearly map multi-source data to a low-dimensional feature-isomorphic space, in which the information-correlated representations are close along the manifold while the semantic-complementary instances are close in Euclidean distance. Furthermore, to eliminate three-way redundancies, we present a new Correlation-based Multi-source Redundancy Reduction (CMRR) method with ℓ2,1-norm equation and generalized elementary transformation constraints to reduce redundant sources in the learned feature-isomorphic space. Comprehensive empirical investigations confirm the promise of our proposed framework.
picture_as_pdf View full article
Scalable Large Margin Gaussian Process Classification (116)
Martin Wistuba (IBM Research), Ambrish Rawat (IBM Research)

We introduce a new Large Margin Gaussian Process (LMGP) model by formulating a pseudo-likelihood for a generalised multi-class hinge loss. We derive a highly scalable training objective for the proposed model using variational inference and an inducing point approximation. Additionally, we consider the joint learning of LMGP-DNN, which combines the proposed model with traditional deep learning methods to enable learning for unstructured data. We demonstrate the effectiveness of the Large Margin GP with respect to both training time and accuracy in an extensive classification experiment consisting of 68 structured and two unstructured data sets. Finally, we highlight the key capability and usefulness of our model in yielding prediction uncertainty for classification by demonstrating its effectiveness in the tasks of large-scale active learning and detection of adversarial images.
picture_as_pdf View full article
Adjustment Criteria for Recovering Causal Effects from Missing Data (122)
Mojdeh Saadati (Iowa State University), Jin Tian (Iowa State University)

Confounding bias, missing data, and selection bias are three common obstacles to valid causal inference in the data sciences. Covariate adjustment is the most pervasive technique for recovering causal effects from confounding bias. In this paper we introduce a covariate adjustment formulation for controlling confounding bias in the presence of missing-not-at-random data and develop a necessary and sufficient condition for recovering causal effects using the adjustment. We also introduce an adjustment formulation for controlling both confounding and selection biases in the presence of missing data and develop a necessary and sufficient condition for valid adjustment. Furthermore, we present an algorithm that lists all valid adjustment sets and an algorithm that finds a valid adjustment set containing the minimum number of variables, which are useful for researchers interested in selecting adjustment sets with desired properties.
picture_as_pdf View full article
Copy Mechanism and Tailored Training for Character-based Data-to-text Generation (145)
Marco Roberti (University of Turin), Giovanni Bonetta (University of Turin), Rossella Cancelliere (University of Turin), Patrick Gallinari (Sorbonne Université; Criteo AI Lab)

In the last few years, many different methods have been focusing on using deep recurrent neural networks for natural language generation. The most widely used sequence-to-sequence neural methods are word-based: as such, they need a pre-processing step called delexicalization (and, conversely, relexicalization) to deal with uncommon or unknown words. These forms of processing, however, give rise to models that depend on the vocabulary used and are not completely neural. In this work, we present an end-to-end sequence-to-sequence model with attention mechanism which reads and generates at a character level, no longer requiring delexicalization, tokenization, nor even lowercasing. Moreover, since characters constitute the common "building blocks" of every text, it also allows a more general approach to text generation, enabling the possibility to exploit transfer learning for training. These skills are obtained thanks to two major features: (i) the possibility to alternate between the standard generation mechanism and a copy one, which allows the model to directly copy input facts to produce outputs, and (ii) the use of an original training pipeline that further improves the quality of the generated texts. We also introduce a new dataset called E2E+, designed to highlight the copying capabilities of character-based models, which is a modified version of the well-known E2E dataset used in the E2E Challenge. We tested our model according to five broadly accepted metrics (including the widely used BLEU), showing that it yields competitive performance with respect to both character-based and word-based approaches.
picture_as_pdf View full article Reproducible Research
Thu, 15:00 - 15:20 @ 0.001 (Poster@Thu) Clustering
A Framework for Parallelizing Hierarchical Clustering Methods (149)
Silvio Lattanzi (Google Zürich), Thomas Lavastida (Carnegie Mellon University), Kefu Lu (Washington and Lee University), Benjamin Moseley (Carnegie Mellon University)

Hierarchical clustering is a fundamental tool in data mining, machine learning and statistics. Popular hierarchical clustering algorithms include top-down divisive approaches such as bisecting k-means, k-median, and k-center and bottom-up agglomerative approaches such as single-linkage, average-linkage, and centroid-linkage. Unfortunately, only a few scalable hierarchical clustering algorithms are known, mostly based on the single-linkage algorithm. So, as datasets increase in size every day, there is a pressing need to scale other popular methods. We introduce efficient distributed algorithms for bisecting k-means, k-median, and k-center as well as centroid-linkage. In particular, we first formalize a notion of closeness for a hierarchical clustering algorithm, and then we use this notion to design new scalable distributed methods with strong worst case bounds on the running time and the quality of the solutions. Finally, we show experimentally that the introduced algorithms are efficient and close to their sequential variants in practice.
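As a reference point, a sequential bisecting k-means (one of the methods the paper scales) can be sketched in a few lines; the distributed algorithms introduced in the paper are more involved and are not reproduced here.

import numpy as np
from sklearn.cluster import KMeans

def bisecting_kmeans(X, n_clusters):
    # top-down divisive clustering: repeatedly split the largest cluster into two
    clusters, splits = [np.arange(len(X))], []
    while len(clusters) < n_clusters:
        i = max(range(len(clusters)), key=lambda j: len(clusters[j]))
        idx = clusters.pop(i)
        labels = KMeans(n_clusters=2, n_init=10).fit_predict(X[idx])
        left, right = idx[labels == 0], idx[labels == 1]
        clusters += [left, right]
        splits.append((idx, left, right))   # record each split to recover the hierarchy
    return clusters, splits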
picture_as_pdf View full article
Continual Rare-Class Recognition with Emerging Novel Subclasses (152)
Hung Nguyen (Carnegie Mellon University), Xuejian Wang (Carnegie Mellon University), Leman Akoglu (Carnegie Mellon University)

Given a labeled dataset that contains a rare (or minority) class of of-interest instances, as well as a large class of instances that are not of interest, how can we learn to recognize future of-interest instances over a continuous stream? We introduce RaRecognize, which (i) estimates a general decision boundary between the rare and the majority class, (ii) learns to recognize individual rare subclasses that exist within the training data, as well as (iii) flags instances from previously unseen rare subclasses as newly emerging. The learner in (i) is general in the sense that by construction it is dissimilar to the specialized learners in (ii), thus distinguishes minority from the majority without overly tuning to what is seen in the training data. Thanks to this generality, RaRecognize ignores all future instances that it labels as majority and recognizes the recurrent as well as emerging rare subclasses only. This saves effort at test time as well as ensures that the model size grows moderately over time as it only maintains specialized minority learners. Through extensive experiments, we show that RaRecognize outperforms state-of-the-art baselines on three real-world datasets that contain corporate-risk and disaster documents as rare classes.
picture_as_pdf View full article Reproducible Research
Unsupervised and Active Learning using Maximin-based Anomaly Detection (161)
Zahra Ghafoori (University of Melbourne), James C. Bezdek (University of Melbourne), Christopher Leckie (University of Melbourne), Shanika Karunasekera (University of Melbourne)

Unsupervised anomaly detection is commonly performed using a distance or density based technique, such as K-Nearest neighbours, Local Outlier Factor or One-class Support Vector Machines. One-class Support Vector Machines reduce the computational cost of testing new data by providing sparse solutions. However, all these techniques have relatively high computational requirements for training. Moreover, identifying anomalies based solely on density or distance is not sufficient when both point (isolated) and cluster anomalies exist in an unlabelled training set. Finally, these unsupervised anomaly detection techniques are not readily adapted for active learning, where the training algorithm should identify examples for which labelling would make a significant impact on the accuracy of the learned model. In this paper, we propose a novel technique called Maximin-based Anomaly Detection that addresses these challenges by selecting a representative subset of data in combination with a kernel-based model construction. We show that the proposed technique (a) provides a statistically significant improvement in the accuracy as well as the computation time required for training and testing compared to several benchmark unsupervised anomaly detection techniques, and (b) effectively uses active learning with a limited budget.
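The representative-subset idea can be conveyed with the classic maximin (farthest-point) selection rule sketched below; the kernel-based model built on top of the selected points, and the active-learning extension, are not shown.

import numpy as np

def maximin_subset(X, k, start=0):
    # greedily pick points whose minimum distance to the already selected set is maximal
    selected = [start]
    dist = np.linalg.norm(X - X[start], axis=1)
    for _ in range(k - 1):
        nxt = int(np.argmax(dist))
        selected.append(nxt)
        dist = np.minimum(dist, np.linalg.norm(X - X[nxt], axis=1))
    return np.array(selected)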
picture_as_pdf View full article Reproducible Research
Integrating Learning and Reasoning with Deep Logic Models (182)
Giuseppe Marra (University of Florence; University of Siena), Francesco Giannini (University of Siena), Michelangelo Diligenti (University of Siena), Marco Gori (University of Siena)

Deep learning is very effective at jointly learning feature representations and classification models, especially when dealing with high dimensional input patterns. Probabilistic logic reasoning, on the other hand, is capable of taking consistent and robust decisions in complex environments. The integration of deep learning and logic reasoning is still an open research problem, and it is considered to be the key to the development of real intelligent agents. This paper presents Deep Logic Models, which are deep graphical models integrating deep learning and logic reasoning both for learning and inference. Deep Logic Models create an end-to-end differentiable architecture, where deep learners are embedded into a network implementing a continuous relaxation of the logic knowledge. The learning process jointly learns the weights of the deep learners and the meta-parameters controlling the high-level reasoning. The experimental results show that the proposed methodology overcomes the limitations of the other approaches that have been proposed to bridge deep learning and reasoning.
picture_as_pdf View full article
LYRICS: a General Interface Layer to Integrate Logic Inference and Deep Learning (183)
Giuseppe Marra (University of Florence; University of Siena), Francesco Giannini (University of Siena), Michelangelo Diligenti (University of Siena), Marco Gori (University of Siena)

In spite of the amazing results obtained by deep learning in many applications, a real intelligent behavior of an agent acting in a complex environment is likely to require some kind of higher-level symbolic inference. Therefore, there is a clear need for the definition of a general and tight integration between low-level tasks, which process sensory data and can be effectively handled using deep learning techniques, and the logic reasoning that allows humans to take decisions in complex environments. This paper presents LYRICS, a generic interface layer for AI, which is implemented in TensorFlow (TF). LYRICS provides an input language that allows users to define arbitrary First Order Logic (FOL) background knowledge. The predicates and functions of the FOL knowledge can be bound to any TF computational graph, and the formulas are converted into a set of real-valued constraints, which participate in the overall optimization problem. This makes it possible to learn the weights of the learners under the constraints imposed by the prior knowledge. The framework is extremely general as it imposes no restrictions in terms of which models or knowledge can be integrated. In this paper, we show the generality of the approach through some use cases of the presented language, including model checking, supervised learning and collective classification.
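To give a flavour of how a logic formula becomes a real-valued constraint, the sketch below relaxes the rule "forall x: A(x) -> B(x)" with the Lukasiewicz implication and turns its violation into a loss term; this is a generic illustration, not the LYRICS input language or its TF bindings.

import torch

def implies(a, b):
    # Lukasiewicz relaxation of a -> b for truth degrees in [0, 1]
    return torch.clamp(1.0 - a + b, max=1.0)

def rule_loss(pred_a, pred_b):
    # pred_a, pred_b: outputs in [0, 1] of two neural predicates evaluated on the same inputs
    satisfaction = implies(pred_a, pred_b)
    return (1.0 - satisfaction).mean()   # added to the supervised loss during training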
picture_as_pdf View full article
Online Linear Models for Edge Computing (185)
Hadar Sivan (Technion), Moshe Gabel (University of Toronto), Assaf Schuster (Technion)

Maintaining an accurate trained model on an infinite data stream is challenging due to concept drifts that render a learned model inaccurate. Updating the model periodically can be expensive, and so traditional approaches for computationally limited devices involve a variation of online or incremental learning, which tend to be less robust. The advent of heterogeneous architectures and Internet-connected devices gives rise to a new opportunity. A weak processor can call upon a stronger processor or a cloud server to perform a complete batch training pass once a concept drift is detected - trading power or network bandwidth for increased accuracy. We capitalize on this opportunity in two steps. We first develop a computationally efficient bound for changes in any linear model with convex, differentiable loss. We then propose a sliding window-based algorithm that uses a small number of batch model computations to maintain an accurate model of the data stream. It uses the bound to continuously evaluate the difference between the parameters of the existing model and a hypothetical optimal model, triggering computation only as needed. Empirical evaluation on real and synthetic datasets shows that our proposed algorithm adapts well to concept drifts and provides a better tradeoff between the number of model computations and model accuracy than classic concept drift detectors. When predicting changes in electricity prices, for example, we achieve 6
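The overall control loop can be pictured as below; change_bound stands in for the paper's bound on the distance between the current parameters and the optimum on the window, and batch_train for the expensive pass offloaded to a stronger processor or cloud server. Both are placeholders, not the paper's API.

from collections import deque

def maintain_model(stream, batch_train, change_bound, threshold, window_size=1000):
    # stream yields (x, y) pairs; a full batch retrain is triggered only when needed
    window = deque(maxlen=window_size)
    model = None
    for x, y in stream:
        window.append((x, y))
        if model is None or change_bound(model, window) > threshold:
            model = batch_train(window)   # expensive, but infrequent
        yield x, model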
picture_as_pdf View full article
Novel Dense Subgraph Discovery Primitives: Risk Aversion and Exclusion Queries (192)
Charalampos E. Tsourakakis (Boston University), Tianyi Chen (Boston University), Naonori Kakimura (Keio University), Jakub Pachocki (OpenAI)

In the densest subgraph problem, given an undirected graph G(V,E,w) with non-negative edge weights, we are asked to find a set of nodes S⊆ V that maximizes the degree density w(S)/|S|, where w(S) is the sum of the weights of the edges in the graph induced by S. This problem is solvable in polynomial time, and in general is well studied. But what happens when the edge weights are negative? Is the problem still solvable in polynomial time? Also, why should we care about the densest subgraph problem in the presence of negative weights? In this work we answer the aforementioned questions. Specifically, we provide two novel graph mining primitives that are applicable to a wide variety of applications. Our primitives can be used to answer questions such as "how can we find a dense subgraph in Twitter with lots of replies and mentions but no follows?" and "how do we extract a dense subgraph with high expected reward and low risk from an uncertain graph?". We formulate both problems mathematically as special instances of dense subgraph discovery in graphs with negative weights. We study the hardness of the problem, and we prove that the problem in general is NP-hard, but we also provide sufficient conditions under which it is poly-time solvable. We design an efficient approximation algorithm that works best in the presence of small negative weights, and an effective heuristic for the more general case. Finally, we perform experiments on various real-world datasets that verify the value of the proposed primitives, and the effectiveness of our proposed algorithms.
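For reference, a small sketch of the classical non-negative-weight setting the paper starts from: the degree density objective and the standard greedy peeling baseline (not the paper's algorithms for negative weights); the toy edge weights are made up.

```python
from collections import defaultdict

def density(S, w):
    """Degree density w(S)/|S| for a node set S and edge weights w[(u, v)]."""
    inside = sum(wt for (u, v), wt in w.items() if u in S and v in S)
    return inside / len(S) if S else 0.0

def greedy_peel(nodes, w):
    """Classical greedy peeling for non-negative weights: repeatedly remove
    the node with the smallest weighted degree and keep the densest set seen."""
    S = set(nodes)
    best, best_density = set(S), density(S, w)
    while len(S) > 1:
        deg = defaultdict(float)
        for (u, v), wt in w.items():
            if u in S and v in S:
                deg[u] += wt
                deg[v] += wt
        S.remove(min(S, key=lambda n: deg[n]))
        d = density(S, w)
        if d > best_density:
            best, best_density = set(S), d
    return best, best_density

edges = {("a", "b"): 1.0, ("b", "c"): 1.0, ("a", "c"): 1.0, ("c", "d"): 0.1}
print(greedy_peel({"a", "b", "c", "d"}, edges))  # ({'a', 'b', 'c'}, 1.0)
```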
picture_as_pdf View full article Reproducible Research
Interpretable Discriminative Dimensionality Reduction and Feature Selection on the Manifold (210)
Babak Hosseini (Bielefeld University), Barbara Hammer (Bielefeld University)

Dimensionality reduction (DR) on the manifold includes effective methods which project the data from an implicit relational space onto a vectorial space. Regardless of the achievements in this area, these algorithms suffer from the lack of interpretation of the projection dimensions. Therefore, it is often difficult to explain the physical meaning behind the embedding dimensions. In this research, we propose the interpretable kernel DR algorithm (I-KDR) as a new algorithm which maps the data from the feature space to a lower dimensional space where the classes are more condensed with less overlapping. Besides, the algorithm creates the dimensions upon local contributions of the data samples, which makes it easier to interpret them by class labels. Additionally, we efficiently fuse the DR with the feature selection task to select the most relevant features of the original space to the discriminative objective. Based on the empirical evidence, I-KDR provides better interpretations for embedding dimensions as well as higher discriminative performance in the embedded space compared to the state-of-the-art and popular DR algorithms.
picture_as_pdf View full article Reproducible Research
On the Stability of Feature Selection in the Presence of Feature Correlations (211)
Konstantinos Sechidis (University of Manchester), Konstantinos Papangelou (University of Manchester), Sarah Nogueira (Criteo, Paris), James Weatherall (Advanced Analytics Centre, Global Medicines Development, AstraZeneca, Cambridge), Gavin Brown (University of Manchester)

Feature selection is central to modern data science. The 'stability' of a feature selection algorithm refers to the sensitivity of its choices to small changes in training data. This is, in effect, the robustness of the chosen features. This paper considers the estimation of stability when we expect strong pairwise correlations, otherwise known as feature redundancy. We demonstrate that existing measures are inappropriate here, as they systematically underestimate the true stability, giving an overly pessimistic view of a feature set. We propose a new statistical measure which overcomes this issue, and generalises previous work.
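To make the object of study concrete, here is a minimal sketch of a conventional stability estimate (average pairwise Jaccard overlap of feature subsets across bootstrap resamples), the kind of set-overlap measure the paper argues is overly pessimistic under feature redundancy; the selector, k, and resample count are arbitrary choices, not the paper's new measure.

```python
import numpy as np
from itertools import combinations
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif

def selection_stability(X, y, k=10, n_resamples=20, seed=0):
    """Average pairwise Jaccard similarity of the feature subsets selected on
    bootstrap resamples; a conventional set-overlap stability estimate."""
    rng = np.random.default_rng(seed)
    subsets = []
    for _ in range(n_resamples):
        idx = rng.integers(0, len(y), size=len(y))
        selector = SelectKBest(f_classif, k=k).fit(X[idx], y[idx])
        subsets.append(frozenset(np.flatnonzero(selector.get_support())))
    sims = [len(a & b) / len(a | b) for a, b in combinations(subsets, 2)]
    return float(np.mean(sims))

X, y = make_classification(n_samples=300, n_features=30, random_state=0)
print(selection_stability(X, y))
```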
picture_as_pdf View full article Reproducible Research
The Elliptical Basis Function Data Descriptor (EBFDD) Network - A One-Class Classification Approach to Anomaly Detection (212)
Mehran H. Z. Bazargani (The Insight Centre for Data Analytics, School of Computer Science, University College Dublin), Brian Mac Namee (The Insight Centre for Data Analytics, School of Computer Science, University College Dublin)

This paper introduces the Elliptical Basis Function Data Descriptor (EBFDD) network, a one-class classification approach to anomaly detection based on Radial Basis Function (RBF) neural networks. The EBFDD network uses elliptical basis functions, which allows it to learn sophisticated decision boundaries while retaining the advantages of a shallow network. We propose a novel cost function whose minimisation results in a trained anomaly detector that only requires examples of the normal class at training time. The paper includes a large benchmark experiment that evaluates the performance of the EBFDD network and compares it to state-of-the-art one-class classification algorithms, including the One-Class Support Vector Machine and the Isolation Forest. The experiments show that, overall, the EBFDD network outperforms the state-of-the-art approaches.
picture_as_pdf View full article Reproducible Research
Learning 3D Navigation Protocols on Touch Interfaces with Cooperative Multi-Agent Reinforcement Learning (213)
Quentin Debard (Itekube, LIRIS), Jilles Steeve Dibangoye (Inria, CITI-Lab, INSA-Lyon), Stéphane Canu (LITIS, INSA-Rouen), Christian Wolf (LIRIS, Inria, CITI-Lab, INSA-Lyon)

Using touch devices to navigate in virtual 3D environments such as computer-assisted design (CAD) models or geographical information systems (GIS) is inherently difficult for humans, as the 3D operations have to be performed by the user on a 2D touch surface. This ill-posed problem is classically solved with a fixed and handcrafted interaction protocol, which must be learned by the user. We propose to automatically learn a new interaction protocol that maps a 2D user input to 3D actions in virtual environments using reinforcement learning (RL). A fundamental problem of RL methods is the vast amount of interactions often required, which are difficult to come by when humans are involved. To overcome this limitation, we make use of two collaborative agents. The first agent models the human by learning to perform the 2D finger trajectories. The second agent acts as the interaction protocol, interpreting the 2D finger trajectories from the first agent and translating them into 3D operations. We restrict the learned 2D trajectories to be similar to a training set of collected human gestures by first performing state representation learning, prior to reinforcement learning. This state representation learning is addressed by projecting the gestures into a latent space learned by a variational autoencoder (VAE).
picture_as_pdf View full article
Unjustified Classification Regions and Counterfactual Explanations In Machine Learning (226)
Thibault Laugel (Sorbonne Université), Marie-Jeanne Lesot (Sorbonne Université), Christophe Marsala (Sorbonne Université), Xavier Renard (AXA, Paris), Marcin Detyniecki (Sorbonne Université; AXA, Paris; Polish Academy of Science)

Post-hoc interpretability approaches, although powerful tools to generate explanations for predictions made by a trained black-box model, have been shown to be vulnerable to issues caused by lack of robustness of the classifier. In particular, this paper focuses on the notion of explanation justification, defined as connectedness to ground-truth data, in the context of counterfactuals. In this work, we explore the extent of the risk of generating unjustified explanations. We propose an empirical study to assess the vulnerability of classifiers and show that the chosen learning algorithm heavily impacts the vulnerability of the model. Additionally, we show that state-of-the-art post-hoc counterfactual approaches can minimize the impact of this risk by generating less local explanations.
picture_as_pdf View full article Reproducible Research
Deep Eyedentification: Biometric Identification using Micro-Movements of the Eye (231)
Lena A. Jäger (University of Potsdam), Silvia Makowski (University of Potsdam), Paul Prasse (University of Potsdam), Sascha Liehr (Independent researcher), Maximilian Seidler (University of Potsdam), Tobias Scheffer (University of Potsdam)

We study involuntary micro-movements of the eye for biometric identification. While prior studies extract lower-frequency macro-movements from the output of video-based eye-tracking systems and engineer explicit features of these macro-movements, we develop a deep convolutional architecture that processes the raw eye-tracking signal. Compared to prior work, the network attains a lower error rate by one order of magnitude and is faster by two orders of magnitude: it identifies users accurately within seconds.
picture_as_pdf View full article Reproducible Research
SLSGD: Secure and Efficient Distributed On-device Machine Learning (239)
Cong Xie (University of Illinois at Urbana-Champaign), Oluwasanmi Koyejo (University of Illinois at Urbana-Champaign), Indranil Gupta (University of Illinois at Urbana-Champaign)

We consider distributed on-device learning with limited communication and security requirements. We propose a new robust distributed optimization algorithm with efficient communication and attack tolerance. The proposed algorithm has provable convergence and robustness under non-IID settings. Empirical results show that the proposed algorithm stabilizes the convergence and tolerates data poisoning on a small number of workers.
picture_as_pdf View full article Reproducible Research
Neural Control Variates for Monte Carlo Variance Reduction (265)
Ruosi Wan (Peking University), Mingjun Zhong (University of Lincoln), Haoyi Xiong (Baidu Inc.), Zhanxing Zhu (Peking University; Beijing Institute of Big Data Research)

In statistics and machine learning, the approximation of an intractable integral is often achieved by using an unbiased Monte Carlo estimator, but the variance of the estimate is generally high in many applications. Control variate approaches are well known to reduce the variance of the estimation. These control variates are typically constructed by employing predefined parametric functions or polynomials, determined by using samples drawn from the relevant distributions. Instead, we propose to construct the control variates by learning neural networks to handle the cases when test functions are complex. In many applications, obtaining a large number of samples for Monte Carlo estimation is expensive, and the adoption of the original loss function may result in severe overfitting when training a neural network. This issue has not been reported in the literature on control variates with neural networks. We thus further introduce constrained control variates with neural networks to alleviate the overfitting issue. We apply the proposed control variates to both toy and real data problems, including a synthetic data problem, Bayesian model evidence evaluation, and Bayesian neural networks. Experimental results demonstrate that our method can achieve significant variance reduction compared to other methods.
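For background, a minimal sketch of the classical (non-neural) control-variate estimator that the neural approach generalises; the target f, the control g with known mean, and the sample size are illustrative.

```python
import numpy as np

def cv_estimate(f, g, g_mean, samples):
    """Control-variate Monte Carlo estimate of E[f(X)], using a control g with
    known mean g_mean; beta is the variance-minimising coefficient."""
    fx, gx = f(samples), g(samples)
    beta = np.cov(fx, gx)[0, 1] / np.var(gx)
    return np.mean(fx - beta * (gx - g_mean))

rng = np.random.default_rng(0)
x = rng.normal(size=100_000)
f = lambda t: np.exp(t)   # target: E[exp(X)] = exp(0.5) for X ~ N(0, 1)
g = lambda t: t           # control variate with known mean 0
print(cv_estimate(f, g, 0.0, x), np.exp(0.5))
```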
picture_as_pdf View full article
Node Representation Learning for Directed Graphs (275)
Megha Khosla (L3S Research Center, Hannover), Jurek Leonhardt (L3S Research Center, Hannover), Wolfgang Nejdl (L3S Research Center, Hannover), Avishek Anand (L3S Research Center, Hannover)

We propose a novel approach for learning node representations in directed graphs, which maintains separate views or embedding spaces for the two distinct node roles induced by the directionality of the edges. We argue that previous approaches either fail to encode the edge directionality or their encodings cannot be generalized across tasks. With our simple alternating random walk strategy, we generate role-specific vertex neighborhoods and train node embeddings in their corresponding source/target roles while fully exploiting the semantics of directed graphs. We also unearth the limitations of evaluations on directed graphs in previous works and propose a clear strategy for evaluating link prediction and graph reconstruction in directed graphs. We conduct extensive experiments on several real-world datasets to showcase our effectiveness on link prediction, node classification and graph reconstruction tasks. We show that the embeddings from our approach are indeed robust, generalizable and well performing across multiple kinds of tasks and graphs. We consistently outperform all baselines on the node classification task. In addition to providing a theoretical interpretation of our method, we also show that we are considerably more robust than the other directed graph approaches.
picture_as_pdf View full article Reproducible Research
Adversarial Invariant Feature Learning with Accuracy Constraint for Domain Generalization (278)
Kei Akuzawa (University of Tokyo), Yusuke Iwasawa (University of Tokyo), Yutaka Matsuo (University of Tokyo)

Learning domain-invariant representations is a dominant approach for domain generalization (DG), where we need to build a classifier that is robust to domain shifts. However, previous domain-invariance-based methods overlooked the underlying dependency of classes on domains, which is responsible for the trade-off between classification accuracy and domain invariance. Because the primary purpose of DG is to classify unseen domains rather than to achieve invariance itself, improving the invariance can negatively affect DG performance under this trade-off. To overcome the problem, this study first expands the analysis of the trade-off by Xie et al. and introduces the notion of accuracy-constrained domain invariance, i.e., the maximum domain invariance within a range that does not interfere with accuracy. We then propose a novel method, adversarial feature learning with accuracy constraint (AFLAC), which explicitly targets this invariance during adversarial training. Empirical validations show that the performance of AFLAC is superior to that of domain-invariance-based methods on a synthetic dataset and three real-world datasets, supporting the importance of considering the dependency and the efficacy of the proposed method.
picture_as_pdf View full article Reproducible Research
Data Association with Gaussian Processes (286)
Markus Kaiser (Siemens AG; Technical University of Munich), Clemens Otte (Siemens AG), Thomas A. Runkler (Siemens AG; Technical University of Munich), Carl Henrik Ek (University of Bristol)

The data association problem is concerned with separating data coming from different generating processes, for example when data comes from different data sources, contains significant noise, or exhibits multimodality. We present a fully Bayesian approach to this problem. Our model is capable of simultaneously solving the data association problem and the induced supervised learning problem. Underpinning our approach is the use of Gaussian process priors to encode the structure of both the data and the data associations. We present an efficient learning scheme based on doubly stochastic variational inference and discuss how it can be applied to deep Gaussian process priors.
picture_as_pdf View full article
PP-PLL: Probability Propagation for Partial Label Learning (296)
Kaiwei Sun (Chongqing University of Posts and Telecommunications), Zijian Min (Chongqing University of Posts and Telecommunications)

Partial label learning (PLL) is a weakly supervised learning framework which learns from the data where each example is associated with a set of candidate labels, among which only one is correct. Most existing approaches are based on the disambiguation strategy, which either identifies the valid label iteratively or treats each candidate label equally based on the averaging strategy. In both cases, the disambiguation strategy shares a common shortcoming that the ground-truth label may be overwhelmed by the false positive candidate labels, especially when the number of candidate labels becomes large. In this paper, a probability propagation method for partial label learning (PP-PLL) is proposed. Specifically, based on the manifold assumption, a biconvex regular function is proposed to model the linear mapping relationships between input features and output true labels. In PP-PLL, the topological relations among training samples are used as additional information to strengthen the mutual exclusiveness among candidate labels, which helps to prevent the ground-truth label from being overwhelmed by a large number of candidate labels. Experimental studies on both artificial and real-world data sets demonstrate that the proposed PP-PLL method can achieve superior or comparable performance against the state-of-the-art methods.
picture_as_pdf View full article Reproducible Research
Quantile Layers: Statistical Aggregation in Deep Neural Networks for Eye Movement Biometrics (321)
Ahmed Abdelwahab (Leibniz Institute of Agricultural Engineering and Bioeconomy e.V. (ATB), Potsdam), Niels Landwehr (Leibniz Institute of Agricultural Engineering and Bioeconomy e.V. (ATB), Potsdam)

Human eye gaze patterns are highly individually characteristic. Gaze patterns observed during the routine access of a user to a device or document can therefore be used to identify subjects unobtrusively, that is, without the need to perform an explicit verification such as entering a password. Existing approaches to biometric identification from gaze patterns segment raw gaze data into short, local patterns called saccades and fixations. Subjects are then identified by characterizing the distribution of these patterns or deriving hand-crafted features for them. In this paper, we follow a different approach by training deep neural networks directly on the raw gaze data. As the distribution of short, local patterns has been shown to be particularly informative for distinguishing subjects, we introduce a parameterized and end-to-end learnable statistical aggregation layer, called the quantile layer, that enables the network to explicitly fit the distribution of filter activations in preceding layers. We empirically show that deep neural networks with quantile layers outperform existing probabilistic and feature-based methods for identifying subjects based on eye movements by a large margin.
picture_as_pdf View full article Reproducible Research
Thu, 15:20 - 15:40 @ 0.001 (Poster@Thu) Clustering
Heavy-tailed kernels reveal a finer cluster structure in t-SNE visualisations (327)
Dmitry Kobak (University of Tübingen), George Linderman (Yale University), Stefan Steinerberger (Yale University), Yuval Kluger (Yale University; Yale School of Medicine), Philipp Berens (University of Tübingen)

T-distributed stochastic neighbour embedding (t-SNE) is a widely used data visualisation technique. It differs from its predecessor SNE by the low-dimensional similarity kernel: the Gaussian kernel was replaced by the heavy-tailed Cauchy kernel, solving the `crowding problem' of SNE. Here, we develop an efficient implementation of t-SNE for a t-distribution kernel with an arbitrary degree of freedom ν, with ν→∞ corresponding to SNE and ν=1 corresponding to the standard t-SNE. Using theoretical analysis and toy examples, we show that ν<1 can further reduce the crowding problem and reveal finer cluster structure that is invisible in standard t-SNE. We further demonstrate the striking effect of heavier-tailed kernels on large real-life data sets such as MNIST, single-cell RNA-sequencing data, and the HathiTrust library. We use domain knowledge to confirm that the revealed clusters are meaningful. Overall, we argue that modifying the tail heaviness of the t-SNE kernel can yield additional insight into the cluster structure of the data.
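A small sketch of one common parameterisation of such a heavy-tailed low-dimensional kernel (the exact form and constants in the paper may differ): ν = 1 recovers the Cauchy kernel of standard t-SNE and large ν approaches the Gaussian kernel of SNE.

```python
import numpy as np

def low_dim_similarity(d, nu):
    """Heavy-tailed low-dimensional kernel with nu degrees of freedom, in one
    common parameterisation: nu = 1 gives the Cauchy kernel of standard t-SNE,
    and large nu approaches the Gaussian kernel of SNE."""
    return (1.0 + d**2 / nu) ** (-(nu + 1.0) / 2.0)

d = np.linspace(0.0, 5.0, 6)
for nu in (0.5, 1.0, 100.0):
    print(nu, np.round(low_dim_similarity(d, nu), 4))
# smaller nu => heavier tails, the regime the paper explores for revealing
# finer cluster structure
```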
picture_as_pdf View full article Reproducible Research
Thu, 14:40 - 15:00 @ 0.001 (Poster@Thu) Clustering
Uncovering Hidden Block Structure for Clustering (337)
Luce le Gorrec (Université de Toulouse), Sandrine Mouysset (Université de Toulouse), Philip A. Knight (University of Strathclyde), Iain S. Duff (STFC Rutherford Appleton Laboratory; CERFACS, Toulouse), Daniel Ruiz (Université de Toulouse)

We present a multistage procedure to cluster directed and undirected weighted graphs by finding the block structure of their adjacency matrices. A central part of the process is to scale the adjacency matrix into a doubly-stochastic form, which permits detection of the whole matrix block structure with minimal spectral information (theoretically a single pair of singular vectors suffices). We present the different stages of our method, namely the impact of the doubly-stochastic scaling on singular vectors, detection of the block structure by means of these vectors, and details such as cluster refinement and a stopping criterion. Then we test the algorithm's effectiveness by using it on two unsupervised classification tasks: community detection in networks and shape detection in clouds of points in two dimensions. By comparing results of our approach with those of widely used algorithms designed for specific purposes, we observe that our method is competitive (for community detection) if not superior (for shape detection) in comparison with existing methods.
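For intuition, a minimal sketch of reaching a doubly-stochastic form with the standard Sinkhorn-Knopp iteration (the paper uses its own scaling procedure); the toy matrix is made up.

```python
import numpy as np

def sinkhorn_knopp(A, n_iter=1000, tol=1e-9):
    """Scale a non-negative matrix A to an (approximately) doubly-stochastic
    form D1 @ A @ D2 by alternately normalising columns and rows."""
    r = np.ones(A.shape[0])
    for _ in range(n_iter):
        c = 1.0 / (A.T @ r)
        r_new = 1.0 / (A @ c)
        if np.max(np.abs(r_new - r)) < tol:
            r = r_new
            break
        r = r_new
    return A * np.outer(r, c)

A = np.array([[4.0, 1.0, 0.1],
              [1.0, 3.0, 0.2],
              [0.1, 0.2, 5.0]])
P = sinkhorn_knopp(A)
print(P.sum(axis=0), P.sum(axis=1))  # both close to all-ones
```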
picture_as_pdf View full article Reproducible Research
Safe Policy Improvement with Soft Baseline Bootstrapping (339)
Kimia Nadjahi (Télécom Paris), Romain Laroche (Microsoft Research Montréal), Rémi Tachet des Combes (Microsoft Research Montréal)

Batch Reinforcement Learning (Batch RL) consists in training a policy using trajectories collected with another policy, called the behavioural policy. Safe policy improvement (SPI) provides guarantees with high probability that the trained policy performs better than the behavioural policy, also called the baseline in this setting. Previous work shows that the SPI objective improves mean performance as compared to using the basic RL objective, which boils down to solving the MDP with maximum likelihood. Here, we build on that work and improve more precisely the SPI with Baseline Bootstrapping algorithm (SPIBB) by allowing the policy search over a wider set of policies. Instead of binarily classifying the state-action pairs into two sets (the uncertain and the safe-to-train-on ones), we adopt a softer strategy that controls the error in the value estimates by constraining the policy change according to the local model uncertainty. The method can take more risks on uncertain actions all the while remaining provably safe, and is therefore less conservative than the state-of-the-art methods. We propose two algorithms (one optimal and one approximate) to solve this constrained optimization problem and empirically show a significant improvement over existing SPI algorithms both on finite MDPs and on infinite MDPs with a neural network function approximation.
picture_as_pdf View full article Reproducible Research
NSEEN: Neural Semantic Embedding for Entity Normalization (383)
Shobeir Fakhraei (University of Southern California), Joel Mathew (University of Southern California), José Luis Ambite (University of Southern California)

Much of human knowledge is encoded in text, available in scientific publications, books, and the web. Given the rapid growth of these resources, we need automated methods to extract such knowledge into machine-processable structures, such as knowledge graphs. An important task in this process is entity normalization, which consists of mapping noisy entity mentions in text to canonical entities in well-known reference sets. However, entity normalization is a challenging problem; there often are many textual forms for a canonical entity that may not be captured in the reference set, and entities mentioned in text may include many syntactic variations, or errors. The problem is particularly acute in scientific domains, such as biology. To address this problem, we have developed a general, scalable solution based on a deep Siamese neural network model to embed the semantic information about the entities, as well as their syntactic variations. We use these embeddings for fast mapping of new entities to large reference sets, and empirically show the effectiveness of our framework in challenging bio-entity normalization datasets.
picture_as_pdf View full article Reproducible Research
Tue, 15:00 - 15:20 @ 1.011 (Poster@Wed) Ranking
A Reduction of Label Ranking to Multiclass Classification (385)
Klaus Brinker (Hamm-Lippstadt University of Applied Sciences), Eyke Hüllermeier (Paderborn University)

Label ranking considers the problem of learning a mapping from instances to strict total orders over a predefined set of labels. In this paper, we present a framework for label ranking using a decomposition into a set of multiclass problems. Conceptually, our approach can be seen as a generalization of pairwise preference learning. In contrast to the latter, it allows for controlling the granularity of the decomposition, varying between binary preferences and complete rankings as extreme cases. It is specifically motivated by limitations of pairwise learning with regard to the minimization of certain loss functions. We discuss theoretical properties of the proposed method in terms of accuracy, error correction, and computational complexity. Experimental results are promising and indicate that improvements upon the special case of pairwise preference decomposition are indeed possible.
picture_as_pdf View full article
Tue, 14:20 - 14:40 @ 1.011 (Poster@Wed) Ranking
Learning to Calibrate and Rerank Multi-label Predictions (391)
Cheng Li (Northeastern University), Virgil Pavlu (Northeastern University), Javed Aslam (Northeastern University), Bingyu Wang (Northeastern University), Kechen Qin (Northeastern University)

A multi-label classifier assigns a set of labels to each data object. A natural requirement in many end-use applications is that the classifier also provides a well-calibrated confidence (probability) to indicate the likelihood of the predicted set being correct; for example, an application may automate high-confidence predictions while manually verifying low-confidence predictions. The simplest multi-label classifier, called Binary Relevance (BR), applies one binary classifier to each label independently and takes the product of the individual label probabilities as the overall label-set probability (confidence). Despite its many known drawbacks, such as generating suboptimal predictions and poorly calibrated confidence scores, BR is widely used in practice due to its speed and simplicity. We seek in this work to improve both BR's confidence estimation and prediction through a post calibration and reranking procedure. We take the BR predicted set of labels and its product score as features, extract more features from the prediction itself to capture label constraints, and apply Gradient Boosted Trees (GB) as a calibrator to map these features into a calibrated confidence score. GB not only produces well-calibrated scores (aligned with accuracy and sharp), but also models label interactions, correcting a critical flaw in BR. We further show that reranking label sets by the new calibrated confidence makes accurate set predictions on par with state-of-the-art multi-label classifiers - yet calibrated, simpler, and faster.
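A simplified sketch of the calibration step under stated assumptions: only three illustrative set-level features and randomly generated calibration data stand in for the richer feature set and real Binary Relevance outputs used in the paper.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier

def set_features(label_probs, threshold=0.5):
    """Three illustrative features of a Binary Relevance prediction: the BR
    product score of the thresholded label set, the set size, and the least
    confident in-set decision. (The paper uses a richer feature set.)"""
    pred = label_probs >= threshold
    in_set = np.where(pred, label_probs, 1.0 - label_probs)
    return np.array([np.prod(in_set, axis=1),
                     pred.sum(axis=1),
                     in_set.min(axis=1)]).T

# hypothetical BR probabilities on a calibration split, plus whether the
# thresholded label set was exactly correct (the target the calibrator learns)
rng = np.random.default_rng(0)
probs = rng.uniform(size=(500, 5))
exactly_correct = rng.integers(0, 2, size=500)

calibrator = GradientBoostingClassifier().fit(set_features(probs), exactly_correct)
confidence = calibrator.predict_proba(set_features(probs))[:, 1]
```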
picture_as_pdf View full article Reproducible Research
Tue, 14:00 - 14:20 @ 1.011 (Poster@Wed) Ranking
Pairwise Learning to Rank by Neural Networks Revisited: Reconstruction, Theoretical Analysis and Practical Performance (400)
Marius Köppel (Johannes Gutenberg-Universität Mainz), Alexander Segner (Johannes Gutenberg-Universität Mainz), Martin Wagener (Johannes Gutenberg-Universität Mainz), Lukas Pensel (Johannes Gutenberg-Universität Mainz), Andreas Karwath (University of Birmingham), Stefan Kramer (Johannes Gutenberg-Universität Mainz)

We present a pairwise learning to rank approach based on a neural net, called DirectRanker, that generalizes the RankNet architecture. We show mathematically that our model is reflexive, antisymmetric, and transitive allowing for simplified training and improved performance. Experimental results on the LETOR MSLR-WEB10K, MQ2007 and MQ2008 datasets show that our model outperforms numerous state-of-the-art methods, while being inherently simpler in structure and using a pairwise approach only.
picture_as_pdf View full article Reproducible Research
Practical Open-Loop Optimistic Planning (407)
Edouard Leurent (SequeL team, INRIA Lille - Nord Europe; Renault Group), Odalric-Ambrym Maillard (SequeL team, INRIA Lille - Nord Europe)

We consider the problem of online planning in a Markov Decision Process when given only access to a generative model, restricted to open-loop policies - i.e. sequences of actions - and under budget constraint. In this setting, the Open-Loop Optimistic Planning (OLOP) algorithm enjoys good theoretical guarantees but is overly conservative in practice, as we show in numerical experiments. We propose a modified version of the algorithm with tighter upper-confidence bounds, KL-OLOP, that leads to better practical performances while retaining the sample complexity bound. Finally, we propose an efficient implementation that significantly improves the time complexity of both algorithms.
picture_as_pdf View full article Reproducible Research
Link Prediction via Higher-Order Motif Features (408)
Ghadeer Abuoda (College of Science & Engineering, HBKU, Doha), Gianmarco De Francisci Morales (ISI Foundation, Turin), Ashraf Aboulnaga (Qatar Computing Research Institute, Doha)

Link prediction requires predicting which new links are likely to appear in a graph. In this paper, we present an approach for link prediction that relies on higher-order analysis of the graph topology, well beyond the typical approach which relies on common neighbors. We treat the link prediction problem as a supervised classification problem, and we propose a set of features that depend on the patterns or motifs that a pair of nodes occurs in. By using motifs of sizes 3, 4, and 5, our approach captures a high level of detail about the graph topology. In addition, we propose two optimizations to construct the classification dataset from the graph. First, we propose adding negative examples to the graph as an alternative to the common approach of removing positive ones. Second, we show that it is important to control for the shortest-path distance when sampling pairs of nodes to form negative examples, since the difficulty of prediction varies with the distance. We experimentally demonstrate that using our proposed motif features in off-the-shelf classifiers results in up to 10 percentage points increase in accuracy over prior topology-based and feature-learning methods.
picture_as_pdf View full article Reproducible Research
Multitask Hopfield Networks (418)
Marco Frasca (Università degli Studi di Milano), Giuliano Grossi (Università degli Studi di Milano), Giorgio Valentini (Università degli Studi di Milano)

Multitask algorithms typically use task similarity information as a bias to speed up and improve the performance of learning processes. Tasks are learned jointly, sharing information across them, in order to construct models more accurate than those learned separately over single tasks. In this contribution, we present the first multitask model, to our knowledge, based on Hopfield Networks (HNs), named HoMTask. We show that by appropriately building a unique HN embedding all tasks, a more robust and effective classification model can be learned. HoMTask is a transductive semi-supervised parametric HN that minimizes an energy function extended to all nodes and to all tasks under study. We provide theoretical evidence that the optimal parameters automatically estimated by HoMTask make the model itself coherent with the prior knowledge (connection weights and node labels). The convergence properties of HNs are preserved, and the fixed point reached by the network dynamics gives rise to the prediction of unlabeled nodes. The proposed model improves the classification abilities of single-task HNs on a preliminary benchmark comparison, and achieves competitive performance with state-of-the-art semi-supervised graph-based algorithms.
picture_as_pdf View full article
An Engineered Empirical Bernstein Bound (435)
Mark A. Burgess (Australian National University), Archie C. Chapman (University of Sydney), Paul Scott (Australian National University)

We derive a tightened empirical Bernstein bound (EBB) on the variation of the sample mean from the population mean, and show that it improves the performance of upper confidence bound (UCB) methods in multi-armed bandit problems. Like other EBBs, our EBB is a concentration inequality for the variation of the sample mean in terms of the sample variance. Its derivation uses a combination of probability unions and Chernoff bounds for the mean of samples and mean of sample squares. Analysis reveals that our approach can tighten the best existing EBBs by about a third, and thereby halves the distance to a bound constructed with perfect variance information. We illustrate the practical usefulness of our novel EBB by applying it to a multi-armed bandit problem as a component of a UCB method. Our method outperforms existing approaches by producing lower expected regret than variants of UCB employing several other bounds, including state-of-the-art EBBs.
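For context, a sketch of how an EBB plugs into a UCB index, using the classical Maurer-Pontil bound rather than the tightened bound derived in the paper; delta and the toy rewards are arbitrary.

```python
import math
import numpy as np

def empirical_bernstein_radius(samples, delta=0.05):
    """One widely used EBB (Maurer & Pontil, 2009) for [0, 1]-valued samples:
    with probability at least 1 - delta, the population mean exceeds the sample
    mean by at most this radius. The paper derives a tighter variant."""
    n = len(samples)
    var = np.var(samples, ddof=1)
    log_term = math.log(2.0 / delta)
    return math.sqrt(2.0 * var * log_term / n) + 7.0 * log_term / (3.0 * (n - 1))

def ucb_index(samples, delta=0.05):
    """Optimistic arm index for a UCB-style bandit policy."""
    return float(np.mean(samples)) + empirical_bernstein_radius(samples, delta)

print(ucb_index([0.2, 0.5, 0.4, 0.9, 0.3]))  # made-up rewards for one arm
```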
picture_as_pdf View full article Reproducible Research
Neural Message Passing for Multi-Label Classification (438)
Jack Lanchantin (University of Virginia), Arshdeep Sekhon (University of Virginia), Yanjun Qi (University of Virginia)

Multi-label classification (MLC) is the task of assigning a set of target labels for a given sample. Modeling the combinatorial label interactions in MLC has been a long-haul challenge. We propose Label Message Passing (LaMP) Neural Networks to efficiently model the joint prediction of multiple labels. LaMP treats labels as nodes on a label-interaction graph and computes the hidden representation of each label node conditioned on the input using attention-based neural message passing. Attention enables LaMP to assign different importances to neighbor nodes per label, learning how labels interact (implicitly). The proposed models are simple, accurate, interpretable, structure-agnostic, and applicable for predicting dense labels since LaMP is incredibly parallelizable. We validate the benefits of LaMP on seven real-world MLC datasets, covering a broad spectrum of input/output types and outperforming the state-of-the-art results. Notably, LaMP enables intuitive interpretation of how classifying each label depends on the elements of a sample and at the same time relies on its interaction with other labels.
picture_as_pdf View full article Reproducible Research
CatchCore: Catching Hierarchical Dense Subtensor (451)
Wenjie Feng (CAS Key Laboratory of Network Data Science & Technology, Institute of Computing Technology, University of Chinese Academy of Sciences), Shenghua Liu (CAS Key Laboratory of Network Data Science & Technology, Institute of Computing Technology, University of Chinese Academy of Sciences), Huawei Shen (CAS Key Laboratory of Network Data Science & Technology, Institute of Computing Technology, University of Chinese Academy of Sciences), Xueqi Cheng (CAS Key Laboratory of Network Data Science & Technology, Institute of Computing Technology, University of Chinese Academy of Sciences)

Dense subtensor detection has gained remarkable success in spotting anomalous and fraudulent behaviors in multi-aspect data (i.e., tensors), such as social media and event streams. Existing methods detect the densest subtensors flatly and separately, with an underlying assumption that those subtensors are exclusive. However, many real-world tensors usually present hierarchical properties, e.g., the core-periphery structure or dynamic communities in networks. In this paper, we propose CatchCore, a novel framework to effectively find hierarchical dense subtensors. We first design a unified metric for dense subtensor detection, which can be optimized with gradient-based methods. With the proposed metric, CatchCore detects hierarchical dense subtensors through hierarchy-wise alternating optimization. Finally, we utilize the minimum description length principle to measure the quality of the detection result and select the optimal hierarchical dense subtensors. Extensive experiments on synthetic and real-world datasets demonstrate that CatchCore outperforms the top competitors in accuracy for detecting dense subtensors and anomaly patterns. Additionally, CatchCore successfully identified a hierarchical researcher co-authorship group with intense interactions in the DBLP dataset. Meanwhile, CatchCore also scales linearly with all aspects of tensors.
picture_as_pdf View full article Reproducible Research
Fast and Parallelizable Ranking with Outliers from Pairwise Comparisons (468)
Sungjin Im (University of California), Mahshid Montazer Qaem (University of California)

In this paper, we initiate the study of the problem of ordering objects from their pairwise comparison results when allowed to discard up to a certain number of objects as outliers. More specifically, we seek to find an ordering under the popular Kendall tau distance measure, i.e., minimizing the number of pairwise comparison results that are inconsistent with the ordering, with some outliers removed. The presence of outliers challenges the assumption that a global consistent ordering exists and obscures the measure. This problem does not admit a polynomial time algorithm unless NP ⊆ BPP, and therefore, we develop approximation algorithms with provable guarantees for all inputs. Our algorithms have running time and memory usage that are almost linear in the input size. Further, they are readily adaptable to run on massively parallel platforms such as MapReduce or Spark.
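To illustrate the objective (not the paper's near-linear-time algorithms), a brute-force sketch that counts comparisons inconsistent with an ordering after discarding up to k outliers; only usable on tiny toy instances like the one below.

```python
from itertools import combinations, permutations

def inconsistencies(order, comparisons, outliers=frozenset()):
    """Number of pairwise comparisons (winner, loser) that disagree with the
    ordering, ignoring comparisons that involve a removed outlier."""
    pos = {obj: i for i, obj in enumerate(order)}
    return sum(1 for w, l in comparisons
               if w not in outliers and l not in outliers and pos[w] > pos[l])

def best_order_with_outliers(objects, comparisons, k):
    """Brute force over outlier sets of size <= k and orderings of the rest;
    only feasible on tiny instances."""
    best = None
    for m in range(k + 1):
        for out in combinations(objects, m):
            kept = [o for o in objects if o not in out]
            for order in permutations(kept):
                cost = inconsistencies(order, comparisons, frozenset(out))
                if best is None or cost < best[0]:
                    best = (cost, order, out)
    return best

comps = [("a", "b"), ("b", "c"), ("a", "c"), ("c", "a")]  # "c beats a" is noise
print(best_order_with_outliers(["a", "b", "c"], comps, k=1))
```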
picture_as_pdf View full article Reproducible Research
SoRecGAT: Leveraging Graph Attention Mechanism for Top-N Social Recommendation (475)
Vijaikumar M (Indian Institute of Science), Shirish Shevade (Indian Institute of Science), M. N. Murty (Indian Institute of Science)

Social recommendation systems typically combine extra information like a social network with the user-item interaction network in order to alleviate data sparsity issues. This also helps in making more accurate and personalized recommendations. However, most of the existing systems work under the assumption that all socially connected users have equal influence on each other in a social network, which is not true in practice. Further, estimating the quantum of influence that exists among entities in a user-item interaction network is essential when only implicit ratings are available. This has been ignored even in many recent state-of-the-art models such as SAMN (Social Attentional Memory Network) and DeepSoR (Deep neural network model on Social Relations). Many a time, capturing a complex relationship between the entities (users/items) is essential to boost the performance of a recommendation system. We address these limitations by proposing a novel neural network model, SoRecGAT, which employs multi-head and multi-layer graph attention mechanism. The attention mechanism helps the model learn the influence of entities on each other more accurately. The proposed model also takes care of heterogeneity among the entities seamlessly. SoRecGAT is a general approach and we also validate its suitability when information in the form of a network of co-purchased items is available. Empirical results on eight real-world datasets demonstrate that the proposed model outperforms state-of-the-art models.
picture_as_pdf View full article Reproducible Research
Shift Happens: Adjusting Classifiers (479)
Theodore James Thibault Heiser (University of Tartu), Mari-Liis Allikivi (University of Tartu), Meelis Kull (University of Tartu)

Minimizing expected loss measured by a proper scoring rule, such as Brier score or log-loss (cross-entropy), is a common objective while training a probabilistic classifier. If the data have experienced dataset shift where the class distributions change post-training, then often the model's performance will decrease, over-estimating the probabilities of some classes while under-estimating the others on average. We propose unbounded and bounded general adjustment (UGA and BGA) methods that transform all predictions to (re-)equalize the average prediction and the class distribution. These methods act differently depending on which proper scoring rule is to be minimized, and we have a theoretical guarantee of reducing loss on test data, if the exact class distribution is known. We also demonstrate experimentally that, when in practice the class distribution is known only approximately, there is often still a reduction in loss depending on the amount of shift and the precision to which the class distribution is known.
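For contrast, a sketch of the classical prior-correction baseline (not the paper's UGA/BGA adjusters), which simply re-weights predicted probabilities by the ratio of new to old class priors; the probabilities and priors below are made up.

```python
import numpy as np

def prior_shift_adjust(probs, old_priors, new_priors):
    """Re-weight each predicted class probability by new_prior / old_prior and
    renormalise; probs has shape (n_samples, n_classes)."""
    w = np.asarray(new_priors) / np.asarray(old_priors)
    adjusted = probs * w
    return adjusted / adjusted.sum(axis=1, keepdims=True)

probs = np.array([[0.7, 0.3], [0.4, 0.6]])  # made-up classifier outputs
print(prior_shift_adjust(probs, old_priors=[0.5, 0.5], new_priors=[0.2, 0.8]))
```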
picture_as_pdf View full article Reproducible Research
Stochastic Activation Actor Critic Methods (483)
Wenling Shang (University of Amsterdam-Bosch-Deltalab), Herke van Hoof (University of Amsterdam-Bosch-Deltalab), Max Welling (University of Amsterdam-Bosch-Deltalab)

Stochastic elements in reinforcement learning (RL) have shown promise to improve exploration and handling of uncertainty, such as the utilization of stochastic weights in NoisyNets and stochastic policies in the maximum entropy RL frameworks. Yet effective and general approaches to include such elements in actor-critic models are still lacking. Inspired by the aforementioned techniques, we propose an effective way to inject randomness into actor-critic models to improve general exploratory behavior and reflect environment uncertainty. Specifically, randomness is added at the level of intermediate activations that feed into both policy and value functions to achieve better correlated and more complex perturbations. The proposed framework also features flexibility and simplicity, which allows straightforward adaptation to a variety of tasks. We test several actor-critic models enhanced with stochastic activations and demonstrate their effectiveness in a wide range of Atari 2600 games, a continuous control problem and a car racing task. Lastly, in a qualitative analysis, we present evidence of the proposed model adapting the noise in the policy and value functions to reflect uncertainty and ambiguity in the environment.
picture_as_pdf View full article Reproducible Research
An Algorithm for Reducing the Number of Distinct Branching Conditions in a Decision Forest (486)
Atsuyoshi Nakamura (Hokkaido University), Kento Sakurada (Hokkaido University)

Given a decision forest, we study a problem of reducing the number of its distinct branching conditions without changing each tree's structure while keeping classification performance. A decision forest with a smaller number of distinct branching conditions can not only have a smaller description length but also be implemented by hardware more efficiently. To force the modified decision forest to keep classification performance, we consider a condition that the decision paths at each branching node do not change for 100σ% of the given feature vectors passing through the node for a given 0≤σ<1. Under this condition, we propose an algorithm that minimizes the number of distinct branching conditions by sharing the same condition among multiple branching nodes. According to our experimental results using 13 datasets in UCI machine learning repository, our algorithm succeeded more than 90
picture_as_pdf View full article
Beyond Bag-of-Concepts: Vectors of Locally Aggregated Concepts (489)
Maarten Grootendorst (Jheronimus Academy of Data Science), Joaquin Vanschoren (Eindhoven University of Technology)

Bag-of-Concepts, a model that counts the frequency of clustered word embeddings (i.e., concepts) in a document, has demonstrated the feasibility of leveraging clustered word embeddings to create features for document representation. However, information is lost as the word embeddings themselves are not used in the resulting feature vector. This paper presents a novel text representation method, Vectors of Locally Aggregated Concepts (VLAC). Like Bag-of-Concepts, it clusters word embeddings for its feature generation. However, instead of counting the frequency of clustered word embeddings, VLAC takes each cluster's sum of residuals with respect to its centroid and concatenates those to create a feature vector. The resulting feature vectors contain more discriminative information than Bag-of-Concepts due to the additional inclusion of these first order statistics. The proposed method is tested on four different data sets for single-label classification and compared with several baselines, including TF-IDF and Bag-of-Concepts. Results indicate that when combining features of VLAC with TF-IDF significant improvements in performance were found regardless of which word embeddings were used.
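A compact sketch following the description in the abstract (analogous to VLAD), with scikit-learn's KMeans standing in for the concept clustering and random vectors standing in for pre-trained word embeddings.

```python
import numpy as np
from sklearn.cluster import KMeans

def vlac_features(doc_embeddings, kmeans):
    """Feature vector for one document, following the abstract: for each
    concept (cluster), sum the residuals of the document's word embeddings to
    the cluster centroid, then concatenate the per-cluster sums."""
    assignments = kmeans.predict(doc_embeddings)
    n_clusters, dim = kmeans.cluster_centers_.shape
    feature = np.zeros((n_clusters, dim))
    for vec, c in zip(doc_embeddings, assignments):
        feature[c] += vec - kmeans.cluster_centers_[c]
    return feature.ravel()

rng = np.random.default_rng(0)
vocab = rng.normal(size=(1000, 50))        # stand-in for pre-trained embeddings
km = KMeans(n_clusters=10, n_init=10, random_state=0).fit(vocab)
doc = rng.normal(size=(30, 50))            # embeddings of one document's words
print(vlac_features(doc, km).shape)        # (10 * 50,) = (500,)
```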
picture_as_pdf View full article Reproducible Research
Graph Signal Processing for Directed Graphs based on the Hermitian Laplacian (499)
Satoshi Furutani (NTT Secure Platform Laboratories, Tokyo), Toshiki Shibahara (NTT Secure Platform Laboratories, Tokyo), Mitsuaki Akiyama (NTT Secure Platform Laboratories, Tokyo), Kunio Hato (NTT Secure Platform Laboratories, Tokyo), Masaki Aida (Tokyo Metropolitan University)

Graph signal processing is a useful tool for representing, analyzing, and processing signals lying on a graph, and has attracted attention in several fields including data mining and machine learning. A key component of graph signal processing is the graph Fourier transform, which is defined using eigenvectors of the graph Laplacian of an undirected graph. The orthonormality of these eigenvectors gives the graph Fourier transform algebraically desirable properties, and thus graph signal processing for undirected graphs has been well developed. However, since eigenvectors of the graph Laplacian of a directed graph are generally not orthonormal, it is difficult to simply extend graph signal processing to directed graphs. In this paper, we present a general framework for extending graph signal processing to directed graphs. To this end, we introduce the Hermitian Laplacian, which is a complex matrix obtained from an extension of the graph Laplacian. The Hermitian Laplacian is defined so as to preserve the edge directionality and the Hermitian property, and it enables graph signal processing to be straightforwardly extended to directed graphs. Furthermore, the Hermitian Laplacian guarantees some desirable properties, such as non-negative real eigenvalues and the unitarity of the Fourier transform. Finally, experimental results for representation learning and signal denoising of/on directed graphs show the effectiveness of our framework.
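As a concrete example of the kind of construction involved, here is the well-known 'magnetic' Hermitian Laplacian (the paper's definition may differ in details); the sketch checks the two key properties mentioned above, Hermitian symmetry and a real, non-negative spectrum. The charge parameter q is an arbitrary choice.

```python
import numpy as np

def magnetic_laplacian(A, q=0.25):
    """A well-known Hermitian Laplacian for a directed graph: symmetrise the
    weights and encode edge direction in a complex phase. The result is
    Hermitian, so its eigenvalues are real (and non-negative)."""
    A_sym = (A + A.T) / 2.0
    theta = 2.0 * np.pi * q * (A - A.T)  # direction encoded as a phase
    H = A_sym * np.exp(1j * theta)
    D = np.diag(A_sym.sum(axis=1))
    return D - H

A = np.array([[0.0, 1.0, 0.0],
              [0.0, 0.0, 1.0],
              [1.0, 0.0, 0.0]])            # a directed 3-cycle
L = magnetic_laplacian(A)
print(np.allclose(L, L.conj().T))          # Hermitian
print(np.round(np.linalg.eigvalsh(L), 4))  # real, non-negative spectrum
```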
picture_as_pdf View full article
Incorporating Dependencies in Spectral Kernels for Gaussian Processes (510)
Kai Chen (Shenzhen Institutes of Advanced Technology, Chinese Academy of Sciences; Shenzhen College of Advanced Technology, University of Chinese Academy of Sciences; Radboud University; Shenzhen Engineering Laboratory of Ocean Environmental Big Data Analysis and Application), Twan van Laarhoven (Radboud University; Open University of The Netherlands), Jinsong Chen (Shenzhen Institutes of Advanced Technology, Chinese Academy of Sciences; Shenzhen College of Advanced Technology, University of Chinese Academy of Sciences; Shenzhen Engineering Laboratory of Ocean Environmental Big Data Analysis and Application), Elena Marchiori (Radboud University)

Gaussian processes (GPs) are an elegant Bayesian approach to model an unknown function. The choice of the kernel characterizes one's assumption on how the unknown function autocovaries. It is a core aspect of a GP design, since the posterior distribution can significantly vary for different kernels. The spectral mixture (SM) kernel is derived by modelling a spectral density - the Fourier transform of a kernel - with a linear mixture of Gaussian components. As such, the SM kernel cannot model dependencies between components. In this paper we use cross convolution to model dependencies between components and derive a new kernel called Generalized Convolution Spectral Mixture (GCSM). Experimental analysis of GCSM on synthetic and real-life datasets indicates the benefit of modeling dependencies between components for reducing uncertainty and for improving performance in extrapolation tasks.
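For reference, a sketch of the standard spectral mixture kernel of Wilson and Adams that GCSM generalises; the cross-convolution terms that model dependencies between components are not shown, and the weights, means, and variances are made up.

```python
import numpy as np

def spectral_mixture_kernel(tau, weights, means, variances):
    """Standard SM kernel for 1-D inputs: a mixture of Gaussians in the
    spectral domain transforms back to
    k(tau) = sum_i w_i * exp(-2 pi^2 tau^2 v_i) * cos(2 pi tau mu_i)."""
    tau = np.asarray(tau)[..., None]
    comps = (np.asarray(weights)
             * np.exp(-2.0 * np.pi**2 * tau**2 * np.asarray(variances))
             * np.cos(2.0 * np.pi * tau * np.asarray(means)))
    return comps.sum(axis=-1)

taus = np.linspace(0.0, 3.0, 7)
print(spectral_mixture_kernel(taus, weights=[1.0, 0.5],
                              means=[0.5, 2.0], variances=[0.05, 0.1]))
```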
picture_as_pdf View full article Reproducible Research
Learning Aligned-Spatial Graph Convolutional Networks for Graph Classification (542)
Lu Bai (Central University of Finance and Economics, Beijing, China), Yuhang Jiao (Central University of Finance and Economics, Beijing, China), Lixin Cui (Central University of Finance and Economics, Beijing, China), Edwin R. Hancock (Central University of Finance and Economics, Beijing, China)

In this paper, we develop a novel Aligned-Spatial Graph Convolutional Network (ASGCN) model to learn effective features for graph classification. Our idea is to transform arbitrary-sized graphs into fixed-sized aligned grid structures, and define a new spatial graph convolution operation associated with the grid structures. We show that the proposed ASGCN model not only reduces the problems of information loss and imprecise information representation arising in existing spatially-based Graph Convolutional Network (GCN) models, but also bridges the theoretical gap between traditional Convolutional Neural Network (CNN) models and spatially-based GCN models. Moreover, the proposed ASGCN model can adaptively discriminate the importance between specified vertices during the process of spatial graph convolution, explaining the effectiveness of the proposed model. Experiments on standard graph datasets demonstrate the effectiveness of the proposed model.
picture_as_pdf View full article
Policy Prediction Network: Model-Free Behavior Policy with Model-Based Learning in Continuous Action Space (543)
Zac Wellmer (Hong Kong University of Science and Technology), James T. Kwok (Hong Kong University of Science and Technology)

This paper proposes a novel deep reinforcement learning architecture that was inspired by previous tree-structured architectures which were only usable in discrete action spaces. Policy Prediction Network offers a way to improve sample complexity and performance on continuous control problems in exchange for extra computation at training time but at no cost in computation at rollout time. Our approach integrates a mix between model-free and model-based reinforcement learning. Policy Prediction Network is the first to introduce implicit model-based learning to Policy Gradient algorithms for continuous action spaces and is made possible via the empirically justified clipping scheme. Our experiments are focused on the MuJoCo environments so that they can be compared with similar work done in this area.
picture_as_pdf View full article
Beyond the Selected Completely At Random Assumption for Learning from Positive and Unlabeled Data (544)
Jessa Bekker (KU Leuven), Pieter Robberechts (KU Leuven), Jesse Davis (KU Leuven)

Most positive and unlabeled data is subject to selection biases. The labeled examples can, for example, be selected from the positive set because they are easier to obtain or more obviously positive. This paper investigates how learning can be enabled in this setting. We propose and theoretically analyze an empirical-risk-based method for incorporating the labeling mechanism. Additionally, we investigate under which assumptions learning is possible when the labeling mechanism is not fully understood and propose a practical method to enable this. Our empirical analysis supports the theoretical results and shows that taking into account the possibility of a selection bias, even when the labeling mechanism is unknown, improves the trained classifiers.
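For contrast, a sketch of the classical SCAR-based correction (Elkan and Noto) that this paper moves beyond; the scores and label frequency are made up.

```python
import numpy as np

def scar_correct(label_probs, label_frequency):
    """Elkan-Noto correction: if positives are labeled completely at random
    with probability c = label_frequency, then p(y=1 | x) = p(s=1 | x) / c."""
    return np.clip(np.asarray(label_probs) / label_frequency, 0.0, 1.0)

# a non-traditional classifier's "is labeled" scores and an estimated label
# frequency c (all numbers are made up)
print(scar_correct([0.05, 0.2, 0.4], label_frequency=0.5))
```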
picture_as_pdf View full article Reproducible Research
Fast Gradient Boosting Decision Trees with Bit-Level Data Structures (557)
Laurens Devos (KU Leuven), Wannes Meert (KU Leuven), Jesse Davis (KU Leuven)

A gradient boosting decision tree model is a powerful machine learning method that iteratively constructs decision trees to form an additive ensemble model. The method uses the gradient of the loss function to improve the model at each iteration step. Inspired by the database literature, we exploit bitset and bitslice data structures in order to improve the run time efficiency of learning the trees. We can use these structures in two ways. First, they can represent the input data itself. Second, they can store the discretized gradient values used by the learning algorithm to construct the trees in the boosting model. Using these bit-level data structures reduces the problem of finding the best split, which involves counting of instances and summing gradient values, to counting one-bits in bit strings. Modern CPUs can efficiently count one-bits using AVX2 SIMD instructions. Empirically, our proposed improvements can result in speed-ups of 2 to up to 10 times on datasets with a large number of categorical features without sacrificing predictive performance.
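A tiny sketch of the instance-counting part of this trick under stated assumptions: membership of instances in a node and in a candidate split is packed into bitsets, so the split counts reduce to popcounts (Python's bin(...).count('1') stands in for the AVX2 SIMD instructions); the toy bit patterns are made up, and the bitslice handling of gradient sums is not shown.

```python
def to_bitset(bools):
    """Pack a boolean list into one Python int used as a bitset."""
    bits = 0
    for i, b in enumerate(bools):
        if b:
            bits |= 1 << i
    return bits

def popcount(bits):
    # stands in for the AVX2 SIMD popcount the paper relies on
    return bin(bits).count("1")

# which instances reach the current tree node, and which satisfy a candidate
# condition such as "feature value in {red, blue}" (toy bit patterns)
n = 8
reaches_node = to_bitset([1, 1, 0, 1, 1, 0, 1, 1])
satisfies_split = to_bitset([1, 0, 0, 1, 0, 0, 1, 0])

left_count = popcount(reaches_node & satisfies_split)
right_count = popcount(reaches_node & ~satisfies_split & ((1 << n) - 1))
print(left_count, right_count)  # 3 instances go left, 3 go right
```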
picture_as_pdf View full article Reproducible Research
Maximal Closed Set and Half-Space Separations in Finite Closure Systems (559)
Florian Seiffarth (University of Bonn), Tamás Horváth (University of Bonn; Fraunhofer IAIS, Sankt Augustin; Fraunhofer Center for Machine Learning, Sankt Augustin), Stefan Wrobel (University of Bonn; Fraunhofer IAIS, Sankt Augustin; Fraunhofer Center for Machine Learning, Sankt Augustin)

Motivated by various binary classification problems in structured data (e.g., graphs or other relational and algebraic structures), we investigate some algorithmic properties of closed set and half-space separation in abstract closure systems. Assuming that the underlying closure system is finite and given by the corresponding closure operator, we formulate some negative and positive complexity results for these two separation problems. In particular, we prove that deciding half-space separability in abstract closure systems is NP-complete in general. On the other hand, for the relaxed problem of maximal closed set separation we propose a simple greedy algorithm and show that it is efficient and has the best possible lower bound on the number of closure operator calls. As a second direction to overcome the negative result above, we consider Kakutani closure systems and show first that our greedy algorithm provides an algorithmic characterization of this kind of set systems. As one of the major potential application fields, we then focus on Kakutani closure systems over graphs and generalize a fundamental characterization result based on the Pasch axiom to graph structure partitioning of finite sets. Though the primary focus of this work is on the generality of the results obtained, we experimentally demonstrate the practical usefulness of our approach on vertex classification in different graph datasets.
picture_as_pdf View full article
Assessing the multi-labelness of multi-label data (562)
Laurence A. F. Park (Western Sydney University), Yi Guo (Western Sydney University), Jesse Read (École Polytechnique)

Before constructing a classifier, we should examine the data to gain an understanding of the relationships between the variables, to assist with the design of the classifier. Using multi-label data requires us to examine the association between labels: its multi-labelness. We cannot directly measure association between two labels, since the labels' relationships are confounded with the set of observation variables. A better approach is to fit an analytical model to a label with respect to the observations and remaining labels, but this might present false relationships due to the problem of multicollinearity between the observations and labels. In this article, we examine the utility of regularised logistic regression and a new form of split logistic regression for assessing the multi-labelness of data. We find that a split analytical model using regularisation is able to provide fewer label relationships when no relationships exist, or if the labels can be partitioned. We also find that if label relationships do exist, logistic regression with l_1 regularisation provides the better measurement of multi-labelness.
picture_as_pdf View full article
A Semi-discriminative Approach for Sub-sentence Level Topic Classification on a Small Dataset (566)
Cornelia Ferner (Salzburg University of Applied Sciences), Stefan Wegenkittl (Salzburg University of Applied Sciences)

This paper aims at identifying sequences of words related to specific product components in online product reviews. A reliable baseline performance for this topic classification problem is given by a Max Entropy classifier which assumes independence over subsequent topics. However, the reviews exhibit an inherent structure on the document level, allowing us to frame the task as a sequence classification problem. Since more flexible models from the class of Conditional Random Fields were not competitive because of the limited amount of training data available, we propose using a Hidden Markov Model instead and decouple the training of transition and emission probabilities. The discriminating power of the Max Entropy approach is used for the latter. Besides outperforming both standalone methods as well as more generic models such as linear-chain Conditional Random Fields, the combined classifier is able to assign topics on sub-sentence level although labeling in the training data is only available on sentence level.
picture_as_pdf View full article Reproducible Research
Black Box Explanation by Learning Image Exemplars in the Latent Feature Space (572)
Riccardo Guidotti (ISTI-CNR, Pisa), Anna Monreale (University of Pisa), Stan Matwin (Dalhousie University; Polish Academy of Sciences), Dino Pedreschi (University of Pisa)

We present an approach to explain the decisions of black box models for image classification. While using the black box to label images, our explanation method exploits the latent feature space learned through an adversarial autoencoder. The proposed method first generates exemplar images in the latent feature space and learns a decision tree classifier. Then, it selects and decodes exemplars respecting local decision rules. Finally, it visualizes them in a manner that shows to the user how the exemplars can be modified to either stay within their class, or to become counter-factuals by "morphing" into another class. Since we focus on black box decision systems for image classification, the explanation obtained from the exemplars also provides a saliency map highlighting the areas of the image that contribute to its classification, and areas of the image that push it into another class. We present the results of an experimental evaluation on three datasets and two black box models. Besides providing the most useful and interpretable explanations, we show that the proposed method outperforms existing explainers in terms of fidelity, relevance, coherence, and stability.
picture_as_pdf View full article Reproducible Research
Cost Sensitive Evaluation of Instance Hardness in Machine Learning (574)
Ricardo B. C. Prudêncio (Universidade Federal de Pernambuco)

Measuring the hardness of individual instances in machine learning contributes to a deeper analysis of learning performance. This work proposes instance hardness measures for binary classification in cost-sensitive scenarios. Here cost curves are generated for each instance, defined as the loss observed for a pool of learning models for that instance along the range of cost proportions. Instance hardness is defined as the area under the cost curves and can be seen as an expected loss, and hence difficulty, along the range of cost proportions. Different cost curves were proposed by considering common decision threshold choice methods in the literature, thus providing alternative views of instance hardness.
picture_as_pdf View full article
Meta-Learning for Black-box Optimization (576)
Vishnu TV (TCS Research, New Delhi), Pankaj Malhotra (TCS Research, New Delhi), Jyoti Narwariya (TCS Research, New Delhi), Lovekesh Vig (TCS Research, New Delhi), Gautam Shroff (TCS Research, New Delhi)

Recently, neural networks trained as optimizers under the "learning to learn" or meta-learning framework have been shown to be effective for a broad range of optimization tasks including derivative-free black-box function optimization. Recurrent neural networks (RNNs) trained to optimize a diverse set of synthetic non-convex differentiable functions via gradient descent have been effective at optimizing derivative-free black-box functions. In this work, we propose RNN-Opt: an approach for learning RNN-based optimizers for optimizing real-parameter single-objective continuous functions under limited budget constraints. Existing approaches utilize an observed improvement based meta-learning loss function for training such models. We propose training RNN-Opt on synthetic non-convex functions with known (approximate) optimal values, directly using discounted regret as the meta-learning loss function. We hypothesize that a regret-based loss function mimics typical testing scenarios, and would therefore lead to better optimizers compared to optimizers trained only to propose queries that improve over previous queries. Further, RNN-Opt incorporates simple yet effective enhancements during training and inference procedures to deal with the following practical challenges: (i) unknown range of possible values for the black-box function to be optimized, and (ii) practical and domain-knowledge based constraints on the input parameters. We demonstrate the efficacy of RNN-Opt in comparison to existing methods on several synthetic as well as standard benchmark black-box functions along with an anonymized industrial constrained optimization problem.
picture_as_pdf View full article
Robust Anomaly Detection in Images using Adversarial Autoencoders (581)
Laura Beggel (Bosch Center for Artificial Intelligence, Renningen; Ludwig-Maximilians-University Munich), Michael Pfeiffer (Bosch Center for Artificial Intelligence, Renningen), Bernd Bischl (Ludwig-Maximilians-University Munich)

Reliably detecting anomalies in a given set of images is a task of high practical relevance for visual quality inspection, surveillance, or medical image analysis. Autoencoder neural networks learn to reconstruct normal images, and hence can classify as anomalies those images for which the reconstruction error exceeds some threshold. Here we analyze a fundamental problem of this approach when the training set is contaminated with a small fraction of outliers. We find that continued training of autoencoders inevitably reduces the reconstruction error of outliers, and hence degrades the anomaly detection performance. In order to counteract this effect, an adversarial autoencoder architecture is adapted, which imposes a prior distribution on the latent representation, typically placing anomalies into low-likelihood regions. Utilizing the likelihood model, potential anomalies can be identified and rejected already during training, which results in an anomaly detector that is significantly more robust to the presence of outliers during training.
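A minimal sketch of the reconstruction-error criterion this approach starts from, with a PCA model standing in for the autoencoder and a clean training set; the paper's point is precisely that this simple recipe degrades once the training set itself contains outliers, and its adversarial autoencoder is not reproduced here.

import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(1)
X_train = rng.normal(0, 1, size=(500, 20))              # flattened "normal" images only
X_test = np.vstack([rng.normal(0, 1, size=(50, 20)),
                    rng.normal(4, 1, size=(5, 20))])     # last 5 rows are anomalies

model = PCA(n_components=5).fit(X_train)                 # stand-in for the autoencoder

train_recon = model.inverse_transform(model.transform(X_train))
train_errors = np.mean((X_train - train_recon) ** 2, axis=1)
threshold = np.quantile(train_errors, 0.95)              # error threshold from clean data

recon = model.inverse_transform(model.transform(X_test))
errors = np.mean((X_test - recon) ** 2, axis=1)          # per-sample reconstruction error
print("flagged as anomalous:", np.where(errors > threshold)[0])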
picture_as_pdf View full article
Attentive Multi-Task Deep Reinforcement Learning (582)
Timo Bräm (ETH Zurich), Gino Brunner (ETH Zurich)

Sharing knowledge between tasks is vital for efficient learning in a multi-task setting. However, most research so far has focused on the easier case where knowledge transfer is not harmful, i.e., where knowledge from one task cannot negatively impact the performance on another task. In contrast, we present an approach to multi-task deep reinforcement learning based on attention that does not require any a-priori assumptions about the relationships between tasks. Our attention network automatically groups task knowledge into sub-networks on a state level granularity. It thereby achieves positive knowledge transfer if possible, and avoids negative transfer in cases where tasks interfere. We test our algorithm against two state-of-the-art multi-task/transfer learning approaches and show comparable or superior performance while requiring fewer network parameters.
picture_as_pdf View full article Reproducible Research
Non-parametric Bayesian Isotonic Calibration: Fighting Over-confidence in Binary Classification (587)
Mari-Liis Allikivi (University of Tartu), Meelis Kull (University of Tartu)

Classifiers can often output a score or a probability indicating how sure they are about the predicted class. Classifier calibration methods can map these into calibrated class probabilities, supporting cost-optimal decision making. Isotonic calibration is the standard non-parametric calibration method for binary classifiers, and it can be shown to yield the most likely monotonic calibration map on the given data, where monotonicity means that instances with higher predicted scores are more likely to be positive. Another non-parametric method is ENIR (ensemble of near-isotonic regression models) which allows for some non-monotonicity, but adds a penalty for it. We first demonstrate that these two methods tend to be over-confident and show that applying label smoothing improves calibration of both methods in more than 90% of studied cases. Unfortunately, label smoothing reduces confidence on the under-confident predictions also, and it does not reduce the raggedness of isotonic calibration. As the main contribution we propose a non-parametric Bayesian isotonic calibration method which has the flexibility of isotonic calibration to fit maps of all monotonic shapes but it adds smoothness and reduces over-confidence without requiring label smoothing. The method introduces a prior over piecewise linear monotonic calibration maps and uses a simple Monte Carlo sampling based approach to approximate the posterior mean calibration map. Our experiments demonstrate that on average the proposed method results in better calibrated probabilities than the state-of-the-art calibration methods, including isotonic calibration and ENIR.
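For reference, the standard isotonic calibration baseline discussed above can be fitted in a few lines with scikit-learn on a synthetic calibration set; the proposed Bayesian method itself is not shown.

import numpy as np
from sklearn.isotonic import IsotonicRegression

rng = np.random.default_rng(2)
scores = rng.uniform(size=1000)                               # classifier scores on a calibration set
labels = (rng.uniform(size=1000) < scores ** 2).astype(int)   # true P(y=1|s) = s^2, so raw scores are miscalibrated

# Isotonic calibration: fit a monotone map from scores to empirical probabilities.
iso = IsotonicRegression(out_of_bounds="clip")
iso.fit(scores, labels)

new_scores = np.array([0.1, 0.5, 0.9])
print("calibrated probabilities:", iso.predict(new_scores))   # roughly s^2: about 0.01, 0.25, 0.81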
picture_as_pdf View full article Reproducible Research
A Stochastic Quasi-Newton Method with Nesterov's Accelerated Gradient (594)
S. Indrapriyadarsini (Shizuoka University), Shahrzad Mahboubi (Shonan Institute of Technology), Hiroshi Ninomiya (Shonan Institute of Technology), Hideki Asai (Shizuoka University)

Incorporating second order curvature information into gradient based methods has been shown to improve convergence drastically despite its computational intensity. In this paper, we propose a stochastic (online) quasi-Newton method with Nesterov's accelerated gradient in both its full and limited memory forms for solving large scale non-convex optimization problems in neural networks. The performance of the proposed algorithm is evaluated in Tensorflow on benchmark classification and regression problems. The results show improved performance compared to the classical second order oBFGS and oLBFGS methods and popular first order stochastic methods such as SGD and Adam. The performance with different momentum rates and batch sizes has also been illustrated.
picture_as_pdf View full article
Shrinkage Estimators for Uplift Regression (595)
Krzysztof Rudaś (Warsaw University of Technology; Institute of Computer Science, Polish Academy of Sciences), Szymon Jaroszewicz (Institute of Computer Science, Polish Academy of Sciences)

Uplift modeling is an approach to machine learning which allows for predicting the net effect of an action (with respect to not taking the action). To achieve this, the training population is divided into two parts: the treatment group, which is subjected to the action, and the control group, on which the action is not taken. Our task is to construct a model which will predict the difference between outcomes in the treatment and control groups conditional on individual objects' features. When the group assignment is random, the model admits a causal interpretation. When we assume linear responses in both groups, the simplest way of estimating the net effect of the action on an individual is to build two separate linear ordinary least squares (OLS) regressions on the treatment and control groups and compute the difference between their predictions. In classical linear models improvements in accuracy can be achieved through the use of so-called shrinkage estimators such as the well known James-Stein estimator, which has a provably lower mean squared error than the OLS estimator. In this paper we investigate the use of shrinkage estimators in the uplift modeling problem. Unfortunately, direct generalization of the James-Stein estimator does not lead to improved predictions, nor does shrinking treatment and control models separately. Therefore, we propose a new uplift shrinkage method where estimators in the treatment and control groups are shrunk jointly so as to minimize the error in the predicted net effect of the action. We prove that the proposed estimator does indeed improve on the double regression estimator.
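The double regression baseline referred to above can be stated directly in code; the sketch below uses synthetic data and scikit-learn OLS, and does not implement the proposed joint shrinkage estimator.

import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(3)
n, d = 2000, 5
X = rng.normal(size=(n, d))
treated = rng.uniform(size=n) < 0.5                        # random group assignment
beta, tau = rng.normal(size=d), np.array([1.0, 0.0, -0.5, 0.0, 0.0])
y = X @ beta + treated * (X @ tau) + rng.normal(size=n)    # treatment shifts the response by X @ tau

# Double regression: fit separate OLS models on treatment and control,
# and predict the uplift as the difference of their predictions.
m_t = LinearRegression().fit(X[treated], y[treated])
m_c = LinearRegression().fit(X[~treated], y[~treated])
uplift_pred = m_t.predict(X) - m_c.predict(X)

print("correlation with true uplift:", np.corrcoef(uplift_pred, X @ tau)[0, 1])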
picture_as_pdf View full article
Training Discrete-Valued Neural Networks with Sign Activations Using Weight Distributions (596)
Wolfgang Roth (Graz University of Technology), Günther Schindler (Ruprecht Karls University, Heidelberg), Holger Fröning (Ruprecht Karls University, Heidelberg), Franz Pernkopf (Graz University of Technology)

Since resource-constrained devices hardly benefit from the trend towards ever-increasing neural network (NN) structures, there is growing interest in designing more hardware-friendly NNs. In this paper, we consider the training of NNs with discrete-valued weights and sign activation functions that can be implemented more efficiently in terms of inference speed, memory requirements, and power consumption. We build on the framework of probabilistic forward propagations using the local reparameterization trick, where instead of training a single set of NN weights we rather train a distribution over these weights. Using this approach, we can perform gradient-based learning by optimizing the continuous distribution parameters over discrete weights while at the same time performing backpropagation through the sign activation. In our experiments, we investigate the influence of the number of weights on the classification performance on several benchmark datasets, and we show that our method achieves state-of-the-art performance.
picture_as_pdf View full article Reproducible Research
node2bits: Compact Time- and Attribute-aware Node Representations for User Stitching (597)
Di Jin (University of Michigan), Mark Heimann (University of Michigan), Ryan A. Rossi (Adobe Research), Danai Koutra (University of Michigan)

Identity stitching, the task of identifying and matching various online references (e.g., sessions over different devices and timespans) to the same user in real-world web services, is crucial for personalization and recommendations. However, traditional user stitching approaches, such as grouping or blocking, require quadratic pairwise comparisons between a massive number of user activities, thus posing both computational and storage challenges. Recent works, which are often application-specific, heuristically seek to reduce the amount of comparisons, but they suffer from low precision and recall. To solve the problem in an application-independent way, we take a heterogeneous network-based approach in which users (nodes) interact with content (e.g., sessions, websites), and may have attributes (e.g., location). We propose node2bits, an efficient framework that represents multi-dimensional features of node contexts with binary hashcodes. node2bits leverages feature-based temporal walks to encapsulate short- and long-term interactions between nodes in heterogeneous web networks, and adopts SimHash to obtain compact, binary representations and avoid the quadratic complexity for similarity search. Extensive experiments on large-scale real networks show that node2bits outperforms traditional techniques and existing works that generate real-valued embeddings by up to 5.16
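The SimHash step can be illustrated generically: sign random projections of a feature vector to obtain a compact binary code in which similar inputs share most bits. The feature matrix below is synthetic, and node2bits' temporal-walk feature construction is not reproduced.

import numpy as np

def simhash(features, n_bits=64, seed=0):
    """Map each row of a real-valued feature matrix to an n_bits binary code
    by signing random projections; similar rows tend to share many bits."""
    rng = np.random.default_rng(seed)
    planes = rng.normal(size=(features.shape[1], n_bits))   # random hyperplanes
    return (features @ planes > 0).astype(np.uint8)

rng = np.random.default_rng(4)
base = rng.normal(size=(1, 16))
nodes = np.vstack([base + 0.05 * rng.normal(size=(1, 16)),  # near-duplicate of base
                   base,
                   rng.normal(size=(1, 16))])               # unrelated node
codes = simhash(nodes)
hamming = (codes[0] != codes[1]).sum(), (codes[0] != codes[2]).sum()
print("Hamming distance (similar, dissimilar):", hamming)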
picture_as_pdf View full article Reproducible Research
A Soft Affiliation Graph Model for Scalable Overlapping Community Detection (618)
Nishma Laitonjam (Insight Centre for Data Analytics, University College Dublin), Wěipéng Huáng (Insight Centre for Data Analytics, University College Dublin), Neil J. Hurley (Insight Centre for Data Analytics, University College Dublin)

We propose an overlapping community model based on the Affiliation Graph Model (AGM), that exhibits the pluralistic homophily property that the probability of a link between nodes increases with increasing number of shared communities. We take inspiration from the Mixed Membership Stochastic Blockmodel (MMSB), in proposing an edgewise community affiliation. This allows decoupling of community affiliations between nodes, opening the way to scalable inference. We show that our model corresponds to an AGM with soft community affiliations and develop a scalable algorithm based on a Stochastic Gradient Riemannian Langevin Dynamics (SGRLD) sampler. Empirical results show that the model can scale to network sizes that are beyond the capabilities of MCMC samplers of the standard AGM. We achieve comparable performance in terms of accuracy and run-time efficiency to scalable MMSB samplers.
picture_as_pdf View full article Reproducible Research
Synthetic Oversampling of Multi-Label Data based on Local Label Distribution (624)
Bin Liu (Aristotle University of Thessaloniki), Grigorios Tsoumakas (Aristotle University of Thessaloniki)

Class-imbalance is an inherent characteristic of multi-label data which affects the prediction accuracy of most multi-label learning methods. One efficient strategy to deal with this problem is to employ resampling techniques before training the classifier. Existing multi-label sampling methods alleviate the (global) imbalance of multi-label datasets. However, performance degradation is mainly due to rare sub-concepts and overlapping of classes that could be analysed by looking at the local characteristics of the minority examples, rather than the imbalance of the whole dataset. We propose a new method for synthetic oversampling of multi-label data that focuses on local label distribution to generate more diverse and better labeled instances. Experimental results on 13 multi-label datasets demonstrate the effectiveness of the proposed approach in a variety of evaluation measures, particularly in the case of an ensemble of classifiers trained on repeated samples of the original data.
picture_as_pdf View full article Reproducible Research
Trade-offs in Large-Scale Distributed Tuplewise Estimation and Learning (628)
Robin Vogel (Telecom Paris, LTCI, Institut Polytechnique de Paris; IDEMIA), Aurélien Bellet (INRIA), Stephan Clémençon (Telecom Paris, LTCI, Institut Polytechnique de Paris), Ons Jelassi (Telecom Paris, LTCI, Institut Polytechnique de Paris), Guillaume Papa (Telecom Paris, LTCI, Institut Polytechnique de Paris)

The development of cluster computing frameworks has allowed practitioners to scale out various statistical estimation and machine learning algorithms with minimal programming effort. This is especially true for machine learning problems whose objective function is nicely separable across individual data points, such as classification and regression. In contrast, statistical learning tasks involving pairs (or more generally tuples) of data points - such as metric learning, clustering or ranking - do not lend themselves as easily to data-parallelism and in-memory computing. In this paper, we investigate how to balance between statistical performance and computational efficiency in such distributed tuplewise statistical problems. We first propose a simple strategy based on occasionally repartitioning data across workers between parallel computation stages, where the number of repartitioning steps rules the trade-off between accuracy and runtime. We then present some theoretical results highlighting the benefits brought by the proposed method in terms of variance reduction, and extend our results to design distributed stochastic gradient descent algorithms for tuplewise empirical risk minimization. Our results are supported by numerical experiments in pairwise statistical estimation and learning on synthetic and real-world datasets.
picture_as_pdf View full article Reproducible Research
Fast likelihood-based change point detection (634)
Nikolaj Tatti (University of Helsinki)

Change point detection plays a fundamental role in many real-world applications, where the goal is to analyze and monitor the behaviour of a data stream. In this paper, we study change detection in binary streams. To this end, we use a likelihood ratio between two models as a measure for indicating change. The first model is a single Bernoulli variable while the second model divides the stored data in two segments, and models each segment with its own Bernoulli variable. Finding the optimal split can be done in O(n) time, where n is the number of entries since the last change point. This is too expensive for large n. To combat this we propose an approximation scheme that yields a (1 - ϵ) approximation in O(ϵ^-1 log^2 n) time. The speed-up consists of several steps: First we reduce the number of possible candidates by adopting a known result from segmentation problems. We then show that for fixed Bernoulli parameters we can find the optimal change point in logarithmic time. Finally, we show how to construct a candidate list of size O(ϵ^-1 log n) for model parameters. We demonstrate empirically the approximation quality and the running time of our algorithm, showing that we can gain a significant speed-up with a minimal average loss in optimality.
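The exact O(n) scan that the paper accelerates can be written down directly: compare the maximised log-likelihood of a single Bernoulli model with the best split into two Bernoulli segments. This sketch covers only that exact baseline, not the proposed approximation scheme.

import numpy as np

def bernoulli_loglik(ones, total):
    """Maximised Bernoulli log-likelihood of a segment with `ones` ones out of `total`."""
    if total == 0 or ones == 0 or ones == total:
        return 0.0
    p = ones / total
    return ones * np.log(p) + (total - ones) * np.log(1 - p)

def best_split_llr(stream):
    """Scan all split points of a binary stream and return the largest
    log-likelihood ratio of a two-segment model over a single-segment model."""
    stream = np.asarray(stream)
    n, total_ones = len(stream), int(stream.sum())
    single = bernoulli_loglik(total_ones, n)
    best, best_i = -np.inf, None
    ones_left = 0
    for i in range(1, n):                  # split: stream[:i] vs stream[i:]
        ones_left += stream[i - 1]
        two = (bernoulli_loglik(ones_left, i)
               + bernoulli_loglik(total_ones - ones_left, n - i))
        if two - single > best:
            best, best_i = two - single, i
    return best, best_i

stream = np.r_[np.random.default_rng(5).binomial(1, 0.2, 200),
               np.random.default_rng(6).binomial(1, 0.7, 200)]
print(best_split_llr(stream))              # large ratio, split near index 200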
picture_as_pdf View full article Reproducible Research
Hyper-Parameter-Free Generative Modelling with Deep Boltzmann Trees (637)
Nico Piatkowski (TU Dortmund)

Deep neural networks achieve state-of-the-art results in various classification and synthetic data generation tasks. However, only little is known about why depth improves a model. We investigate the structure of stochastic deep neural networks, also known as Deep Boltzmann Machines, to shed some light on this issue. While the best known results postulate an exponential dependence between the number of visible units and the depth of the model, we show that the required depth is upper bounded by the longest path in the underlying junction tree, which is at most linear in the number of visible units. Moreover, we show that the conditional independence structure of any categorical Deep Boltzmann Machine contains a sub-tree that allows the consistent estimation of the full joint probability mass function of all visible units. We connect our results to l_1-regularized maximum-likelihood estimation and Chow-Liu trees. Based on our theoretical findings, we present a new tractable version of Deep Boltzmann Machines, namely the Deep Boltzmann Tree (DBT). We provide a hyper-parameter-free algorithm for learning the DBT from data, and propose a new initialization method to enforce convergence to good solutions. Our findings provide some theoretical evidence for why a deep model might be beneficial. Experimental results on benchmark data show that the DBT is a theoretically sound alternative to likelihood-free generative models.
picture_as_pdf View full article
Deep convolutional Gaussian processes (645)
Kenneth Blomqvist (Aalto University; Helsinki Institute for Information Technology HIIT), Samuel Kaski (Aalto University; Helsinki Institute for Information Technology HIIT), Markus Heinonen (Aalto University; Helsinki Institute for Information Technology HIIT)

We propose deep convolutional Gaussian processes, a deep Gaussian process architecture with convolutional structure. The model is a principled Bayesian framework for detecting hierarchical combinations of local features for image classification. We demonstrate greatly improved image classification performance compared to current convolutional Gaussian process approaches on the MNIST and CIFAR-10 datasets. In particular, we improve state-of-the-art CIFAR-10 accuracy by over 10 percentage points.
picture_as_pdf View full article Reproducible Research
Stochastic One-Sided Full-Information Bandit (646)
Haoyu Zhao (Tsinghua University), Wei Chen (Microsoft Research, Beijing)

In this paper, we study the stochastic version of the one-sided full information bandit problem, where we have K arms [K] = {1, 2, ..., K}, and playing arm i would gain reward from an unknown distribution for arm i while obtaining reward feedback for all arms j > i. One-sided full information bandit can model the online repeated second-price auctions, where the auctioneer could select the reserved price in each round and the bidders only reveal their bids when their bids are higher than the reserved price. In this paper, we present an elimination-based algorithm to solve the problem. Our elimination based algorithm achieves a distribution-independent regret upper bound O(√(T·log(TK))) and a distribution-dependent bound O((log T + log K)·f(Δ)), where T is the time horizon, Δ is a vector of gaps between the mean reward of arms and the mean reward of the best arm, and f(Δ) is a formula depending on the gap vector that we will specify in detail. Our algorithm has the best theoretical regret upper bound so far. We also validate our algorithm empirically against other possible alternatives.
picture_as_pdf View full article
Sets of Robust Rules, and How to Find Them (650)
Jonas Fischer (Max Planck Institute for Informatics; Saarland University), Jilles Vreeken (CISPA Helmholtz Center for Information Security)

Association rules are among the most important concepts in data mining. Rules of the form X -> Y are simple to understand, simple to act upon, yet can model important local dependencies in data. The problem is, however, that there are so many of them. Both traditional and state-of-the-art frameworks typically yield millions of rules, rather than identifying a small set of rules that capture the most important dependencies of the data. In this paper, we define the problem of association rule mining in terms of the Minimum Description Length principle. That is, we identify the best set of rules as the one that most succinctly describes the data. We show that the resulting optimization problem does not lend itself to exact search, and hence propose Grab, a greedy heuristic to efficiently discover good sets of noise-resistant rules directly from data. Through extensive experiments we show that, unlike the state of the art, Grab does reliably recover the ground truth. On real world data we show it finds reasonable numbers of rules that, upon close inspection, give clear insight into the local distribution of the data.
picture_as_pdf View full article Reproducible Research
A Drift-based Dynamic Ensemble Members Selection using Clustering for Time Series Forecasting (662)
Amal Saadallah (TU Dortmund), Florian Priebe (TU Dortmund), Katharina Morik (TU Dortmund)

The complex and evolving nature of time series structure makes forecasting one of the most important and challenging tasks in time series analysis. Typical methods for forecasting are designed to model time-evolving dependencies between data observations. However, it is generally accepted that none of them is universally valid for every application. Therefore, methods for learning heterogeneous ensembles by combining a diverse set of forecasts together appear as a promising solution to tackle this task. Hitherto, in the classical ML literature, ensemble techniques such as stacking, cascading and voting are mostly restricted to operate in a static manner. To deal with changes in the relative performance of models as well as changes in the data distribution, we propose a drift-aware meta-learning approach for adaptively selecting and combining forecasting models. Our assumption is that different forecasting models have different areas of expertise and a varying relative performance. Our method ensures dynamic selection of initial ensemble base model candidates through a performance drift detection mechanism. Since diversity is a fundamental component in ensemble methods, we propose a second stage selection with clustering that is computed after each drift detection. Predictions of the final selected models are combined into a single prediction. An exhaustive empirical testing of the method was performed, evaluating both generalization error and scalability of the approach using time series from several real world domains. Empirical results show the competitiveness of the method in comparison to state-of-the-art approaches for combining forecasters.
picture_as_pdf View full article Reproducible Research
Node Classification for Signed Social Networks Using Diffuse Interface Methods (665)
Pedro Mercado (University of Tübingen), Jessica Bosch (University of British Columbia), Martin Stoll (Technische Universität Chemnitz)

Signed networks contain both positive and negative kinds of interactions like friendship and enmity. The task of node classification in non-signed graphs has proven to be beneficial in many real world applications, yet extensions to signed networks remain largely unexplored. In this paper we introduce the first analysis of node classification in signed social networks via diffuse interface methods based on the Ginzburg-Landau functional together with different extensions of the graph Laplacian to signed networks. We show that blending the information from both positive and negative interactions leads to performance improvement in real signed social networks, consistently outperforming the current state of the art.
picture_as_pdf View full article Reproducible Research
Thu, 14:00 - 14:20 @ 0.001 (Poster@Thu) Clustering
Holistic Assessment of Structure Discovery Capabilities of Clustering Algorithms (678)
Frank Höppner (Ostfalia University of Applied Sciences), Maximilian Jahnke (Ostfalia University of Applied Sciences)

Existing cluster validity indices often possess a similar bias as the clustering algorithm they were introduced for, e.g. to determine the optimal number of clusters. We suggest an efficient and holistic assessment of the structure discovery capabilities of clustering algorithms based on three criteria. We determine the robustness or stability of cluster assignments and interpret it as the confidence of the clustering algorithm in its result. This information is then used to label the data and evaluate the consistency of the stability-assessment with the notion of a cluster as an area of dense and separated data. The resulting criteria of stability, structure and consistency provide interpretable means to judge the capabilities of clustering algorithms without the typical biases of prominent indices, including the judgment of a clustering tendency.
picture_as_pdf View full article Reproducible Research
Pattern-Based Anomaly Detection in Mixed-Type Time Series (681)
Len Feremans (University of Antwerp), Vincent Vercruyssen (KU Leuven), Boris Cule (University of Antwerp), Wannes Meert (KU Leuven), Bart Goethals (University of Antwerp; Monash University)

The present-day accessibility of technology enables easy logging of both sensor values and event logs over extended periods. In this context, detecting abnormal segments in time series data has become an important data mining task. Existing work on anomaly detection focuses either on continuous time series or discrete event logs and not on the combination. However, in many practical applications, the patterns extracted from the event log can reveal contextual and operational conditions of a device that must be taken into account when predicting anomalies in the continuous time series. This paper proposes an anomaly detection method that can handle mixed-type time series. The method leverages frequent pattern mining techniques to construct an embedding of mixed-type time series on which an isolation forest is trained. Experiments on several real-world univariate and multivariate time series, as well as a synthetic mixed-type time series, show that our anomaly detection algorithm outperforms state-of-the-art anomaly detection techniques such as MatrixProfile, Pav, Mifpod and Fpof.
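The last step of such a pipeline, an isolation forest trained on a window embedding, might look as follows; here simple summary statistics stand in for the frequent-pattern features of the paper, and the mixed-type stream is synthetic.

import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(7)
# toy mixed-type stream: a continuous sensor plus a binary event indicator
sensor = np.sin(np.linspace(0, 60, 3000)) + 0.1 * rng.normal(size=3000)
events = (rng.uniform(size=3000) < 0.01).astype(float)
sensor[1500:1550] += 3.0                                    # injected anomalous segment

# embed fixed-length windows with simple summary features
# (the paper instead uses frequent-pattern occurrences as features)
win = 50
starts = np.arange(0, 3000 - win, win)
embed = np.array([[sensor[i:i+win].mean(), sensor[i:i+win].std(),
                   events[i:i+win].sum()] for i in starts])

forest = IsolationForest(random_state=0).fit(embed)
scores = forest.score_samples(embed)                        # lower = more anomalous
print("most anomalous window starts at index", starts[np.argmin(scores)])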
picture_as_pdf View full article Reproducible Research
Sobolev Training with Approximated Derivatives for Black-Box Function Regression with Neural Networks (725)
Matthias Kissel (Technical University of Munich), Klaus Diepold (Technical University of Munich)

With Sobolev Training, neural networks are trained to fit target output values as well as target derivatives with respect to the inputs. This leads to better generalization and fewer required training examples for certain problems. In this paper, we present a training pipeline that enables Sobolev Training for regression problems where target derivatives are not directly available. Thus, we propose to use a least-squares estimate of the target derivatives based on function values of neighboring training samples. We show for a variety of black-box function regression tasks that our training pipeline achieves smaller test errors compared to the traditional training method. Since our method has no additional requirements on the data collection process, it has great potential to improve the results for various regression tasks.
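The derivative estimate can be sketched as a local least-squares fit: for each training point, regress the function values of its nearest neighbours on their offsets and take the slope as the target gradient. The neighbourhood size and the toy target function below are arbitrary choices, and the subsequent Sobolev training of the network is omitted.

import numpy as np

def estimate_gradients(X, y, k=10):
    """Least-squares estimate of the gradient of y = f(X) at every sample,
    using the k nearest neighbours of each point to fit a local affine model."""
    n, d = X.shape
    grads = np.zeros((n, d))
    for i in range(n):
        dists = np.linalg.norm(X - X[i], axis=1)
        nbrs = np.argsort(dists)[1:k + 1]                   # skip the point itself
        A = np.hstack([X[nbrs] - X[i], np.ones((k, 1))])    # affine model around X[i]
        coef, *_ = np.linalg.lstsq(A, y[nbrs] - y[i], rcond=None)
        grads[i] = coef[:d]                                 # slope = gradient estimate
    return grads

rng = np.random.default_rng(8)
X = rng.uniform(-1, 1, size=(200, 2))
y = X[:, 0] ** 2 + 3 * X[:, 1]                              # true gradient: (2*x0, 3)
g = estimate_gradients(X, y)
print("mean estimated d/dx1:", g[:, 1].mean())              # close to 3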
picture_as_pdf View full article Reproducible Research
Tue, 10:00 - 10:30 @ 0.004 (AOK-HS) (Poster@Wed) Best ML Paper
Agnostic feature selection (744)
Guillaume Doquet (TAU CNRS - INRIA - LRI - Université Paris-Saclay), Michèle Sebag (TAU CNRS - INRIA - LRI - Université Paris-Saclay)

Unsupervised feature selection is mostly assessed along a supervised learning setting, depending on whether the selected features efficiently permit predicting the (unknown) target variable. Another setting is proposed in this paper: the selected features aim to efficiently recover the whole dataset. The proposed algorithm, called AgnoS, combines an AutoEncoder with structural regularizations to sidestep the combinatorial optimization problem at the core of feature selection. The extensive experimental validation of AgnoS on the scikit-feature benchmark suite demonstrates its ability compared to the state of the art, both in terms of supervised learning and data compression.
picture_as_pdf View full article
BelMan: An Information-Geometric Approach to Stochastic Bandits (783)
Debabrota Basu (Chalmers University of Technology), Pierre Senellart (DI ENS, ENS, CNRS, PSL University; INRIA), Stéphane Bressan (National University of Singapore)

We propose a Bayesian information-geometric approach to the exploration-exploitation trade-off in stochastic multi-armed bandits. The uncertainty on reward generation and belief is represented using the manifold of joint distributions of rewards and beliefs. Accumulated information is summarised by the barycentre of joint distributions, the pseudobelief-reward. While the pseudobelief-reward facilitates information accumulation through exploration, another mechanism is needed to increase exploitation by gradually focusing on higher rewards, the pseudobelief-focal-reward. Our resulting algorithm, BelMan, alternates between projection of the pseudobelief-focal-reward onto belief-reward distributions to choose the arm to play, and projection of the updated belief-reward distributions onto the pseudobelief-focal-reward. We theoretically prove BelMan to be asymptotically optimal and to incur a sublinear regret growth. We instantiate BelMan to stochastic bandits with Bernoulli and exponential rewards, and to a real-life application of scheduling queueing bandits. Comparative evaluation with the state of the art shows that BelMan is not only competitive for Bernoulli bandits but in many cases also outperforms other approaches for exponential and queueing bandits.
picture_as_pdf View full article Reproducible Research
L_0-ARM: Network Sparsification via Stochastic Binary Optimization (785)
Yang Li (Georgia State University), Shihao Ji (Georgia State University)

We consider network sparsification as an L_0-norm regularized binary optimization problem, where each unit of a neural network (e.g., weight, neuron, or channel, etc.) is attached with a stochastic binary gate, whose parameters are jointly optimized with original network parameters. The Augment-Reinforce-Merge (ARM), a recently proposed unbiased gradient estimator, is investigated for this binary optimization problem. Compared to the hard concrete gradient estimator from Louizos et al., ARM demonstrates superior performance of pruning network architectures while retaining almost the same accuracies of baseline methods. Similar to the hard concrete estimator, ARM also enables conditional computation during model training but with improved effectiveness due to the exact binary stochasticity. Thanks to the flexibility of ARM, many smooth or non-smooth parametric functions, such as scaled sigmoid or hard sigmoid, can be used to parameterize this binary optimization problem and the unbiasedness of the ARM estimator is retained, while the hard concrete estimator has to rely on the hard sigmoid function to achieve conditional computation and thus accelerated training. Extensive experiments on multiple public datasets demonstrate state-of-the-art pruning rates with almost the same accuracies of baseline methods. The resulting algorithm L_0-ARM sparsifies the Wide-ResNet models on CIFAR-10 and CIFAR-100 while the hard concrete estimator cannot.
picture_as_pdf View full article
Learning with Random Learning Rates (805)
Léonard Blier (TAU, LRI, Inria, Université Paris Sud; Facebook AI Research), Pierre Wolinski (TAU, LRI, Inria, Université Paris Sud), Yann Ollivier (Facebook AI Research)

In neural network optimization, the learning rate of the gradient descent strongly affects performance. This prevents reliable out-of-the-box training of a model on a new problem. We propose the All Learning Rates At Once (Alrao) algorithm for deep learning architectures: each neuron or unit in the network gets its own learning rate, randomly sampled at startup from a distribution spanning several orders of magnitude. The network becomes a mixture of slow and fast learning units. Surprisingly, Alrao performs close to SGD with an optimally tuned learning rate, for various tasks and network architectures. In our experiments, all Alrao runs were able to learn well without any tuning.
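The sampling idea fits in a few lines: draw one learning rate per unit, log-uniformly over several orders of magnitude, at startup. The NumPy sketch below shows a per-unit SGD step for a single weight matrix and is not the authors' full Alrao implementation.

import numpy as np

rng = np.random.default_rng(9)
fan_in, fan_out = 64, 32
W = rng.normal(scale=fan_in ** -0.5, size=(fan_in, fan_out))

# one learning rate per output unit, log-uniform over several orders of magnitude
log_lr = rng.uniform(np.log(1e-5), np.log(1e1), size=fan_out)
unit_lr = np.exp(log_lr)                       # fixed at startup, never tuned

def sgd_step(W, grad_W):
    """Per-unit SGD update: column j of W (one unit) uses its own learning rate."""
    return W - grad_W * unit_lr[np.newaxis, :]

grad_W = rng.normal(size=W.shape)              # placeholder gradient
W = sgd_step(W, grad_W)
print("learning-rate range:", unit_lr.min(), unit_lr.max())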
picture_as_pdf View full article Reproducible Research
Triangle Completion Time Prediction using Time-conserving Embedding (806)
Vachik S. Dave (Indiana University Purdue University Indianapolis), Mohammad Al Hasan (Indiana University Purdue University Indianapolis)

A triangle is an important building block of social networks, so the study of triangle formation in a network is critical for better understanding of the dynamics of such networks. Existing works in this area mainly focus on triangle counting, or generating synthetic networks by matching the prevalence of triangles in real-life networks. While these efforts increase our understanding of the triangle's role in a network, they have limited practical utility. In this work we undertake an interesting problem relating to triangle formation in a network, which is to predict the time by which the third link of a triangle appears in a network. Since the third link completes a triangle, we name this task Triangle Completion Time Prediction (TCTP). A solution to the TCTP problem is valuable for real-life link recommendation in social/e-commerce networks, and it also provides vital information for dynamic network analysis and community generation study. An efficient and robust framework (GraNiTE) is proposed for solving the TCTP problem. GraNiTE uses a neural network based approach for learning a representation vector of a triangle-completing edge, which is a concatenation of two representation vectors: the first is learnt from the graphlet-based local topology around that edge and the second is learnt from a time-preserving embedding of the constituting vertices of that edge. A comparison of the proposed solution with several baseline methods shows that the mean absolute error (MAE) of the proposed method is at least one-fourth of that of the best baseline method.
picture_as_pdf View full article Reproducible Research
Thu, 14:20 - 14:40 @ 0.001 (Poster@Thu) Clustering
k is the Magic Number - Inferring the Number of Clusters Through Nonparametric Concentration Inequalities (811)
Sibylle Hess (Technische Universiteit Eindhoven), Wouter Duivesteijn (Technische Universiteit Eindhoven)

Most convex and nonconvex clustering algorithms come with one crucial parameter: the k in k-means. To this day, there is not one generally accepted way to accurately determine this parameter. Popular methods are simple yet theoretically unfounded, such as searching for an elbow in the curve of a given cost measure. In contrast, statistically founded methods often make strict assumptions over the data distribution or come with their own optimization scheme for the clustering objective. This limits either the set of applicable datasets or clustering algorithms. In this paper, we strive to determine the number of clusters by answering a simple question: given two clusters, is it likely that they jointly stem from a single distribution? To this end, we propose a bound on the probability that two clusters originate from the distribution of the unified cluster, specified only by the sample mean and variance. Our method is applicable as a simple wrapper to the result of any clustering method minimizing the objective of k-means, which includes Gaussian mixtures and Spectral Clustering. We focus in our experimental evaluation on an application for nonconvex clustering and demonstrate the suitability of our theoretical results. Our SpecialK clustering algorithm automatically determines the appropriate value for k, without requiring any data transformation or projection, and without assumptions on the data distribution. Additionally, it is capable of deciding that the data consists of only a single cluster, which many existing algorithms cannot.
picture_as_pdf View full article Reproducible Research
Tue, 15:20 - 15:40 @ 1.011 (Poster@Wed) Ranking
Sequential Learning over Implicit Feedback for Robust Large-Scale Recommender Systems (820)
Aleksandra Burashnikova (Skolkovo Institute of Science and Technology), Yury Maximov (Theoretical Division T-5 and CNLS, Los Alamos National Laboratory), Massih-Reza Amini (Université Grenoble Alpes)

In this paper, we propose a theoretically founded sequential strategy for training large-scale Recommender Systems (RS) over implicit feedback, mainly in the form of clicks. The proposed approach consists in minimizing pairwise ranking loss over blocks of consecutive items constituted by a sequence of non-clicked items followed by a clicked one for each user. Parameter updates are discarded if for a given user the number of sequential blocks is below or above some given thresholds estimated over the distribution of the number of blocks in the training set. This is to prevent updating the parameters for an abnormally high number of clicks over some targeted items, mainly due to bots, or for very few user interactions. Both scenarios affect the decision of RS and imply a shift over the distribution of items that are shown to the users. We provide a proof of convergence of the algorithm to the minimizer of the ranking loss, in the case where the latter is convex. Furthermore, experimental results on five large-scale collections demonstrate the efficiency of the proposed algorithm compared to state-of-the-art approaches, both regarding different ranking measures and computation time.
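The block construction described above can be sketched in plain Python: split a user's chronological interactions into runs of non-clicked items terminated by a clicked one, and skip users whose number of blocks falls outside given thresholds. The thresholds and item identifiers below are arbitrary, and the pairwise ranking update itself is not shown.

def extract_blocks(interactions, min_blocks=2, max_blocks=50):
    """Split one user's chronological interactions [(item, clicked), ...] into
    blocks of consecutive non-clicked items followed by one clicked item.
    Users with too few or too many blocks are discarded (bots / inactive users)."""
    blocks, current = [], []
    for item, clicked in interactions:
        current.append(item)
        if clicked:
            blocks.append(current)
            current = []
    if not (min_blocks <= len(blocks) <= max_blocks):
        return []             # parameter updates for this user would be skipped
    return blocks

history = [("a", 0), ("b", 0), ("c", 1), ("d", 0), ("e", 1), ("f", 0)]
print(extract_blocks(history))   # [['a', 'b', 'c'], ['d', 'e']]; trailing 'f' has no click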
picture_as_pdf View full article Reproducible Research
Generating Black-Box Adversarial Examples for Text Classifiers Using a Deep Reinforced Model (852)
Prashanth Vijayaraghavan (MIT Media Lab), Deb Roy (MIT Media Lab)

Recently, generating adversarial examples has become an important means of measuring robustness of a deep learning model. Adversarial examples help us identify the susceptibilities of the model and further counter those vulnerabilities by applying adversarial training techniques. In natural language domain, small perturbations in the form of misspellings or paraphrases can drastically change the semantics of the text. We propose a reinforcement learning based approach towards generating adversarial examples in black-box settings. We demonstrate that our method is able to fool well-trained models for (a) IMDB sentiment classification task and (b) AG's news corpus news categorization task with significantly high success rates. We find that the adversarial examples generated are semantics-preserving perturbations to the original text.
picture_as_pdf View full article
Thu, 10:00 - 10:30 @ 0.004 (AOK-HS) (Poster@Thu) Best DM Paper
FastPoint: Scalable Deep Point Processes (861)
Ali Caner Türkmen (Boğaziçi University), Yuyang Wang (Amazon Research), Alexander J. Smola (Amazon Research)

We propose FastPoint, a novel multivariate point process that enables fast and accurate learning and inference. FastPoint uses deep recurrent neural networks to capture complex temporal dependency patterns among different marks, while self-excitation dynamics within each mark are modeled with Hawkes processes. This results in substantially more efficient learning and scales to millions of correlated marks with superior predictive accuracy. Our construction also allows for efficient and parallel sequential Monte Carlo sampling for fast predictive inference. FastPoint outperforms baseline methods in prediction tasks on synthetic and real-world high-dimensional event data at a small fraction of the computational cost.
picture_as_pdf View full article
From abstract items to latent spaces to observed data and back: Compositional Variational Auto-Encoder (869)
Victor Berger (TAU, CNRS - INRIA - LRI - Univ. Paris-Saclay), Michele Sebag (TAU, CNRS - INRIA - LRI - Univ. Paris-Saclay)

Conditional Generative Models are now acknowledged as an essential tool in Machine Learning. This paper focuses on their control. While many approaches aim at disentangling the data through the coordinate-wise control of their latent representations, another direction is explored in this paper. The proposed CompVAE handles data with a natural multi-ensemblist structure (i.e. that can naturally be decomposed into elements). Derived from Bayesian variational principles, CompVAE learns a latent representation leveraging both observational and symbolic information. A first contribution of the approach is that this latent representation supports a compositional generative model, amenable to multi-ensemblist operations (addition or subtraction of elements in the composition). This compositional ability is enabled by the invariance and generality of the whole framework w.r.t., respectively, the order and number of the elements. The second contribution of the paper is a proof of concept on synthetic 1D and 2D problems, demonstrating the efficiency of the proposed approach.
picture_as_pdf View full article Reproducible Research
Single-Path NAS: Designing Hardware-Efficient ConvNets in less than 4 Hours (880)
Dimitrios Stamoulis (Carnegie Mellon University), Ruizhou Ding (Carnegie Mellon University), Di Wang (Microsoft), Dimitrios Lymberopoulos (Microsoft), Bodhi Priyantha (Microsoft), Jie Liu (Harbin Institute of Technology), Diana Marculescu (Carnegie Mellon University)

Can we automatically design a Convolutional Network (ConvNet) with the highest image classification accuracy under the latency constraint of a mobile device? Neural architecture search (NAS) has revolutionized the design of hardware-efficient ConvNets by automating this process. However, the NAS problem remains challenging due to the combinatorially large design space, causing a significant searching time (at least 200 GPU-hours). To alleviate this complexity, we propose Single-Path NAS, a novel differentiable NAS method for designing hardware-efficient ConvNets in less than 4 hours. Our contributions are as follows: 1. Single-path search space: Compared to previous differentiable NAS methods, Single-Path NAS uses one single-path over-parameterized ConvNet to encode all architectural decisions with shared convolutional kernel parameters, hence drastically decreasing the number of trainable parameters and the search cost down to few epochs. 2. Hardware-efficient ImageNet classification: Single-Path NAS achieves 74.96
picture_as_pdf View full article Reproducible Research
Bayesian Generalized Horseshoe Estimation of Generalized Linear Models (902)
Daniel F. Schmidt (Monash University; University of Melbourne), Enes Makalic (University of Melbourne)

Bayesian global-local shrinkage estimation with the generalized horseshoe prior represents the state-of-the-art for Gaussian regression models. The extension to non-Gaussian data, such as binary or Student-t regression, is usually done by exploiting a scale-mixture-of-normals approach. However, many standard distributions, such as the gamma and the Poisson, do not admit such a representation. We contribute two extensions to global-local shrinkage methodology. The first is an adaptation of recent auxiliary gradient-based sampling schemes to the global-local shrinkage framework, which yields simple algorithms for sampling from generalized linear models. We also introduce two new samplers for the hyperparameters in the generalized horseshoe model, one based on an inverse-gamma mixture of inverse-gamma distributions, and the second a rejection sampler. Results show that these new samplers are highly competitive with the no U-turn sampler for small numbers of predictors, and potentially perform better for larger numbers of predictors. Results for hyperparameter sampling show our new inverse-gamma inverse-gamma based sampling scheme outperforms the standard sampler based on a gamma mixture of gamma distributions.
picture_as_pdf View full article Reproducible Research
Learning to Signal in the Goldilocks Zone: Improving Adversary Compliance in Security Games (923)
Sarah Cooney (University of Southern California), Kai Wang (University of Southern California), Elizabeth Bondi (University of Southern California), Thanh Nguyen (University of Oregon), Phebe Vayanos (University of Southern California), Hailey Winetrobe (University of Southern California), Edward A. Cranford (Carnegie Mellon University), Cleotilde Gonzalez (Carnegie Mellon University), Christian Lebiere (Carnegie Mellon University), Milind Tambe (University of Southern California)

Many real-world security scenarios can be modeled via a game-theoretic framework known as a security game in which there is a defender trying to protect potential targets from an attacker. Recent work in security games has shown that deceptive signaling by the defender can convince an attacker to withdraw his attack. For instance, a warning message to commuters indicating speed enforcement is in progress ahead might lead to them driving more slowly, even if it turns out no enforcement is in progress. However, the results of this work are limited by the unrealistic assumption that the attackers will behave with perfect rationality, meaning they always choose an action that gives them the best expected reward. We address the problem of training boundedly rational (human) attackers to comply with signals via repeated interaction with signaling without incurring a loss to the defender, and offer the four following contributions: (i) We learn new decision tree and neural network-based models of attacker compliance with signaling. (ii) Based on these machine learning models of a boundedly rational attacker's response to signaling, we develop a theory of signaling in the Goldilocks zone, a balance of signaling and deception that increases attacker compliance and improves defender utility. (iii) We present game-theoretic algorithms to solve for signaling schemes based on the learned models of attacker compliance with signaling. (iv) We conduct extensive human subject experiments using an online game. The game simulates the scenario of an inside attacker trying to steal sensitive information from company computers, and results show that our algorithms based on learned models of attacker behavior lead to better attacker compliance and improved defender utility compared to the state-of-the-art algorithm for rational attackers with signaling.
picture_as_pdf View full article
Fine-Grained Explanations using Markov Logic (937)
Khan Mohammad Al Farabi (University of Memphis), Somdeb Sarkhel (Adobe Research), Sanorita Dey (University of Illinois at Urbana-Champaign), Deepak Venugopal (University of Memphis)

Explaining the results of machine learning algorithms is crucial given the rapid growth and potential applicability of these methods in critical domains including healthcare, defense, autonomous driving, etc. In this paper, we address this problem in the context of Markov Logic Networks (MLNs) which are highly expressive statistical relational models that combine first-order logic with probabilistic graphical models. MLNs in general are known to be interpretable models, i.e., MLNs can be understood more easily by humans as compared to models learned by approaches such as deep learning. However, at the same time, it is not straightforward to obtain human-understandable explanations specific to an observed inference result (e.g. marginal probability estimate). This is because the MLN provides a lifted interpretation, one that generalizes to all possible worlds/instantiations, which are not query/evidence specific. In this paper, we extract grounded explanations, i.e., explanations defined w.r.t. specific inference queries and observed evidence. We extract these explanations from importance weights defined over the MLN formulas that encode the contribution of formulas towards the final inference results. We validate our approach in real world problems related to analyzing reviews from Yelp, and show through user-studies that our explanations are richer than state-of-the-art non-relational explainers such as LIME.
picture_as_pdf View full article
Generative Adversarial Networks for Failure Prediction (23)
Shuai Zheng (Industrial AI Lab, Hitachi America Ltd), Ahmed Farahat (Industrial AI Lab, Hitachi America Ltd), Chetan Gupta (Industrial AI Lab, Hitachi America Ltd)

Prognostics and Health Management (PHM) is an emerging engineering discipline which is concerned with the analysis and prediction of equipment health and performance. One of the key challenges in PHM is to accurately predict impending failures in the equipment. In recent years, solutions for failure prediction have evolved from building complex physical models to the use of machine learning algorithms that leverage the data generated by the equipment. However, failure prediction problems pose a set of unique challenges that make direct application of traditional classification and prediction algorithms impractical. These challenges include the highly imbalanced training data, the extremely high cost of collecting more failure samples, and the complexity of the failure patterns. Traditional oversampling techniques will not be able to capture such complexity and accordingly result in overfitting the training data. This paper addresses these challenges by proposing a novel algorithm for failure prediction using Generative Adversarial Networks (GAN-FP). GAN-FP first utilizes two GAN networks to simultaneously generate training samples and build an inference network that can be used to predict failures for new samples. GAN-FP first adopts an infoGAN to generate realistic failure and non-failure samples, and initialize the weights of the first few layers of the inference network. The inference network is then tuned by optimizing a weighted loss objective using only real failure and non-failure samples. The inference network is further tuned using a second GAN whose purpose is to guarantee the consistency between the generated samples and corresponding labels. GAN-FP can be used for other imbalanced classification problems as well. Empirical evaluation on several benchmark datasets demonstrates that GAN-FP significantly outperforms existing approaches, including under-sampling, SMOTE, ADASYN, weighted loss, and infoGAN augmented training.
picture_as_pdf View full article
Scalable Bid Landscape Forecasting in Real-time Bidding (123)
Aritra Ghosh (University of Massachusetts, Amherst), Saayan Mitra (Adobe Research), Somdeb Sarkhel (Adobe Research), Jason Xie (Adobe Advertising Cloud), Gang Wu (Adobe Research), Viswanathan Swaminathan (Adobe Research)

In programmatic advertising, ad slots are usually sold using second-price (SP) auctions in real-time. The highest bidding advertiser wins but pays only the second highest bid (known as the winning price). In SP, for a single item, the dominant strategy of each bidder is to bid the true value from the bidder's perspective. However, in a practical setting, with budget constraints, bidding the true value is a sub-optimal strategy. Hence, to devise an optimal bidding strategy, it is of utmost importance to learn the winning price distribution accurately. Moreover, a demand-side platform (DSP), which bids on behalf of advertisers, observes the winning price if it wins the auction. For losing auctions, DSPs can only treat their bidding price as a lower bound for the unknown winning price. In the literature, typically censored regression is used to model such partially observed data. A common assumption in censored regression is that the winning price is drawn from a fixed variance (homoscedastic) uni-modal distribution (most often Gaussian). However, in reality, these assumptions are often violated. We relax these assumptions and propose a heteroscedastic fully parametric censored regression approach, as well as a mixture density censored network. Our approach not only generalizes censored regression but also provides flexibility to model arbitrarily distributed real-world data. Experimental evaluation on the publicly available dataset for winning price estimation demonstrates the effectiveness of our method. Furthermore, we evaluate our algorithm on one of the largest demand-side platforms, and significant improvement has been achieved in comparison with the baseline solutions.
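A standard censored Gaussian likelihood of the kind the paper generalises can be written directly: won auctions contribute the density of the observed winning price, lost auctions contribute the probability that the winning price exceeds the bid. The sketch keeps the homoscedastic Gaussian assumption that the paper relaxes, and the data below is synthetic.

import numpy as np
from scipy.stats import norm

def censored_nll(mu, sigma, price, won, bid):
    """Negative log-likelihood of winning-price observations under censoring.

    mu, sigma : predicted mean and std of the winning price
    price     : observed winning price (only meaningful where won is True)
    won       : boolean array, True if we won the auction (price observed)
    bid       : our bid, a lower bound on the winning price for lost auctions
    """
    ll_won = norm.logpdf(price[won], loc=mu[won], scale=sigma)
    ll_lost = norm.logsf(bid[~won], loc=mu[~won], scale=sigma)   # log P(winning price > bid)
    return -(ll_won.sum() + ll_lost.sum())

rng = np.random.default_rng(10)
true_price = rng.normal(5.0, 1.0, size=1000)
bid = rng.uniform(2.0, 8.0, size=1000)
won = bid > true_price
price = np.where(won, true_price, np.nan)       # winning price unobserved when we lose

mus = np.linspace(3, 7, 41)
nlls = [censored_nll(np.full(1000, m), 1.0, price, won, bid) for m in mus]
print("best mean estimate:", mus[int(np.argmin(nlls))])   # close to the true mean 5.0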
picture_as_pdf View full article
Interpreting atypical conditions in systems with deep conditional Autoencoders: the case of electrical consumption (175)
Antoine Marot (Reseau Transport Electricite R&D), Antoine Rosin (Reseau Transport Electricite R&D), Laure Crochepierre (Reseau Transport Electricite R&D; Universite de Lorraine), Benjamin Donnot (Reseau Transport Electricite R&D), Pierre Pinson (DTU Technical University of Denmark), Lydia Boudjeloud-Assala (Universite de Lorraine)

In this paper, we propose a new method to iteratively and interactively characterize new feature conditions for signals of daily French electrical consumption from our historical database, relying on Conditional Variational Autoencoders. An autoencoder first learns a compressed similarity-based representation of the signals in a latent space, in which one can select and extract well-represented expert features. Then, we successfully condition the model on the set of extracted features, as opposed to simple target labels as previously done, to learn conditionally independent new residual latent representations. Unknown, or previously unselected, factors such as atypical conditions now appear well represented and can be detected and further interpreted by experts. By applying the method, we recover the appropriate known expert features and eventually discover, through adapted representations, atypical known and unknown conditions such as holidays, fuzzy non-working days and weather events, which were actually related to important events that influenced consumption.
picture_as_pdf View full article Reproducible Research
Manufacturing Dispatching using Reinforcement and Transfer Learning (225)
Shuai Zheng (Industrial AI Lab, Hitachi America Ltd), Chetan Gupta (Industrial AI Lab, Hitachi America Ltd), Susumu Serita (Industrial AI Lab, Hitachi America Ltd)

Efficient dispatching rules in the manufacturing industry are key to ensuring on-time product delivery and minimizing past-due and inventory costs. Manufacturing, especially in the developed world, is moving towards on-demand manufacturing, meaning a high-mix, low-volume product mix. This requires efficient dispatching that can work in dynamic and stochastic environments, allowing quick responses to newly received orders and working across a disparate set of shop floor settings. In this paper we address this problem of dispatching in manufacturing. Using reinforcement learning (RL), we propose a new design that formulates the shop floor state as a 2-D matrix, incorporates job slack time into the state representation, and designs lateness and tardiness reward functions for dispatching purposes. However, maintaining a separate RL model for each production line on a manufacturing shop floor is costly and often infeasible. To address this, we enhance our deep RL model with an approach for dispatching policy transfer. This increases policy generalization and saves time and cost for model training and data collection. Experiments show that: (1) our approach performs best in terms of total discounted reward and average lateness and tardiness, and (2) the proposed policy transfer approach reduces training time and increases policy generalization.
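Lateness and tardiness are the standard scheduling quantities here: lateness may be negative for an early job, while tardiness is clipped at zero. A minimal, hypothetical reward built from them (the weights and exact shaping in the paper may differ) could be:

```python
def job_reward(completion_time, due_date, w_lateness=0.5, w_tardiness=0.5):
    """Standard scheduling quantities: lateness can be negative (early job),
    tardiness is clipped at zero. Reward is their negative weighted sum."""
    lateness = completion_time - due_date
    tardiness = max(0.0, lateness)
    return -(w_lateness * lateness + w_tardiness * tardiness)

print(job_reward(completion_time=12.0, due_date=10.0))  # late job -> negative reward
print(job_reward(completion_time=8.0, due_date=10.0))   # early job -> positive reward in this sketch
```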
picture_as_pdf View full article
Automatic Recognition of Student Engagement using Deep Learning and Facial Expression (241)
Omid Mohamad Nezami (Macquarie University; CSIRO's Data61), Mark Dras (Macquarie University), Len Hamey (Macquarie University), Deborah Richards (Macquarie University), Stephen Wan (CSIRO's Data61), Cécile Paris (CSIRO's Data61)

Engagement is a key indicator of the quality of learning experience, and one that plays a major role in developing intelligent educational interfaces. Any such interface requires the ability to recognise the level of engagement in order to respond appropriately; however, there is very little existing data to learn from, and new data is expensive and difficult to acquire. This paper presents a deep learning model to improve engagement recognition from images that overcomes the data sparsity challenge by pre-training on readily available basic facial expression data, before training on specialised engagement data. In the first of two steps, a facial expression recognition model is trained to provide a rich face representation using deep learning. In the second step, we use the model's weights to initialize our deep learning based model to recognize engagement; we term this the engagement model. We train the model on our new engagement recognition dataset with 4627 engaged and disengaged samples. We find that the engagement model outperforms effective deep learning architectures that we apply for the first time to engagement recognition, as well as approaches using histogram of oriented gradients and support vector machines.
picture_as_pdf View full article Reproducible Research
Augmenting Semantic Representation of Depressive Language: from Forums to Microblogs (259)
Nawshad Farruque (University of Alberta), Osmar Zaiane (University of Alberta), Randy Goebel (University of Alberta)

We discuss and analyze the process of creating word embedding feature representations specifically designed for a learning task when annotated data is scarce, such as depressive language detection from Tweets. We start from a rich word embedding pre-trained on a general dataset, then enhance it with an embedding learned from a domain-specific but much smaller dataset. Our strengthened representation better portrays the domain of depression we are interested in, as it combines the semantics learned from the specific domain with the word coverage of the general language. We present a comparative analysis of our word embedding representations against a simple bag-of-words model, a well-known sentiment lexicon, a psycholinguistic lexicon, and a general pre-trained word embedding, based on their efficacy in accurately identifying depressive Tweets. We show that our representations achieve a significantly better F1 score than the others when applied to a high-quality dataset.
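One simple way to realize such an enhancement (purely illustrative; the paper's exact combination scheme may differ) is to blend the general and domain-specific vectors for words covered by both vocabularies and fall back to the general vector otherwise:

```python
import numpy as np

def blend_embeddings(general, domain, alpha=0.5):
    """Combine a large general embedding with a small depression-domain one.
    Words present in both vocabularies get a convex combination; other
    words keep the general vector. Both arguments map word -> numpy vector."""
    blended = {}
    for word, vec in general.items():
        if word in domain:
            blended[word] = alpha * vec + (1 - alpha) * domain[word]
        else:
            blended[word] = vec
    return blended

general = {"sad": np.array([0.1, 0.9]), "tree": np.array([0.8, 0.2])}
domain = {"sad": np.array([0.4, 0.6])}
print(blend_embeddings(general, domain)["sad"])
```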
picture_as_pdf View full article Reproducible Research
An aggregate learning approach for interpretable semi-supervised population prediction and disaggregation using ancillary data (284)
Guillaume Derval (ICTEAM, UCLouvain), Frédéric Docquier (IRES, UCLouvain), Pierre Schaus (ICTEAM, UCLouvain)

Census data provide detailed information about population characteristics at a coarse resolution. Nevertheless, fine-grained, high-resolution mappings of population counts are increasingly needed to characterize population dynamics and to assess the consequences of climate shocks, natural disasters, investments in infrastructure, development policies, etc. Disaggregating these censuses is a complex machine learning problem, and multiple solutions have been proposed in past research. We propose in this paper to view the problem in the context of the aggregate learning paradigm, where the output value is not known for individual training points but only for aggregates of points (i.e., in this context, for regions of pixels where a census is available). We demonstrate with a very simple and interpretable model that this method is on par with, and on some metrics even outperforms, the state of the art, despite its simplicity.
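The aggregate-learning setup can be illustrated by a loss that scores pixel-level predictions only through their per-region sums against the census counts. The region ids, counts and squared-error form below are a hypothetical sketch, not the authors' model:

```python
import numpy as np

def aggregate_loss(pixel_pred, region_id, census_count):
    """Pixel-level counts are unobserved; the loss compares the sum of
    predictions inside each census region with that region's census count."""
    loss = 0.0
    for r, count in census_count.items():
        pred_total = pixel_pred[region_id == r].sum()
        loss += (pred_total - count) ** 2
    return loss / len(census_count)

pixel_pred = np.array([10.0, 12.0, 3.0, 5.0])   # predicted people per pixel
region_id = np.array([0, 0, 1, 1])              # census region of each pixel
census_count = {0: 25.0, 1: 7.0}                # observed count per region
print(aggregate_loss(pixel_pred, region_id, census_count))
```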
picture_as_pdf View full article Reproducible Research
Augmenting Physiological Time Series Data: A Case Study for Sleep Apnea Detection (293)
Konstantinos Nikolaidis (University of Oslo), Stein Kristiansen (University of Oslo), Vera Goebel (University of Oslo), Thomas Plagemann (University of Oslo), Knut Liestøl (University of Oslo), Mohan Kankanhalli (National University of Singapore)

Supervised machine learning applications in the health domain often face the problem of insufficient training datasets. The quantity of labelled data is small due to privacy concerns and the cost of data acquisition and labelling by a medical expert. Furthermore, it is quite common that collected data are unbalanced, and getting enough data to personalize models for individuals is very expensive or even infeasible. This paper addresses these problems by (1) designing a recurrent Generative Adversarial Network to generate realistic synthetic data and to augment the original dataset, (2) enabling the generation of balanced datasets based on a heavily unbalanced dataset, and (3) controlling the data generation in such a way that the generated data resemble data from specific individuals. We apply these solutions to sleep apnea detection and study in the evaluation the performance of four well-known techniques, i.e., K-Nearest Neighbour, Random Forest, Multi-Layer Perceptron, and Support Vector Machine. In the experiments, all classifiers exhibit a consistent increase in sensitivity and a kappa statistic increase of between 0.72·10^-2 and 18.2·10^-2.
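The kappa statistic referred to here is Cohen's kappa, i.e., classifier agreement with the labels corrected for chance agreement. A minimal sketch of the statistic (scikit-learn's cohen_kappa_score computes the same quantity) is:

```python
import numpy as np

def cohen_kappa(y_true, y_pred):
    """Cohen's kappa: observed agreement corrected for chance agreement."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    classes = np.unique(np.concatenate([y_true, y_pred]))
    p_obs = np.mean(y_true == y_pred)
    p_exp = sum(np.mean(y_true == c) * np.mean(y_pred == c) for c in classes)
    return (p_obs - p_exp) / (1.0 - p_exp)

print(cohen_kappa([0, 0, 1, 1, 1, 0], [0, 1, 1, 1, 0, 0]))
```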
picture_as_pdf View full article
Marine Mammal Species Classification using Convolutional Neural Networks and a Novel Acoustic Representation (314)
Mark Thomas (Dalhousie University Faculty of Computer Science), Bruce Martin (JASCO Applied Sciences), Katie Kowarski (JASCO Applied Sciences), Briand Gaudet (JASCO Applied Sciences), Stan Matwin (Dalhousie University Faculty of Computer Science; Institute of Computer Science Polish Academy of Sciences)

Research into automated systems for detecting and classifying marine mammals in acoustic recordings is expanding internationally due to the necessity to analyze large collections of data for conservation purposes. In this work, we present a Convolutional Neural Network that is capable of classifying the vocalizations of three species of whales, non-biological sources of noise, and a fifth class pertaining to ambient noise. In this way, the classifier is capable of detecting the presence and absence of whale vocalizations in an acoustic recording. Through transfer learning, we show that the classifier is capable of learning high-level representations and can generalize to additional species. We also propose a novel representation of acoustic signals that builds upon the commonly used spectrogram representation by way of interpolating and stacking multiple spectrograms produced using different Short-time Fourier Transform (STFT) parameters. The proposed representation is particularly effective for the task of marine mammal species classification where the acoustic events we are attempting to classify are sensitive to the parameters of the STFT.
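The proposed representation can be roughly sketched as computing STFTs with several window lengths, resizing each log-magnitude spectrogram to a common shape, and stacking them as channels of one image. The window lengths, interpolation method and output size below are illustrative assumptions, not the paper's settings:

```python
import numpy as np
from scipy.signal import stft
from scipy.ndimage import zoom

def multi_stft_stack(signal, fs, window_lengths=(256, 512, 1024),
                     out_shape=(128, 128)):
    """Compute spectrograms with several STFT window lengths, resize each
    to a common shape and stack them as channels of one 'image'."""
    channels = []
    for nperseg in window_lengths:
        _, _, Z = stft(signal, fs=fs, nperseg=nperseg)
        mag = np.log1p(np.abs(Z))                      # log-magnitude spectrogram
        factors = (out_shape[0] / mag.shape[0], out_shape[1] / mag.shape[1])
        channels.append(zoom(mag, factors, order=1))   # bilinear resize
    return np.stack(channels, axis=-1)                 # (H, W, n_windows)

x = np.random.randn(8000)          # one second of fake audio at 8 kHz
print(multi_stft_stack(x, fs=8000).shape)
```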
picture_as_pdf View full article
LSTM encoder-predictor for short-term train load forecasting (352)
Kevin Pasini (Université Paris-Est; IRT SystemX), Mostepha Khouadjia (Université Paris-Est), Allou Samé (IRT SystemX), Fabrice Ganansia (SNCF- Innovation & Recherche), Latifa Oukhellou (IRT SystemX)

The increase in the amount of data collected in the transport domain can greatly benefit mobility studies and help to create high value-added mobility services for passengers as well as regulation tools for operators. The research detailed in this paper is related to the development of an advanced machine learning approach with the aim of forecasting the passenger load of trains in public transport. Predicting the crowding level on public transport can indeed be useful for enriching the information available to passengers, enabling them to better plan their daily trips. Moreover, operators will increasingly need to assess and predict network passenger load to improve train regulation processes and service quality levels. The main issues to address in this forecasting task are the variability in the train load series induced by the train schedule and the influence of several contextual factors, such as calendar information. We propose a neural network LSTM encoder-predictor combined with contextual representation learning to address this problem. Experiments are conducted on a real dataset provided by the French railway company SNCF and collected over a period of one and a half years. The prediction performance of the proposed model is compared to that of historical models and traditional machine learning models. The obtained results demonstrate the potential of the proposed LSTM encoder-predictor to address both one-step-ahead and multi-step forecasting and to outperform other models by maintaining robustness in the quality of the forecasts throughout the time horizon.
picture_as_pdf View full article
Learning Disentangled Representations of Satellite Image Time Series (372)
Eduardo H. Sanchez (IRT Saint Exupéry, Toulouse; IRIT, Université Toulouse III - Paul Sabatier), Mathieu Serrurier (IRT Saint Exupéry, Toulouse; IRIT, Université Toulouse III - Paul Sabatier), Mathias Ortner (IRT Saint Exupéry, Toulouse)

In this paper, we investigate how to learn a suitable representation of satellite image time series in an unsupervised manner by leveraging large amounts of unlabeled data. Additionally, we aim to disentangle the representation of time series into two representations: a shared representation that captures the common information between the images of a time series and an exclusive representation that contains the specific information of each image of the time series. To address these issues, we propose a model that combines a novel component called cross-domain autoencoders with the variational autoencoder (VAE) and generative adversarial network (GAN) methods. In order to learn disentangled representations of time series, our model learns the multimodal image-to-image translation task. We train our model using satellite image time series provided by the Sentinel-2 mission. Several experiments are carried out to evaluate the obtained representations. We show that these disentangled representations can be very useful to perform multiple tasks such as image classification, image retrieval, image segmentation and change detection.
picture_as_pdf View full article
A Deep Multi-Task Approach for Residual Value Forecasting (410)
Ahmed Rashed (University of Hildesheim), Shayan Jawed (University of Hildesheim), Jens Rehberg (Volkswagen Financial Services AG, Braunschweig), Josif Grabocka (University of Hildesheim), Lars Schmidt-Thieme (University of Hildesheim), Andre Hintsches (Volkswagen Financial Services AG, Braunschweig)

Residual value forecasting plays an important role in many areas, e.g., pricing leasing contracts for vehicles. High forecasting accuracy is crucial, as any overestimation will lead to lost sales due to customer dissatisfaction, while underestimation will lead to a direct loss in revenue when reselling the car at the end of the leasing contract. Current forecasting models mainly rely on trend analysis of historical sales records. However, these models require extensive manual steps to filter and preprocess those records, which in turn limits the frequency at which the models can be updated with new data. To automate, improve and speed up residual value forecasting, we propose a multi-task model that utilizes, besides the residual value itself as the main target, the actual mileage that has been driven as a co-target. While combining off-the-shelf regression models with careful feature engineering already yields useful models, we show that for residual values further quantities, such as the actual driven mileage, contain additional useful information. As this information is not available when contracts are made and the forecast is due, we model the problem as a multi-task model that uses actual mileage as a training co-target. Experiments on three Volkswagen car models show that the proposed model significantly outperforms the straightforward modeling as a single-target regression problem.
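The co-target idea can be illustrated with a toy two-term objective in which the mileage head is only used during training to regularize the shared representation. The weighting and squared-error form below are assumptions for illustration, not the paper's exact loss:

```python
import numpy as np

def multitask_loss(rv_pred, rv_true, mileage_pred, mileage_true, beta=0.3):
    """Residual value is the main target; actual mileage is a co-target that
    is only available at training time and regularizes the shared features."""
    rv_loss = np.mean((rv_pred - rv_true) ** 2)
    mileage_loss = np.mean((mileage_pred - mileage_true) ** 2)
    return rv_loss + beta * mileage_loss

print(multitask_loss(np.array([19.5, 14.0]), np.array([20.0, 15.0]),
                     np.array([41.0, 55.0]), np.array([40.0, 60.0])))
```

At prediction time only the residual-value head is needed, so the missing mileage is not a problem.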
picture_as_pdf View full article
Characterization and Early Detection of Evergreen News Articles (419)
Yiming Liao (Pennsylvania State University), Shugang Wang (The Washington Post), Eui-Hong (Sam) Han (Marriott International), Jongwuk Lee (Sungkyunkwan University), Dongwon Lee (Pennsylvania State University)

Although the majority of news articles are only viewed for days or weeks, a small fraction of news articles are read across years; these are known as evergreen news articles. Because evergreen articles maintain a timeless quality and are consistently of interest to the public, understanding their characteristics better has huge implications for news outlets and platforms, yet few studies have explicitly investigated evergreen articles. Addressing this gap, in this paper we first propose a flexible parameterized definition of evergreen articles to capture their long-term high-traffic patterns. Then, using a real dataset from The Washington Post, we unearth several distinctive characteristics of evergreen articles and build an early prediction model with encouraging results. Although less than 1
picture_as_pdf View full article
Transfer Learning in Credit Risk (491)
Hendra Suryanto (Rich Data Corporation, Singapore), Charles Guan (Rich Data Corporation, Singapore), Andrew Voumard (Rich Data Corporation, Singapore), Ghassan Beydoun (University of Technology Sydney)

In the credit risk domain, lenders frequently face situations where there is no, or limited historical lending outcome data. This generally results in limited or unaffordable credit for some individuals and small businesses. Transfer learning can potentially reduce this limitation, by leveraging knowledge from related domains, with sufficient outcome data. We investigated the potential for applying transfer learning across various credit domains, for example, from the credit card lending and debt consolidation domain into the small business lending domain.
picture_as_pdf View full article Reproducible Research
Wearable-based Parkinson's Disease Severity Monitoring using Deep Learning (575)
Jann Goschenhofer (Dept. of Statistics, Ludwig-Maximilians University, Munich; ConnectedLife GmbH, Munich), Franz M. J. Pfister (Dept. of Statistics, Ludwig-Maximilians University, Munich; ConnectedLife GmbH, Munich), Kamer Ali Yuksel (ConnectedLife GmbH, Munich), Bernd Bischl (Dept. of Statistics, Ludwig-Maximilians University, Munich), Urban Fietzek (Dept. of Neurology, Ludwig-Maximilians University, Munich; Dept. of Neurology), Janek Thomas (Clinical Neurophysiology, Schoen Clinic Schwabing)

One major challenge in the medication of Parkinson's disease is that the severity of the disease, reflected in the patients' motor state, cannot be measured using accessible biomarkers. Therefore, we develop and examine a variety of statistical models to detect the motor state of such patients based on sensor data from a wearable device. We find that deep learning models consistently outperform a classical machine learning model applied to hand-crafted features in this time series classification task. Furthermore, our results suggest that treating this problem as a regression instead of an ordinal regression or a classification task is most appropriate. For consistent model evaluation and training, we adapt the leave-one-subject-out validation scheme to the training of deep learning models. We also employ a class-weighting scheme to successfully mitigate the problem of high multi-class imbalances in this domain. In addition, we propose a customized performance measure that reflects the requirements of the involved medical staff on the model. To solve the problem of limited availability of high-quality training data, we propose a transfer learning technique which helps to improve model performance substantially. Our results suggest that deep learning techniques offer a high potential to autonomously detect motor states of patients with Parkinson's disease.
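Leave-one-subject-out validation simply means that every fold holds out all recordings of one subject; scikit-learn's LeaveOneGroupOut expresses this directly. The classifier and toy data below are placeholders, not the authors' deep models:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import LeaveOneGroupOut

# Toy data: 4 subjects, 10 sensor windows each, 6 features per window.
rng = np.random.default_rng(0)
X = rng.normal(size=(40, 6))
y = rng.integers(0, 2, size=40)
subject = np.repeat(np.arange(4), 10)   # subject id of each window

logo = LeaveOneGroupOut()
for train_idx, test_idx in logo.split(X, y, groups=subject):
    clf = LogisticRegression(max_iter=1000).fit(X[train_idx], y[train_idx])
    held_out = subject[test_idx][0]
    print(f"subject {held_out}: accuracy {clf.score(X[test_idx], y[test_idx]):.2f}")
```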
picture_as_pdf View full article
Optimizing Neural Networks for Patent Classification (619)
Louay Abdelgawad (Averbis GmbH, Freiburg), Peter Kluegl (Averbis GmbH, Freiburg), Erdan Genc (Averbis GmbH, Freiburg), Stefan Falkner (Albert-Ludwigs University of Freiburg), Frank Hutter (Albert-Ludwigs University of Freiburg)

A great number of patents is filed every day with patent offices worldwide. Each of these patents has to be labeled by domain experts with one or many of thousands of categories. This process is not only extremely expensive but also overwhelming for the experts, due to the considerable increase of filed patents over the years and the increasing complexity of the hierarchical categorization structure. Therefore, it is critical to automate the manual classification process using a classification model. In this paper, the automation of the task is carried out based on recent advances in deep learning for NLP and compared to customized approaches. Moreover, an extensive optimization analysis grants insights about hyperparameter importance. Our optimized convolutional neural network achieves a new state-of-the-art performance of 55.02
picture_as_pdf View full article Reproducible Research
Investigating Time Series Classification Techniques for Rapid Pathogen Identification with Single-Cell MALDI-TOF Mass Spectrum Data (620)
Christina Papagiannopoulou (Ghent University), René Parchen (BiosparQ B.V., Leiden), Willem Waegeman (Ghent University)

Matrix-assisted laser desorption/ionization-time-of-flight mass spectrometry (MALDI-TOF-MS) is a well-known technology, widely used in species identification. Specifically, MALDI-TOF-MS is applied on samples that usually include bacterial cells, generating representative signals for the various bacterial species. However, for a reliable identification result, a significant amount of biomass is required. For most samples used for diagnostics of infectious diseases, the sample volume is too low to obtain the required amount of biomass. Therefore, amplification of the bacterial load is performed by a culturing phase. If the MALDI process could be applied to individual bacteria, it would be possible to circumvent the need for culturing and isolation, accelerating the whole process. In this paper, we briefly describe an implementation of a MALDI-TOF MS procedure in a setting of individual cells and we demonstrate the use of the produced data for the application of pathogen identification. The identification of pathogens (bacterial species) is performed by using machine learning algorithms on the generated single-cell signals. The high predictive performance of the machine learning models indicates that the produced bacterial signatures constitute an informative representation, helpful in distinguishing the different bacterial species. In addition, we reformulate the bacterial species identification problem as a time series classification task by considering the intensity sequences of a given spectrum as time series values. Experimental results show that algorithms originally introduced for time series analysis are beneficial in modelling observations of single-cell MALDI-TOF MS.
picture_as_pdf View full article
The Search for Equations - Learning to Identify Similarities between Mathematical Expressions (657)
Lukas Pfahler (TU Dortmund University), Jonathan Schill (TU Dortmund University), Katharina Morik (TU Dortmund University)

On your search for scientific articles relevant to your research question, you judge the relevance of a mathematical expression that you stumble upon using extensive background knowledge about the domain, its problems and its notations. We wonder if machine learning can support this process and work toward implementing a search engine for mathematical expressions in scientific publications. Thousands of scientific publications with millions of mathematical expressions or equations are accessible at arXiv.org. We want to use this data to learn about equations, their distribution and their relations in order to find similar equations. To this end we propose an embedding model based on convolutional neural networks that maps bitmap images of equations into a low-dimensional vector space where similarity is evaluated via the dot product. However, no annotated similarity data is available to train this mapping. We mitigate this by proposing a number of different unsupervised proxy tasks that use available features as weak labels. We evaluate our system using a number of metrics, including results on a small hand-labeled subset of equations. In addition, we show and discuss a number of result sets for some sample queries. The results show that we are able to automatically identify related mathematical expressions. Our dataset is published at https://whadup.github.io/EquationLearning/ and we invite the community to use it.
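Once equations are embedded, retrieval reduces to ranking stored vectors by dot product against the query embedding. A minimal sketch with random placeholder embeddings (the real vectors would come from the CNN described above):

```python
import numpy as np

def top_k_similar(query_vec, equation_vecs, k=3):
    """Rank equations by dot-product similarity to the query embedding."""
    scores = equation_vecs @ query_vec
    order = np.argsort(-scores)[:k]
    return list(zip(order.tolist(), scores[order].tolist()))

rng = np.random.default_rng(42)
equation_vecs = rng.normal(size=(1000, 64))   # embeddings of 1000 equations
query_vec = rng.normal(size=64)               # embedding of the query equation
print(top_k_similar(query_vec, equation_vecs))
```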
picture_as_pdf View full article Reproducible Research
Pushing the Limits of Exoplanet Discovery via Direct Imaging with Deep Learning (680)
Kai Hou Yip (University College London), Nikolaos Nikolaou (University College London), Piero Coronica (University of Cambridge), Angelos Tsiaras (University College London), Billy Edwards (University College London), Quentin Changeat (University College London), Mario Morvan (University College London), Beth Biller (University of Edinburgh), Sasha Hinkley (University of Exeter), Jeffrey Salmond (University of Cambridge), Matthew Archer (University of Cambridge), Paul Sumption (University of Cambridge), Elodie Choquet (Aix Marseille Univ), Remi Soummer (STScI), Laurent Pueyo (STScI), Ingo P. Waldmann (University College London)

Further advances in exoplanet detection and characterisation require sampling a diverse population of extrasolar planets. One technique to detect these distant worlds is through the direct detection of their thermal emission. The so-called direct imaging technique is suitable for observing young planets far from their star. These are very low signal-to-noise-ratio (SNR) measurements, and limited ground truth hinders the use of supervised learning approaches. In this paper, we combine deep generative and discriminative models to bypass the issues arising when directly training on real data. We use a Generative Adversarial Network to obtain a suitable dataset for training Convolutional Neural Network classifiers to detect and locate planets across a wide range of SNRs. Tested on artificial data, our detectors exhibit good predictive performance and robustness across SNRs. To demonstrate the limits of the detectors, we provide maps of the precision and recall of the model per pixel of the input image. On real data, the models can re-confirm bright source detections.
picture_as_pdf View full article Reproducible Research
Cold-Start Recommendation for On-Demand Cinemas (689)
Beibei Li (State Key Laboratory of Computer Sciences, Institute of Software, Chinese Academy of Sciences; University of Chinese Academy of Sciences), Beihong Jin (State Key Laboratory of Computer Sciences, Institute of Software, Chinese Academy of Sciences; University of Chinese Academy of Sciences), Taofeng Xue (State Key Laboratory of Computer Sciences, Institute of Software, Chinese Academy of Sciences; University of Chinese Academy of Sciences), Kunchi Liu (State Key Laboratory of Computer Sciences, Institute of Software, Chinese Academy of Sciences; University of Chinese Academy of Sciences), Qi Zhang (Beijing iQIYI Cinema Management Co., Ltd.), Sihua Tian (Beijing iQIYI Cinema Management Co., Ltd.)

On-demand cinemas, which have emerged in recent years, provide offline entertainment venues for individuals and small groups. Because of limitations in network speed and storage space, it is necessary to recommend movies to cinemas, that is, to suggest that cinemas download the recommended movies in advance. This is particularly true for new cinemas. For the new-cinema cold-start recommendation, we build a matrix factorization framework and then fuse the location categories of cinemas and the co-popular relationship between movies into the framework. Specifically, location categories of cinemas are learned through LDA from the type information of POIs around the cinemas and used to approximate cinema latent representations. Moreover, an SPPMI matrix that reflects the co-popular relationship between movies is constructed and factorized collectively with the interaction matrix by sharing the same item latent representations. Extensive experiments on real-world data are conducted. The experimental results show that the proposed approach yields significant improvements over state-of-the-art cold-start recommenders in terms of precision, recall and NDCG.
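An SPPMI (shifted positive pointwise mutual information) matrix is built from co-occurrence counts; here is a sketch from toy movie co-popularity counts (the counts and shift value are made up for illustration, and the paper's construction of the counts themselves may differ):

```python
import numpy as np

def sppmi(cooc, shift_k=1):
    """Shifted positive PMI from a symmetric co-occurrence count matrix."""
    total = cooc.sum()
    row = cooc.sum(axis=1, keepdims=True)
    col = cooc.sum(axis=0, keepdims=True)
    with np.errstate(divide="ignore", invalid="ignore"):
        pmi = np.log(cooc * total / (row * col))
    pmi[~np.isfinite(pmi)] = 0.0                 # zero counts contribute nothing
    return np.maximum(pmi - np.log(shift_k), 0.0)

# Toy co-popularity counts between 3 movies.
cooc = np.array([[0, 8, 2],
                 [8, 0, 4],
                 [2, 4, 0]], dtype=float)
print(sppmi(cooc, shift_k=2))
```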
picture_as_pdf View full article
Data-driven Policy on Feasibility Determination for the Train Shunting Problem (690)
Paulo Roberto de Oliveira da Costa (Eindhoven University of Technology), Jason Rhuggenaath (Eindhoven University of Technology), Yingqian Zhang (Eindhoven University of Technology), Alp Akcay (Eindhoven University of Technology), Wan-Jui Lee (Dutch Railways), Uzay Kaymak (Eindhoven University of Technology)

Parking, matching, scheduling, and routing are common problems in train maintenance. In particular, train units are commonly maintained and cleaned at dedicated shunting yards. The planning problem that results from such situations is referred to as the Train Unit Shunting Problem (TUSP). This problem involves matching arriving train units to service tasks and determining the schedule for departing trains. The TUSP is an important problem as it is used to determine the capacity of shunting yards and arises as a sub-problem of more general scheduling and planning problems. In this paper, we consider the case of the Dutch Railways (NS) TUSP. As the TUSP is complex, NS currently uses a local search (LS) heuristic to determine if an instance of the TUSP has a feasible solution. Given the number of shunting yards and the size of the planning problems, improving the evaluation speed of the LS brings significant computational gain. In this work, we use a machine learning approach that complements the LS and accelerates the search process. We use a Deep Graph Convolutional Neural Network (DGCNN) model to predict the feasibility of solutions obtained during the run of the LS heuristic. We use this model to decide whether to continue or abort the search process. In this way, the computation time is used more efficiently as it is spent on instances that are more likely to be feasible. Using simulations based on real-life instances of the TUSP, we show how our approach improves upon the previous method on prediction accuracy and leads to computational gains for the decision-making process.
picture_as_pdf View full article
Player Vectors: Characterizing Soccer Players' Playing Style from Match Event Streams (701)
Tom Decroos (KU Leuven), Jesse Davis (KU Leuven)

Transfer fees for soccer players are at an all-time high. To make the most of their budget, soccer clubs need to understand the type of players they have and the type of players that are on the market. Current insights in the playing style of players are mostly based on the opinions of human soccer experts such as trainers and scouts. Unfortunately, their opinions are inherently subjective and thus prone to faults. In this paper, we characterize the playing style of a player in a more rigorous, objective and data-driven manner. We characterize the playing style of a player using a so-called `player vector' that can be interpreted both by human experts and machine learning systems. We demonstrate the validity of our approach by retrieving player identities from anonymized event stream data and present a number of use cases related to scouting and monitoring player development in top European competitions.
picture_as_pdf View full article
Shallow Self-Learning for Reject Inference in Credit Scoring (730)
Nikita Kozodoi (Humboldt University of Berlin; Kreditech, Hamburg), Panagiotis Katsas (Kreditech, Hamburg), Stefan Lessmann (Humboldt University of Berlin), Luis Moreira-Matias (Kreditech, Hamburg), Konstantinos Papakonstantinou (Kreditech, Hamburg)

Credit scoring models support loan approval decisions in the financial services industry. Lenders train these models on data from previously granted credit applications, where the borrowers' repayment behavior has been observed. This approach creates sample bias: the scoring model is trained on accepted cases only, and applying it to screen applications from the population of all borrowers degrades its performance. Reject inference comprises techniques to overcome sampling bias by assigning labels to rejected cases. This paper makes two contributions. First, we propose a self-learning framework for reject inference. The framework is geared toward real-world credit scoring requirements by considering distinct training regimes for labeling and model training. Second, we introduce a new measure to assess the effectiveness of reject inference strategies. Our measure leverages domain knowledge to avoid artificial labeling of rejected cases during evaluation. We demonstrate that this approach offers a robust and operational assessment of reject inference. Experiments on a real-world credit scoring data set confirm the superiority of the suggested self-learning framework over previous reject inference strategies. We also find strong evidence that the proposed evaluation measure assesses reject inference strategies more reliably, raising the performance of the eventual scoring model.
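For orientation, the generic self-learning pattern underlying such a framework pseudo-labels rejected applications that the current model is confident about and retrains on the enlarged set. The sketch below is that generic loop with a placeholder classifier and random data, not the authors' credit-scoring-specific training regimes:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def self_learning(X_acc, y_acc, X_rej, n_rounds=3, conf=0.9):
    """Generic self-learning: iteratively pseudo-label rejected applications
    the current model is confident about, then retrain on the enlarged set."""
    X_train, y_train = X_acc.copy(), y_acc.copy()
    remaining = X_rej.copy()
    model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
    for _ in range(n_rounds):
        if len(remaining) == 0:
            break
        proba = model.predict_proba(remaining)
        confident = proba.max(axis=1) >= conf
        if not confident.any():
            break
        pseudo_labels = model.classes_[proba[confident].argmax(axis=1)]
        X_train = np.vstack([X_train, remaining[confident]])
        y_train = np.concatenate([y_train, pseudo_labels])
        remaining = remaining[~confident]
        model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
    return model

rng = np.random.default_rng(1)
X_acc = rng.normal(size=(200, 5)); y_acc = rng.integers(0, 2, 200)  # accepted, labeled
X_rej = rng.normal(size=(100, 5))                                   # rejected, unlabeled
print(self_learning(X_acc, y_acc, X_rej))
```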
picture_as_pdf View full article
J3R: Joint Multi-task Learning of Ratings and Review Summaries for Explainable Recommendation (742)
Avinesh P.V.S. (Technische Universität Darmstadt), Yongli Ren (RMIT University), Christian M. Meyer (Technische Universität Darmstadt), Jeffrey Chan (RMIT University), Zhifeng Bao (RMIT University), Mark Sanderson (RMIT University)

We learn user preferences from ratings and reviews by using multi-task learning (MTL) of rating prediction and summarization of item reviews. Reviews of an item tend to describe detailed user preferences (e.g., the cast, genre, or screenplay of a movie). A summary of such a review or a rating describes an overall user experience of the item. Our objective is to learn latent vectors which are shared across rating prediction and review summary generation. Additionally, the learned latent vectors and the generated summary act as explanations for the recommendation. Our MTL-based approach J3R uses a multi-layer perceptron for rating prediction, combined with pointer-generator networks with an attention mechanism for the summarization component. We provide empirical evidence that joint learning of rating prediction and summary generation is beneficial for recommendation by conducting experiments on the Yelp dataset and six domains of the Amazon 5-core dataset. Additionally, we provide two forms of explanation, visualizing (a) the user vectors on different topics of a domain, computed from our J3R approach, and (b) a ten-word review summary of a review and the attention highlights generated on the review based on the user-item vectors.
picture_as_pdf View full article
Automated Data Transformation with Inductive Programming and Dynamic Background Knowledge (743)
Lidia Contreras-Ochando (Universitat Politècnica de València), Cèsar Ferri (Universitat Politècnica de València), José Hernández-Orallo (Universitat Politècnica de València), Fernando Martínez-Plumed (Universitat Politècnica de València), María José Ramírez-Quintana (Universitat Politècnica de València), Susumu Katayama (University of Miyazaki)

Data quality is essential for database integration, machine learning and data science in general. Despite the increasing number of tools for data preparation, the most tedious tasks of data wrangling (and feature manipulation in particular) still resist automation, partly because the problem strongly depends on domain information. For instance, if the strings "17th of August of 2017" and "2017-08-17" are to be formatted into "08/17/2017" to be properly recognised by a data analytics tool, humans usually process this in two steps: (1) they recognise that this is about dates and (2) they apply conversions that are specific to the date domain. However, the mechanisms to manipulate dates are very different from those to manipulate addresses. This requires huge amounts of background knowledge, which usually becomes a bottleneck as the diversity of domains and formats increases. In this paper we help alleviate this problem by using inductive programming (IP) with a dynamic background knowledge (BK) fuelled by a machine learning meta-model that selects the domain, the primitives, or both, from several descriptive features of the data wrangling problem. We illustrate these new alternatives for the automation of data format transformation, which we evaluate on an integrated benchmark and code for data wrangling that we share publicly with the community.
picture_as_pdf View full article Reproducible Research
A Semi-Supervised and Online Learning Approach for Non-Intrusive Load Monitoring (844)
Hajer Salem (Institut Mines-Télécom Lille Douai; Manouba University), Moamar Sayed-Mouchaweh (Institut Mines-Télécom Lille Douai)

Non-Intrusive Load Monitoring (NILM) approaches aim at identifying the consumption of a single appliance from the total load provided by smart meters. Several research works based on Hidden Markov Models (HMM) were developed for NILM where training is performed offline. However, these approaches suffer from different issues: First, they fail to generalize to unseen appliances with different configurations or brands than the ones used for training. Second, obtaining data about all active states of each appliance requires a long time, which is impractical for residents. Third, offline training requires storage of huge amounts of data, forcing resident consumption data to be shared with external servers and causing privacy issues. Therefore, in this paper, a new approach is proposed in order to tackle these issues. This approach is based on the use of an HMM conditioned on discriminant contextual features (e.g., time of usage, duration of usage). The conditional HMM (CHMM) is trained online using data related to a single appliance's consumption extracted from the aggregated load in order to adapt its parameters to the appliance's specificities (e.g., brand, configuration, etc.). Experiments are performed using real data from publicly available data sets, and a comparative evaluation is performed on a publicly available NILM framework.
picture_as_pdf View full article
Compact Representation of a Multi-dimensional Combustion Manifold Using Deep Neural Networks (863)
Sushrut Bhalla (University of Waterloo), Matthew Yao (University of Waterloo), Jean-Pierre Hickey (University of Waterloo), Mark Crowley (University of Waterloo)

The computational challenges in turbulent combustion simulations stem from the physical complexities and multi-scale nature of the problem, which make it intractable to compute scale-resolving simulations. For most engineering applications, the large scale separation between the flame (typically sub-millimeter scale) and the characteristic turbulent flow (typically centimeter or meter scale) allows us to invoke simplifying assumptions (such as those made for the flamelet model) to pre-compute all the chemical reactions and map them to a low-order manifold. The resulting manifold is then tabulated and looked up at run-time. As the physical complexity of combustion simulations increases (including radiation, soot formation, pressure variations, etc.), the dimensionality of the resulting manifold grows, which impedes efficient tabulation and look-up. In this paper we present a novel approach to model the multi-dimensional combustion manifold. We approximate the combustion manifold using a neural network function approximator and use it to predict the temperature and composition of the reaction. We present a novel training procedure which is developed to generate a smooth output curve for temperature over the course of a reaction. We then evaluate our work against the current approach of tabulation with linear interpolation in combustion simulations. We also provide an ablation study of our training procedure in the context of over-fitting in our model. The combustion dataset used for the modeling of combustion of H2 and O2 in this work is released alongside this paper.
picture_as_pdf View full article
CASTNet: Community-Attentive Spatio-Temporal Networks for Opioid Overdose Forecasting (900)
Ali Mert Ertugrul (University of Pittsburgh; Middle East Technical University, Ankara), Yu-Ru Lin (University of Pittsburgh), Tugba Taskaya-Temizel (Middle East Technical University, Ankara)

Opioid overdose is a growing public health crisis in the United States. This crisis, recognized as the "opioid epidemic," has widespread societal consequences, including the degradation of health and the increase in crime rates and family problems. To improve overdose surveillance and to identify the areas in need of prevention effort, in this work we focus on forecasting opioid overdose using real-time crime dynamics. Previous work identified various types of links between opioid use and criminal activities, such as financial motives and common causes. Motivated by these observations, we propose a novel spatio-temporal predictive model for opioid overdose forecasting by leveraging the spatio-temporal patterns of crime incidents. Our proposed model incorporates multi-head attentional networks to learn different representation subspaces of features. Such a deep learning architecture, called "community-attentive" networks, allows the prediction for a given location to be optimized by a mixture of groups (i.e., communities) of regions. In addition, our proposed model allows for interpreting what features, from which communities, contribute more to predicting local incidents, as well as how these communities are captured through forecasting. Our results on two real-world overdose datasets indicate that our model achieves superior forecasting performance and provides meaningful interpretations in terms of spatio-temporal relationships between the dynamics of crime and those of opioid overdose.
picture_as_pdf View full article Reproducible Research
Dynamic Principal Projection for Cost-Sensitive Online Multi-Label Classification (J30)
Hong-Min Chu, Kuan-Hao Huang, Hsuan-Tien Lin


link View full article
Aggregating Algorithm for Prediction of Packs (J02)
Dmitry Adamskiy, Anthony Bellotti, Raisa Dzhamtyrova, Yuri Kalnishkan


link View full article
Efficient Feature Selection Using Shrinkage Estimators (J31)
Konstantinos Sechidis, Laura Azzimonti, Adam Pocock, Giorgio Corani, James Weatherall, Gavin Brown


link View full article
Grouped Gaussian Processes for Solar Power Prediction (J26)
Astrid Dahl, Edwin V. Bonilla


link View full article
LSALSA: Accelerated Source Separation via Learned Sparse Coding (J28)
Benjamin Cowen, Apoorva Nandini Saridena, Anna Choromanska


link View full article
Data scarcity, robustness and extreme multi-label classification (J29)
Rohit Babbar, Bernhard Schölkopf


link View full article
Joint Detection of Malicious Domains and Infected Clients (J19)
Paul Prasse, René Knaebel, Lukáš Machlica, Tomáš Pevný, Tobias Scheffer


link View full article
A Flexible Probabilistic Framework for Large-Margin Mixture of Experts (J08)
Archit Sharma, Siddhartha Saxena, Piyush Rai


link View full article
Deep Collective Matrix Factorization for Augmented Multi-View Learning (J05)
Ragunathan Mariappan, Vaibhav Rajan


link View full article
Temporal Pattern Attention for Multivariate Time Series Forecasting (J32)
Shun-Yao Shih, Fan-Keng Sun, Hung-yi Lee


link View full article
Compatible Natural Gradient Policy Search (J24)
Joni Pajarinen, Hong Linh Thai, Riad Akrour, Jan Peters, Gerhard Neumann


link View full article
TD-Regularized Actor-Critic Methods (J09)
Simone Parisi, Voot Tangkaratt, Jan Peters, Mohammad Emtiyaz Khan


link View full article
On PAC-Bayesian Bounds for Random Forests (J17)
Stephan S. Lorenzen, Christian Igel, Yevgeny Seldin


link View full article
Nuclear Discrepancy for Single-Shot Batch Active Learning (J18)
Tom J. Viering, Jesse H. Krijthe, Marco Loog


link View full article
Efficient learning with robust gradient descent (J14)
Matthew J. Holland, Kazushi Ikeda


link View full article
CaDET: Interpretable Parametric Conditional Density Estimation with Decision Trees and Forests (J07)
Cyrus Cousins, Matteo Riondato


link View full article
On the analysis of adaptability in multi-source domain adaptation (J15)
Ievgen Redko, Amaury Habrard, Marc Sebban


link View full article
The Teaching Size: Computable Teachers and Learners for Universal Languages (J16)
Jan Arne Telle, José Hernández-Orallo, Cèsar Ferri


link View full article
Distribution-Free Uncertainty Quantification for Kernel Methods by Gradient Perturbations (J20)
Balázs Cs. Csáji, Krisztián B. Kis


link View full article
Improving latent variable descriptiveness by modelling rather than ad-hoc factors (J06)
Alex Mansbridge, Roberto Fierimonte, Ilya Feige, David Barber


link View full article
Classification with Label Noise: A Markov Chain Sampling Framework (J21)
Zijin Zhao, Lingyang Chu, Dacheng Tao, Jian Pei


link View full article
Finding lasting dense graphs (J10)
Konstantinos Semertzidis, Evaggelia Pitoura, Evimaria Terzi, Panayiotis Tsaparas


link View full article
Model-Free Inference of Diffusion Networks (J11)
Shoubo Hu, Bogdan Cautis, Zhitang Chen, Laiwan Chan, Yanhui Geng, Xiuqiang He


link View full article
Noise-free Latent Block Model for High Dimensional Data (J25)
Charlotte Laclau, Vincent Brault


link View full article
Deeply Supervised Model for Click-Through Rate Prediction in Sponsored Search (J03)
Jelena Gligorijevic, Djordje Gligorijevic, Ivan Stojkovic, Xiao Bai, Amit Goyal, Zoran Obradovic


link View full article
Robust active attacks on social graphs (J13)
Sjouke Mauw, Yunior Ramírez-Cruz, Rolando Trujillo-Rasua


link View full article
Temporal Density Extrapolation using a Dynamic Basis Approach (J01)
G. Krempl, D. Lang, V. Hofer


link View full article
Counts-of-Counts Similarity for prediction and search in relational data (J12)
Manfred Jaeger, Marco Lippi, Giovanni Pellegrini, Andrea Passerini


link View full article
Mining Skypatterns in Fuzzy Tensors (J23)
Nicolas Nadisic, Aurélien Coussat, Loïc Cerf


link View full article
Multi-location Visibility Query Processing Using Portion-based Transactional Modeling and Pattern Mining (J22)
Lakshmi Gangumalla, P. Krishna Reddy, Anirban Mondal


link View full article
Stochastic Gradient Hamiltonian Monte Carlo with Variance reduction for Bayesian inference (J27)
Zhize Li, Tianyi Zhang, Shuyu Cheng, Jun Zhu, Jian Li


link View full article
Rankboost+: An Improvement to Rankboost (J04)
Harold Connamacher (Case Western Reserve University), Nikil Pancha (Case Western Reserve University), Rui Liu (Case Western Reserve University), Soumya Ray (Case Western Reserve University)


link View full article