Below is the complete list of accepted papers; congratulations to the authors!
Papers accepted as regular talks or posters are published in the LNCS proceedings. Horizon papers are not published in the proceedings, but the authors are given the opportunity to give a longer talk at the conference.
The papers are listed in no particular order.
Regular talks
- A Bayesian Approach for Identifying Multivariate Differences Between Groups
Yuriy Sverchkov (University of Wisconsin–Madison) and Gregory Cooper (University of Pittsburgh)
Abstract
We present a novel approach to the problem of detecting multivariate statistical differences across groups of data. The need to compare data in a multivariate manner arises naturally in observational studies, randomized trials, comparative effectiveness research, abnormality and anomaly detection scenarios, and other application areas. In such comparisons, it is of interest to identify statistical differences across the groups being compared. The approach we present in this paper addresses this issue by constructing statistical models that describe the groups being compared and using a decomposable Bayesian Dirichlet score of the models to identify variables that behave statistically differently between the groups. In our evaluation, the new method performed significantly better than logistic lasso regression in identifying differences in a variety of datasets under a variety of conditions.
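To give a flavour of the scoring idea, the sketch below compares Dirichlet-multinomial marginal likelihoods of a "groups differ" model against a "shared distribution" model for a single categorical variable. This is a minimal illustration of Bayesian Dirichlet scoring, not the authors' full decomposable model; the symmetric prior and the toy counts are assumptions.

```python
import numpy as np
from scipy.special import gammaln

def log_marginal(counts, alpha=1.0):
    """Log marginal likelihood of categorical observations under a
    symmetric Dirichlet(alpha) prior (Dirichlet-multinomial)."""
    counts = np.asarray(counts, dtype=float)
    a = np.full_like(counts, alpha)
    return (gammaln(a.sum()) - gammaln((a + counts).sum())
            + np.sum(gammaln(a + counts) - gammaln(a)))

def difference_score(counts_g1, counts_g2):
    """Log Bayes factor: separate per-group distributions versus one
    distribution shared by both groups."""
    separate = log_marginal(counts_g1) + log_marginal(counts_g2)
    shared = log_marginal(np.asarray(counts_g1) + np.asarray(counts_g2))
    return separate - shared  # > 0 favours "the groups differ"

# Toy counts of one variable's values in two groups.
print(difference_score([40, 10, 5], [12, 25, 18]))   # positive: differs
print(difference_score([20, 20, 20], [19, 22, 20]))  # negative: no evidence
```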
- I-Louvain: an attributed graph clustering method
Christine Largeron (Université de Lyon), Mathias Géry (LaHC, Université Jean Monnet, Saint-Etienne (France)), David Combe (Laboratoire Hubert Curien) and Elöd Egyed-Zsigmond (LIRIS)
Abstract
Modularity makes it possible to estimate the quality of a partition of a graph into communities of highly inter-connected vertices. In this article, we introduce a complementary measure, based on inertia, specially conceived to evaluate the quality of a partition based on real-valued attributes describing the vertices. We also propose I-Louvain, a graph-node clustering method that uses our criterion, combined with Newman's modularity, to detect communities in attributed graphs where real-valued attributes are associated with the vertices. Our experiments show that combining the relational information with the attributes detects communities more effectively than using only one type of information. In addition, our method is more robust to data degradation.
- Continuous and Discrete Deep Classifiers for Data Integration
Nataliya Sokolovska (University Paris 6, INSERM U872)
Abstract
Data representation in a lower dimension is needed in applications where information comes from multiple high-dimensional sources. A final compact model has to be interpreted by human experts, and the interpretation of a classifier whose weights are discrete is much more straightforward. In this contribution, we propose a novel approach, called Deep Kernel Dimensionality Reduction, designed to learn layers of new compact data representations simultaneously. We show by experiments on standard and on real large-scale biomedical data sets that the proposed method embeds data in a new compact and meaningful representation, and leads to a lower classification error compared to the state-of-the-art methods. We also consider some state-of-the-art deep learners and their corresponding discrete classifiers. Our experiments illustrate that although purely discrete models do not always perform better than real-valued classifiers, the trade-off between model accuracy and interpretability is quite reasonable.
- Diversity-driven Widening of Hierarchical Agglomerative Clustering
Alexander Fillbrunn (University of Konstanz) and Michael R. Berthold (University of Konstanz)
Abstract
In this paper we show that diversity-driven widening, the parallel exploration of the model space with a focus on developing diverse models, can improve hierarchical agglomerative clustering. Depending on the selected linkage method, the model found through the widened search achieves a better silhouette coefficient than its sequentially built counterpart.
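As a minimal illustration of the evaluation step only, the following sketch scores agglomerative clusterings under different linkage methods by their silhouette coefficient; the widened search itself, which explores many diverse partial models in parallel, is not reproduced here, and the synthetic data are an assumption.

```python
from sklearn.cluster import AgglomerativeClustering
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

X, _ = make_blobs(n_samples=300, centers=4, random_state=0)

# Compare the partitions produced by different linkage methods.
for linkage in ("ward", "complete", "average", "single"):
    labels = AgglomerativeClustering(n_clusters=4, linkage=linkage).fit_predict(X)
    print(linkage, round(silhouette_score(X, labels), 3))
```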
- Batch Steepest-Descent-Mildest-Ascent for Interactive Maximum Margin Clustering
Fabian Gieseke (Radboud University Nijmegen), Tapio Pahikkala (University of Turku) and Tom Heskes (Radboud University Nijmegen)
Abstract
The maximum margin clustering principle extends support vector machines to unsupervised scenarios. We present a variant of this clustering scheme that can be used in the context of interactive clustering scenarios. In particular, our approach permits the class ratios to be manually defined by the user during the fitting process. Our framework can be used at early stages of the data mining process when no or very little information is given about the true clusters and class ratios. One of the key contributions is an adapted steepest-descent-mildest-ascent optimization scheme that can be used to fine-tune maximum margin clustering solutions in an interactive manner. We demonstrate the applicability of our approach in the context of remote sensing and astronomy with training sets consisting of hundreds of thousands of patterns.
- Segregation Discovery in a Social Network of Companies
Alessandro Baroni (Dipartimento di Informatica, Università di Pisa) and Salvatore Ruggieri (Dipartimento di Informatica, Università di Pisa)
Abstract
We introduce a framework for the data-driven analysis of segregation of minority groups in social networks, and challenge it on a complex scenario. The framework builds on quantitative measures of segregation, called segregation indexes, proposed in the social science literature. The segregation discovery problem is introduced, which consists of searching for sub-graphs and sub-groups for which a segregation index is above a minimum threshold. A search algorithm is devised that solves the segregation discovery problem. The framework is challenged on the analysis of segregation in a real, large network of Italian companies connected through shared directors on their boards.
- Probabilistic Active Learning in Data Streams
Daniel Kottke (Knowledge Management and Discovery Lab, University Magdeburg), Georg Krempl (Knowledge Management and Discovery Lab, University Magdeburg) and Myra Spiliopoulou (Knowledge Management and Discovery Lab, University Magdeburg)
Abstract
In recent years, stream-based active learning has become an intensively investigated research topic. In this work, we propose a new algorithm for stream-based active learning that decides immediately whether to acquire a label (selective sampling). It uses Probabilistic Active Learning (PAL) to measure the spatial usefulness of each instance in the stream. To determine whether a newly arrived instance belongs to the most useful instances (temporal usefulness) given a predefined budget, we propose BIQF – a Balanced Incremental Quantile Filter. It uses a sliding window to represent the distribution of the most recent usefulness values and finds a labeling threshold using quantiles. The balancing mechanism ensures that the predefined budget will be met within a given tolerance window. We evaluate our approach against other stream-based active learning approaches on multiple datasets. The results confirm the effectiveness of our method.
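A much-simplified sketch of the filtering idea — thresholding each arriving usefulness value against a quantile of a sliding window, with a balancing term that keeps spending near the budget — is given below. The window size, default budget, and the exact balancing rule are assumptions of this sketch, not the paper's specification.

```python
from collections import deque
import numpy as np

class IncrementalQuantileFilter:
    """Simplified BIQF-style filter: acquire a label when the current
    usefulness is in the top `budget` fraction of recent values."""

    def __init__(self, budget=0.1, window=300):
        self.budget = budget
        self.values = deque(maxlen=window)   # sliding usefulness window
        self.seen = 0
        self.acquired = 0

    def decide(self, usefulness):
        self.values.append(usefulness)
        self.seen += 1
        spent = self.acquired / self.seen
        # Balancing: overspending raises the quantile, underspending lowers it.
        q = min(1.0, max(0.0, 1.0 - 2 * self.budget + spent))
        acquire = usefulness >= np.quantile(self.values, q)
        self.acquired += int(acquire)
        return acquire
```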
- Fast Algorithm Selection using Learning Curves
Jan N. van Rijn (Leiden University), Salisu Mamman Abdulrahman (University of Porto), Pavel Brazdil (University of Porto) and Joaquin Vanschoren (K.U.Leuven)
Abstract
One of the challenges in Machine Learning is, given a dataset, to find a classifier that works well on it. Performing a cross-validation evaluation procedure on all possible combinations typically takes too much time; many solutions have been proposed that attempt to predict which classifiers are most promising to try. As the first recommended classifier is not always the correct choice, multiple recommendations should be made, making this a ranking problem rather than a classification problem. Even though this is a well-studied problem, there is currently no good way of evaluating such rankings. We advocate the use of Loss Time Curves, as used in the optimization literature. These visualize the amount of budget (time) needed to converge to an acceptable solution. We investigate a method that utilizes the measured performance of classifiers on small samples of data to make such recommendations, and adapt it so that it works well in Loss Time space. Experimental results show that this method converges extremely fast to an acceptable solution.
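A loss time curve can be computed directly from the (runtime, loss) pairs of the recommended classifiers, taken in recommendation order; a minimal sketch with hypothetical numbers:

```python
import numpy as np

def loss_time_curve(recommendations):
    """Given (runtime_seconds, loss) pairs in the order a meta-learner
    recommends them, return cumulative time and best-so-far loss --
    the two axes of a loss time curve."""
    times, losses = zip(*recommendations)
    return np.cumsum(times), np.minimum.accumulate(losses)

# Hypothetical evaluations of three recommended classifiers.
t, l = loss_time_curve([(12.0, 0.31), (250.0, 0.18), (40.0, 0.22)])
print(t)  # [ 12. 262. 302.]
print(l)  # [0.31 0.18 0.18]
```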
- Diagonal Co-clustering Algorithm for Document-Word Partitioning
Charlotte Laclau (LIPADE, Universite Paris Descartes) and Mohamed Nadif (LIPADE, Universite Paris Descartes)
Abstract
We propose a novel diagonal co-clustering algorithm built upon double Kmeans to address the problem of document-word co-clustering. At each iteration, the proposed algorithm seeks a diagonal block structure of the data by minimizing a criterion based on the within variance and the centroid effect. In addition to being easy to interpret and efficient on sparse binary and continuous data, Diagonal Double Kmeans (DDKM) is also faster than other state-of-the-art clustering algorithms. We illustrate our contribution using real datasets commonly used in document clustering.
- Efficient Model Selection for Regularized Classification by Exploiting Unlabeled Data
Georgios Balikas (University Grenoble Alpes), Ioannis Partalas (University Grenoble Alpes – LIG), Eric Gaussier (University Grenoble Alpes – LIG), Rohit Babbar (Max Planck Institute for Intelligent Systems) and Massih-Reza Amini (University Grenoble Alpes – LIG)
Abstract
Hyper-parameter tuning is a resource-intensive task when optimizing classification models. The commonly used k-fold cross-validation can become intractable in large-scale settings when a classifier has to learn billions of parameters. At the same time, in real-world settings one often encounters multi-class classification scenarios with only a few labeled examples; model selection approaches offer little improvement in such cases and the default values of learners are used. We propose bounds for classification on accuracy and macro measures (precision, recall, F1) that motivate efficient schemes for model selection and can benefit from the existence of unlabeled data. We demonstrate the advantages of those schemes by comparing them with k-fold cross-validation and hold-out estimation in the setting of large-scale classification.
- Automatically Discovering Offensive Patterns in Soccer Match Data
Jan Van Haaren (KU Leuven), Vladimir Dzyuba (KU Leuven), Siebe Hannosset (Club Brugge KV) and Jesse Davis (KU Leuven)
Abstract
In recent years, many professional sports clubs have adopted camera-based tracking technology that captures the location of both the players and the ball at a high frequency. Nevertheless, the valuable information hidden in these performance data is rarely used in the clubs' decision-making processes. What is missing are the computational methods to analyze these data in great depth. This paper addresses the task of automatically discovering patterns in offensive strategies in professional soccer matches. To address this task, we propose an inductive logic programming approach that can easily deal with the relational structure of the data. An experimental study shows the utility of our approach.
- VoQs: A Web Application for Visualization of Questionnaire Surveys
Xiaowei Zhang (Helmholtz Centre for Infection Research), Frank Klawonn (Helmholtz Centre for Infection Research), Lorenz Grigull (Medical University Hanover) and Werner Lechner (Improved Medical Diagnostics, Pte. Ltd.)
Abstract
This paper is motivated by the analysis of questionnaire data collected from patients who suffer from an orphan disease. In order to decrease misdiagnoses and shorten the diagnosis time for patients who have not yet been diagnosed but have a long history of health problems, a research project was set up to use questionnaires and data analysis methods for the early detection of orphan diseases, with questionnaires designed based on experience with patients who already have a diagnosis. The main focus of this work is to visualize the answering patterns that characterize patients with a specific orphan disease, which questions are most useful to distinguish between certain orphan diseases, and how well the answering pattern of a specific patient fits the general pattern of patients who share the same disease. We borrow from the concept of sequence logos, commonly used in genetics to visualize the conservation of nucleotides in a strand of DNA or RNA. Instead of nucleotides, we have the possible answers to a question. Our proposed visualization techniques are not limited to questionnaires on orphan diseases but can also be applied to any questionnaire survey with closed-ended questions for which we are interested in the answering characteristics of different groups.
Regular posters
- Implicitly Constrained Semi-Supervised Least Squares Classification
Jesse H. Krijthe (Delft University of Technology/Leiden University) and Marco Loog (Delft University of Technology/University of Copenhagen)
Abstract
We introduce a novel semi-supervised version of the least squares classifier. This implicitly constrained least squares (ICLS) classifier minimizes the squared loss on the labeled data among the set of parameters implied by all possible labelings of the unlabeled data. Unlike other discriminative semi-supervised methods, our approach does not introduce explicit additional assumptions into the objective function, but leverages implicit assumptions already present in the choice of the supervised least squares classifier. We show this approach can be formulated as a quadratic programming problem and its solution can be found using a simple gradient descent procedure. We prove that, in a certain sense, our method never leads to performance worse than the supervised classifier. Experimental results corroborate this theoretical result on benchmark datasets in the multidimensional case, also in terms of the error rate.
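The construction can be sketched as follows: search over soft labelings q of the unlabeled data, where each q implies a least squares solution, and keep the q whose implied solution fits the labeled data best. The paper formulates this as a quadratic program with safety guarantees; the projected gradient loop, step size, and binary 0/1 label encoding below are assumptions of this sketch.

```python
import numpy as np

def icls_fit(X_l, y_l, X_u, steps=500, lr=1e-3):
    """Sketch of implicitly constrained least squares: pick soft labels
    q in [0,1] for the unlabeled points so that the least squares
    solution on all data has minimal squared loss on the labeled data."""
    X = np.vstack([X_l, X_u])
    P = np.linalg.pinv(X)                 # w(q) = P @ [y_l; q]
    q = np.full(len(X_u), 0.5)            # start from maximal uncertainty
    for _ in range(steps):
        w = P @ np.concatenate([y_l, q])
        grad_w = 2 * X_l.T @ (X_l @ w - y_l)    # d loss / d w
        grad_q = P[:, len(y_l):].T @ grad_w     # chain rule back to q
        q = np.clip(q - lr * grad_q, 0.0, 1.0)  # project onto [0,1]
    return P @ np.concatenate([y_l, q])   # final weight vector
```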
- A Comparative Study of Clustering Methods with Multinomial Distribution
Md Abul Hasnat (ERIC Lab, University Lyon 2), Julien Velcin (ERIC Lab, University Lyon 2), Stéphane Bonnevay (University Lyon 1) and Julien Jacques (ERIC Lab, University Lyon 2)
Abstract
In this paper, we study different discrete data clustering methods which employ Model-Based Clustering (MBC) with the Multinomial distribution. We propose a novel MBC method by efficiently combining partitional and hierarchical clustering techniques. We conduct experiments on both synthetic and real data and evaluate the methods using accuracy, stability and computation time. Our study identifies appropriate strategies that should be used for different sub-tasks in unsupervised discrete data analysis with the MBC method.
- A parallel distributed processing algorithm for image feature extraction
Alexander Belousov (Ariel University) and Joel Ratsaby (Ariel University)
Abstract
We present a new parallel algorithm for image feature extraction that uses a recently introduced distance function (UID) between images, based on the LZ-complexity of the string representations of the two images. An input image is represented by a feature vector whose components are the distance values between its parts (sub-images) and a set of prototypes. The algorithm is highly scalable and computes these values in parallel. We implement it on a massively parallel graphics processing unit (GPU) with several thousand cores, distributing the computation of the distances between thousands of sub-images and prototypes. This yields a very large increase in computational speed, typically reducing the processing time from hours to seconds. Given a corpus of input images, it produces a table of labeled cases that can be used by any supervised or unsupervised learning algorithm to learn image classification or clustering. A main advantage is that no image processing or image analysis is needed; the user defines image features only once, through the simple process of choosing a few small images to serve as prototypes. Results for several image classification problems are presented.
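The distance principle — comparing the complexities of two strings and of their concatenation — is close in spirit to the normalized compression distance; here is a minimal sketch using zlib as a stand-in compressor. The paper's UID is a specific LZ-complexity-based function, not zlib, and the prototype byte strings below are toy assumptions.

```python
import zlib

def c(s: bytes) -> int:
    """Compressed length as a stand-in for LZ-complexity."""
    return len(zlib.compress(s, 9))

def ncd(x: bytes, y: bytes) -> float:
    """Normalized compression distance between two byte strings."""
    cx, cy = c(x), c(y)
    return (c(x + y) - min(cx, cy)) / max(cx, cy)

# A sub-image's feature = its distance to each prototype byte string.
prototypes = [b"\x00" * 512, bytes(range(256)) * 2]
sub_image = bytes(range(256)) * 2
print([round(ncd(sub_image, p), 3) for p in prototypes])
```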
- Exploratory topic modeling with distributional semantics
Samuel Rönnqvist (Turku Centre for Computer Science)
Abstract
As we continue to collect and store textual data in a multitude of domains, we are regularly confronted with material whose largely unknown thematic structure we want to uncover. With unsupervised, exploratory analysis, no prior knowledge about the content is required and highly open-ended tasks can be supported. In the past few years, probabilistic topic modeling has emerged as a popular approach to this problem. Nevertheless, the representation of the latent topics as aggregations of semi-coherent terms limits their interpretability and level of detail. This paper presents an alternative approach to topic modeling that maps topics as a network for exploration, based on distributional semantics using learned word vectors. From the granular level of terms and their semantic similarity relations, global topic structures emerge as clustered regions and gradients of concepts. Moreover, the paper discusses the visual interactive representation of the topic map, which plays an important role in supporting its exploration.
- Constraint-Based Querying for Bayesian Network Exploration
Behrouz Babaki (KU Leuven), Tias Guns (KU Leuven), Siegfried Nijssen (KU Leuven) and Luc De Raedt (KU Leuven)
Abstract
Understanding the knowledge that resides in a Bayesian network can be hard, certainly when a large network is to be used for the first time, or when the network is complex or has just been updated. Tools to assist users in the analysis of Bayesian networks can help. In this paper, we introduce a novel general framework and tool for answering exploratory queries over Bayesian networks. The framework is inspired by queries from the constraint-based mining literature designed for the exploratory analysis of data. Adapted to Bayesian networks, these queries specify a set of constraints on explanations of interest, where an explanation is an assignment to a subset of variables in a network. Characteristic for the methodology is that it searches over different subsets of the explanations, corresponding to different marginalizations. A general purpose framework, based on principles of constraint programming, data mining and knowledge compilation, is used to answer all possible queries. This CP4BN framework employs a rich set of constraints and is able to emulate a range of existing queries from both the Bayesian network and the constraint-based data mining literature.
- Using entropy as a measure of acceptance for multi-label classification
Laurence Park (University of Western Sydney) and Simeon Simoff (University of Western Sydney)
Abstract
Multi-label classifiers allow us to predict the state of a set of responses using a single model. A multi-label model is able to make use of the correlation between the labels to potentially increase the accuracy of its predictions. Critical applications of multi-label classifiers (such as medical diagnosis) require that the system's confidence in a prediction be provided along with the prediction itself. The specialist then uses the measure of confidence to assess whether to accept the system's prediction. Probabilistic multi-label classification provides a categorical distribution over the set of responses, allowing us to observe the distribution, select the most probable response, and obtain an indication of confidence from the shape of the distribution. In this article, we examine whether normalised entropy, a property of the probabilistic multi-label response distribution, correlates with the accuracy of the prediction and can therefore be used to gauge confidence in the system's prediction. We found that for all three methods examined on each data set, accuracy increases for the majority of observations as the normalised entropy threshold decreases, showing that normalised entropy can be used to gauge a system's confidence, and hence as a measure of acceptance.
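Normalised entropy itself is a one-liner: divide the entropy of the predicted categorical distribution by its maximum, log K. A minimal sketch, with a hypothetical acceptance threshold of 0.5:

```python
import numpy as np

def normalised_entropy(p):
    """Entropy of a categorical distribution over K outcomes divided
    by its maximum log(K): 0 means a certain prediction, 1 a uniform
    (maximally unconfident) one."""
    p = np.asarray(p, dtype=float)
    k = p.size
    nz = p[p > 0]                      # 0 * log(0) = 0 by convention
    return float(-(nz * np.log(nz)).sum() / np.log(k))

# Accept the prediction only if the model is confident enough.
for dist in ([0.94, 0.03, 0.03], [0.4, 0.35, 0.25]):
    h = normalised_entropy(dist)
    print(dist, round(h, 3), "accept" if h < 0.5 else "refer to specialist")
```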
- Investigation of Node Deletion Techniques for Clustering Applications of Growing Self Organizing Maps
Thilina Rathnayake (University of Moratuwa), Maheshakya Wijewardena (University of Moratuwa), Thimal Kempitiya (University of Moratuwa), Kevin Rathnasekara (University of Moratuwa), Thushan Ganegedara (University of Moratuwa), Amal Perera (University of Moratuwa) and Damminda Alahakoon (La Trobe University)
Abstract
Self Organizing Maps (SOM) are widely used in data mining and high-dimensional data visualization due to their unsupervised nature and robustness. The Growing Self Organizing Map (GSOM) is a variant of the SOM algorithm which allows the nodes to grow so that it can represent the input space better. Instead of using a fixed 2D grid like SOM, GSOM starts with four nodes and keeps track of the quantization error in each node. New nodes are grown from an existing node if its error value exceeds a pre-defined threshold. The ability of the GSOM algorithm to represent the input space and accurately cluster the input instances (vectors) is vital to extending its applicability to a wider spectrum of problems. This ability can be improved by identifying nodes that represent low-probability regions of the input space and removing them periodically, which improves the homogeneity and completeness of the final clustering result. A new extension to the GSOM algorithm based on node deletion is proposed in this paper as a solution to this problem. Furthermore, two new algorithms inspired by cache replacement policies are presented. The first algorithm is based on the Adaptive Replacement Cache (ARC) and maintains two separate Least Recently Used (LRU) lists of the nodes. The second algorithm is built on the Frequency Based Replacement policy (FBR) and maintains a single LRU list. These algorithms consider both recent and frequent trends in the GSOM grid before deciding on the nodes to be deleted. The experiments conducted suggest that the FBR-based method for node deletion outperforms the naive algorithm and other existing node deletion methods.
- Using Metalearning for Prediction of Taxi Trip Duration Using Different Granularity Levels
Mohammad Nozari Zarmehri (INESC Porto) and Carlos Soares (University of Porto)
Abstract
Trip duration is an important metric for the management of taxi companies, as it affects operational efficiency, driver satisfaction and, above all, customer satisfaction. In particular, the ability to predict trip duration in advance can be very useful for allocating taxis to stands and finding the best route for trips. A data mining approach can be used to generate models for trip time prediction. In fact, given the amount of data available, different models can be generated for different taxis. Given the difference between the data collected by different taxis, the best model for each one can be obtained with different algorithms and/or parameter settings. However, finding the configuration that generates the best model for each taxi is computationally very expensive. In this paper, we propose the use of metalearning to address the problem of selecting the algorithm that generates the model with the most accurate predictions for each taxi. The approach is tested on data collected in the Drive-In project. Our results show that metalearning can help to select the algorithm with the best accuracy.
- A first-order-logic based model for grounded language learning
Leonor Becerra-Bonache (Université Jean Monnet), Hendrik Blockeel (K.U. Leuven), Maria Galvan (Université Jean Monnet) and François Jacquenet (Université Jean Monnet)
Abstract
Much is still unknown about how children learn language, but it is clear that they perform “grounded” language learning: they learn the grammar and vocabulary not just from examples of sentences, but from examples of sentences in a particular context. Grounded language learning has been the subject of much research. Most of this work focuses on particular aspects, such as constructing semantic parsers, or on particular types of applications. In this paper, we take a broader view that includes an aspect that has received little attention until now: learning the meaning of phrases from phrase/context pairs in which the phrase’s meaning is not explicitly represented. We propose a simple model for this task that uses first-order logic representations for contexts and meanings, including a simple incremental learning algorithm. We experimentally demonstrate that the proposed model can explain the gradual learning of simple concepts and language structure, and that it can easily be used for interpretation, generation, and translation of phrases.
- Slower can be faster: The iRetis incremental model tree learner
Denny Verbeeck (KULeuven) and Hendrik Blockeel (K.U. Leuven)
Abstract
Incremental learning is useful for processing streaming data, where data elements are produced at a high rate and cannot be stored. An incremental learner typically updates its model with each new instance that arrives. To avoid skipped instances, the model update must finish before the next element arrives, so it should be fast. However, there can be a trade-off between the efficiency of the update and how many updates are needed to get a good model. We investigate this trade-off in the context of model trees. We compare FIMT, a state-of-the-art incremental model tree learner developed for streaming data, with two alternative methods that use a more expensive update method. We find that for data with relatively low (but still realistic) dimensionality, the most expensive method often yields the best learning curve: the system converges faster to a smaller and more accurate model tree.
- Modeling concept drift: A probabilistic graphical model based approach
Hanen Borchani (Aalborg University), Ana M. Martinez (Aalborg University), Andres Masegosa (The Norwegian University of Science and Technology), Helge Langseth (The Norwegian University of Science and Technology), Thomas D. Nielsen (Aalborg University), Antonio Salmeron (University of Almeria), Antonio Fernandez (Banco de Credito Cooperativo), Anders L. Madsen (HUGIN EXPERT A/S) and Ramon Saez (Banco de Credito Cooperativo)
Abstract
An often used approach for detecting and adapting to concept drift when doing classification is to treat the data as i.i.d. and use changes in classification accuracy as an indication of concept drift. In this paper, we take a different perspective, and propose a framework, based on probabilistic graphical models, that explicitly represents concept drift using latent variables. To ensure efficient inference and learning, we resort to a variational Bayes inference scheme. As a proof of concept, we demonstrate and analyze the proposed framework using synthetic data sets as well as a real financial data set from a Spanish bank.
- Optimally Weighted Cluster Kriging for Big Data Regression
Bas van Stein (Leiden University), Hao Wang (Leiden University), Wojtek Kowalczyk (Leiden University), Michael Emmerich (Leiden University) and Thomas Bäck (Leiden University)
Abstract
In business and academia we continuously try to model and analyze complex processes in order to gain insight and optimize. One of the most popular modeling algorithms is Kriging, also known as Gaussian Processes. A major bottleneck of Kriging is that it requires at least O(n^3) processing time and O(n^2) memory, which makes the algorithm infeasible for the medium to big data sets that are increasingly available these days. As a solution to this problem we introduce a hybrid approach in which a number of Kriging models built on disjoint subsets of the data are properly weighted for the predictions. The proposed model is much more efficient than standard Global Kriging in both processing time and memory, performs equally well in terms of accuracy, is more scalable, and is well suited for parallelization.
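A minimal sketch of the hybrid idea using scikit-learn Gaussian processes: fit independent models on disjoint chunks of the data and combine their predictions. Inverse-variance weighting is used here as one natural choice, whereas the paper derives the optimal weights; the synthetic data, chunk count, and kernel defaults are assumptions.

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, (3000, 1))
y = np.sin(3 * X[:, 0]) + rng.normal(0, 0.1, 3000)

# Fit independent Kriging models on disjoint chunks of the data.
chunks = np.array_split(rng.permutation(3000), 10)
models = [GaussianProcessRegressor(alpha=0.01).fit(X[c], y[c]) for c in chunks]

def predict(x):
    """Combine the chunk models, weighting by inverse predictive variance."""
    mus, sigmas = zip(*(m.predict(x, return_std=True) for m in models))
    w = 1.0 / (np.array(sigmas) ** 2 + 1e-12)
    return (w * np.array(mus)).sum(axis=0) / w.sum(axis=0)

print(predict(np.array([[0.5]])))  # should approach sin(1.5) ~ 1.0
```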
- Assigning Geo-Relevance of Sentiments Mined from Location-Based Social Media Posts
Randall Sanborn (University of Michigan – Flint), Michael Farmer (University of Michigan – Flint) and Syagnik Banerjee (University of Michigan – Flint)
Abstract
Broad adoption of smartphones has increased the number of posts people generate while going about their daily lives. Many of these posts are related to the location where the post is generated, and being able to infer a person's sentiment toward a given location would be a boon to market researchers. However, the large percentage of system-generated content in these posts poses difficulties for calculating sentiment and assigning that sentiment to the location associated with the post. Consequently, our proposed system implements a sequence of text cleaning functions, followed by a naive Bayes classifier that determines whether a post is more or less likely to be associated with the individual's present location. The system was tested on a set of nearly 30,000 posts from Foursquare that had been cross-posted to Twitter, which resulted in reasonable precision but with a large number of posts discarded.
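The filtering step can be sketched as a standard bag-of-words naive Bayes pipeline over cleaned post text. The tiny training set, the labels, and the test posts below are hypothetical; the authors' actual cleaning functions and features are not reproduced here.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Hypothetical training posts: is the post about the venue itself?
posts = ["great coffee here", "just checked in", "love this gym",
         "I'm at Main St Station", "happy birthday sis", "new phone who dis"]
about_location = [1, 1, 1, 1, 0, 0]

clf = make_pipeline(CountVectorizer(), MultinomialNB())
clf.fit(posts, about_location)
print(clf.predict(["best burger in town", "happy birthday again"]))  # [1 0]
```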
- On Binary Reduction of Large-scale Multiclass Classification Problems
Bikash Joshi (University Grenoble Alpes – LIG), Massih-Reza Amini (University Grenoble Alpes – LIG), Ioannis Partalas (University Grenoble Alpes – LIG), Liva Ralaivola (University Aix-Marseille – LIF), Nicolas Usunier (Facebook Research) and Eric Gaussier (University Grenoble Alpes – LIG)
Abstract
In the context of large-scale problems, traditional multiclass classification approaches have to deal with class imbalance and complexity issues that make them inoperative in some extreme cases. In this paper we study a transformation that reduces the initial multiclass classification of examples into a binary classification of pairs of examples and classes. We present generalization error bounds that exhibit the interdependency between the pairs of examples and that recover known results on binary classification with i.i.d. data. We show the efficiency of the resulting algorithm compared to state-of-the-art multiclass classification strategies on two large-scale document collections, especially in the interesting case where the number of classes becomes very large.
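The reduction can be sketched as follows: every (example, class) pair becomes a single binary instance, labelled positive exactly when the class is the true one. The joint features below (centroid interactions) are an invented assumption of this sketch; the paper uses its own dyadic representation of pairs.

```python
import numpy as np

def binary_reduction(X, y, class_reps):
    """Turn multiclass examples into binary pairs: each (example, class)
    pair is one instance, labelled +1 if the class is the true one."""
    pairs, labels = [], []
    for x, yi in zip(X, y):
        for k, rep in enumerate(class_reps):
            pairs.append(np.concatenate([x * rep, [np.dot(x, rep)]]))
            labels.append(1 if k == yi else -1)
    return np.array(pairs), np.array(labels)

# Class representations, e.g. class centroids of the training data.
X = np.array([[1.0, 0.0], [0.0, 1.0], [0.9, 0.1]])
y = np.array([0, 1, 0])
reps = np.array([X[y == k].mean(axis=0) for k in (0, 1)])
P, L = binary_reduction(X, y, reps)
print(P.shape, L)  # (6, 3) [ 1 -1 -1  1  1 -1]
```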
- Class-based outlier detection: staying zombies or awaiting for resurrection?
Leona Nezvalova (KD Lab, Faculty of Informatics, Masaryk University), Lubos Popelinsky (KD Lab, Faculty of Informatics, Masaryk University), Luis Torgo (F.Sci. U.Porto) and Karel Vaculik (KD Lab, Faculty of Informatics, Masaryk University)
Abstract
This paper addresses the task of finding outliers within each class in the context of supervised classification problems. Class-based outliers are cases that deviate too much with respect to the cases of the same class. We introduce a novel method for outlier detection in labelled data based on Random Forests and compare it with existing methods on both artificial and real-world data. We show that it is competitive with the existing methods and sometimes gives more intuitive results. We also provide an overview of outlier detection in labelled data. The main contributions are two methods for class-based outlier description and interpretation.
- Data analytics and optimisation for assessing a ride sharing system
Vincent Armant (The Insight Center for Data Analytics), John Horan (Insight Centre), Nahid Mahbub (The Insight Center for Data Analytics) and Ken Brown (University College Cork)
Abstract
Ride-sharing schemes attempt to reduce road traffic by matching prospective passengers to drivers with spare seats in their cars. To be successful, such schemes require a critical mass of drivers and passengers. In currently deployed implementations, the possible matches are based on heuristics, rather than real route times or distances. In some cases, the heuristics propose infeasible matches; in others, feasible matches are omitted. Poor ride matching is likely to deter participants from using the system. We develop a constraint-based model for acceptable ride matches which incorporates route plans and time windows. Through data analytics on a history of advertised schedules and agreed shared trips, we infer parameters for this model that account for 90% of agreed trips. By applying the inferred model to the advertised schedules, we demonstrate that there is an imbalance between drivers and passengers. We assess the potential benefits of persuading existing drivers to switch to becoming passengers if appropriate matches can be found, by solving the inferred model with and without switching. We demonstrate that flexible participation has the potential to reduce the number of unmatched participants by up to 80%.
- Time Series Classification with Representation Ensembles
Rafael Giusti (Universidade de São Paulo), Diego Silva (USP) and Gustavo Batista (USP)
Abstract
Time series analysis has attracted an enormous amount of attention in recent years, with thousands of methods for diverse tasks such as classification, clustering, prediction, and anomaly and motif detection. Among all these tasks, classification is likely the most prominent, accounting for most of the applications and attention from the research community. However, in spite of the huge number of methods available, there is a significant body of empirical evidence indicating that the 1-nearest-neighbor algorithm in the time domain is “extremely difficult to beat”. In this paper, we evaluate the use of different data representations in time series classification. Our work is motivated by methods used in related areas such as signal processing and music retrieval, where a change of representation frequently reveals features that are not apparent in the original data representation. The approach consists of using different representations, such as frequency, wavelets, and autocorrelation, to transform the time series into alternative decision spaces. A classifier is then used to provide a classification for each test time series in the alternative domain. We investigate how features provided in different domains can help in time series classification. We also experiment with different ensembles to investigate whether the data representations are a good source of diversity for time series classification. Our extensive experimental evaluation approaches the issue of combining sets of representations and ensemble strategies, resulting in over 300 ensemble configurations.
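As a minimal sketch of the representation idea: transform each series into frequency and autocorrelation spaces, run 1-NN in each space, and combine by majority vote. Equal-length series and this particular voting rule are assumptions; the paper evaluates many more representations and ensemble strategies.

```python
import numpy as np

def representations(ts):
    """Alternative decision spaces for one series: time domain,
    magnitude spectrum (frequency domain), and autocorrelation."""
    centred = ts - ts.mean()
    ac = np.correlate(centred, centred, mode="full")
    ac = ac[ac.size // 2:] / ac[ac.size // 2]       # non-negative lags
    return {"time": ts, "freq": np.abs(np.fft.rfft(ts)), "autocorr": ac}

def predict_1nn(train, labels, query):
    """Majority vote over per-representation 1-NN classifiers."""
    q = representations(query)
    votes = []
    for domain in ("time", "freq", "autocorr"):
        dists = [np.linalg.norm(representations(t)[domain] - q[domain])
                 for t in train]
        votes.append(labels[int(np.argmin(dists))])
    return max(set(votes), key=votes.count)
```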
Horizon talks
- Towards a Data Science Collaboratory
Joaquin Vanschoren (K.U.Leuven), Bernd Bischl (Ludwig-Maximilians-University Munich), Frank Hutter (Albert-Ludwigs-Universität Freiburg), Michèle Sebag (Université Paris Sud), Balázs Kégl (Université Paris Sud), Matthias Schmid (Universität Bonn), Giulio Napolitano (Universität Bonn), Katy Wolstencroft (Leiden University), Alan R. Williams (University of Manchester) and Neil Lawrence (University of Sheffield)
Abstract
Data-driven research requires many people from different domains to collaborate efficiently. The domain scientist collects and analyzes scientific data, the data scientist develops new techniques, and the tool developer implements, optimizes and maintains existing techniques to be used throughout science and industry. Today, however, this data science expertise lies fragmented in loosely connected communities and scattered over many people, making it very hard to find the right expertise, data and tools at the right time. Collaborations are typically small, and cross-domain knowledge transfer through the literature is slow. Although progress has been made, it is far from easy for one community to build on the latest results of another and collaborate effortlessly across domains. This slows down data-driven research and innovation, drives up costs and exacerbates the risks associated with the inappropriate use of data science techniques. We propose to create an open, online collaboration platform, a 'collaboratory' for data-driven research, that brings together data scientists, domain scientists and tool developers on the same platform. It will enable data scientists to evaluate their latest techniques on many current scientific datasets, allow domain scientists to discover which techniques work best on their data, and engage tool developers to share in the latest developments. It will change the scale of collaborations from small to potentially massive, and from periodic to real-time. This will be an inclusive movement operating across academia, healthcare, and industry, and will empower more students to engage in data science.
- When Learning Indeed Changes the World: Diagnosing Prediction-Induced Drift
Georg Krempl (Knowledge Management and Discovery Lab, University Magdeburg), David Bodnar (University Magdeburg) and Anita Hrubos (University Magdeburg)
Abstract
A fundamental assumption underlying many prediction systems is that they act as an invisible observer without interfering with the evolution of the population under study. More formally, distributions are assumed to be independent of the system's previous predictions. Nevertheless, this is violated when the predictor faces intelligent and malevolent adversaries who counteract the classification rules, or when a classification as high-risk becomes a self-fulfilling prophecy. We present a unified framework to study self-defeating and self-fulfilling prophecies and to assess such prediction-induced drift. Our results on synthetic datasets are promising but highlight two major challenges: first, while the majority of prediction-induced drifts are correctly detected, the detection of self-fulfilling prophecies seems more difficult and requires further research. Second, this analysis requires knowing the labels assigned by prediction systems, which are currently not published in real-world datasets. Thus, this presentation aims to motivate the community to collect, share and analyse such data on prediction-induced drift.
- The Data Problem in Data Mining
Albrecht Zimmermann
Abstract
Computer science is essentially an applied or engineering science, creating tools. In Data Mining, those tools are supposed to help humans understand large amounts of data and produce actionable insight. In this talk, I argue that for all the progress that has been made in Data Mining, and in particular in Pattern Mining, we lack understanding of key aspects of the performance and results of pattern mining algorithms. I will focus particularly on the difficulty of deriving actionable knowledge from patterns. I trace the lack of progress on those questions to a lack of data with varying, controlled properties, and argue that we will need to make a science of digital data generation, and use it to develop guidance for data practitioners.