MLDAS

2016 Symposium Invited Speakers' Abstracts

Data, Knowledge and Discovery: Machine Learning meets Natural Science
Hugh Durrant-Whyte - University of Sydney

Abstract: Increasingly it is data, vast amounts of data, that drives scientific discovery. At the heart of this so-called “fourth paradigm of science” is the rapid development of large-scale statistical data fusion and machine learning methods. While these developments in “big data” methods are largely driven by commercial applications such as internet search or customer modelling, the opportunity for applying them to scientific discovery is huge. This talk will describe a number of applied machine learning projects addressing real-world inference problems in the physical, life and social sciences. In particular, I will describe a major Science and Industry Endowment Fund (SIEF) project, in collaboration with NICTA and Macquarie University, that aims to apply machine learning techniques to discovery in the natural sciences. The talk will look at the key machine learning methods being applied to the discovery process, especially in areas such as geology, ecology and biology.

Bio: Hugh Durrant-Whyte is a Professor and ARC Federation Fellow at the University of Sydney. From 2010 to 2014, he was CEO of National ICT Australia (NICTA), and from 1995 to 2010 Director of the ARC Centre of Excellence for Autonomous Systems and of the Australian Centre for Field Robotics (ACFR). He has published over 350 research papers and founded four successful start-up companies. He has won numerous awards and prizes for his work, including being named the 2008 Professional Engineer of the Year by the Institute of Engineers Australia Sydney Division and the 2010 NSW Scientist of the Year. He is a Fellow of the Australian Academy of Science (FAA) and a Fellow of the Royal Society of London (FRS).

Scaling log-linear analysis to datasets with thousands of variables
Geoff Webb - Monash University

Abstract: Association discovery is a fundamental data mining task. The primary statistical approach to association discovery between variables is log-linear analysis. Classical approaches to log-linear analysis do not scale beyond about ten variables. By melding the state of the art in statistics, graphical modeling, and data mining research, we have developed efficient and effective algorithms that perform log-linear analysis of datasets with thousands of variables in seconds. They provide a powerful, statistically sound method for creating compact models of complex high-dimensional multivariate distributions.

Bio: Geoff Webb is a leading data scientist. He was editor in chief of the premier data mining journal, Data Mining and Knowledge Discovery, from 2005 to 2014. He has been Program Committee Chair of the two top data mining conferences, ACM SIGKDD and IEEE ICDM, as well as General Chair of ICDM. He is the Director of the Monash University Center for Data Science. He is a Technical Advisor to BigML Inc, which is incorporating his best-of-class association discovery software, Magnum Opus, into its cloud-based Machine Learning service. He developed many of the key mechanisms of support-confidence association discovery in the late 1980s. His OPUS search algorithm remains the state of the art in rule search. He pioneered multiple research areas as diverse as black-box user modelling, interactive data analytics and statistically sound pattern discovery. He has developed many useful machine learning algorithms that are widely deployed. He received the 2013 IEEE Outstanding Service Award and a 2014 Australian Research Council Discovery Outstanding Researcher Award, and is an IEEE Fellow.

Automating Clinical Evidence Synthesis via Machine Learning and Natural Language Processing
Byron Wallace - University of Texas - Austin

Abstract: Evidence-based medicine (EBM) looks to inform patient care with the totality of the available evidence. Systematic reviews, which statistically synthesize the entirety of the biomedical literature pertaining to a specific clinical question, are the cornerstone of EBM. These reviews are critical to modern healthcare, informing everything from national health policy to bedside decision-making. But conducting systematic reviews is extremely laborious and hence expensive. Producing a single review requires thousands of expert hours. Moreover, the exponential expansion of the biomedical literature base has imposed an unprecedented burden on reviewers, thus multiplying costs. Researchers can no longer keep up with the primary literature, and this hinders the practice of evidence-based care. I will discuss recent work on machine learning and natural language processing approaches that look to optimize the practice of EBM and thus mitigate the burden on reviewers. Specifically, I will describe a method for automatic identification of clinically salient information in full-text articles (descriptions of the population, interventions and outcomes studied; collectively referred to as PICO elements). I will also describe work on semi-automating the important step of assessing clinical trials for risks of bias. These tasks pose challenging problems from a machine learning vantage point, motivating the development of novel approaches. For example, I will describe (1) a new framework for distantly supervised learning that we introduce for PICO identification, and (2) a hierarchical multi-task learning approach motivated by our work on automating risk of bias assessments. I will present evaluations of these methods in the context of EBM. Finally, I will highlight promising directions toward automating evidence synthesis, including hybrid crowd-sourced/machine learning systems.

Arabesque: A System for Distributed Graph Mining
Marco Serafini - QCRI

Abstract: Distributed data processing platforms such as MapReduce and Pregel have substantially simplified the design and deployment of certain classes of distributed graph analytics algorithms. However, these platforms are not a good match for distributed graph mining problems, such as finding frequent subgraphs in a graph. Given an input graph, these problems require exploring a very large number of subgraphs and finding patterns that match some “interestingness” criteria desired by the user. These algorithms are very important for areas such as social networks, semantic web, and bioinformatics. This talk will present Arabesque, the first distributed data processing platform for implementing graph mining algorithms. Arabesque automates the process of exploring a very large number of subgraphs. It defines a high-level filter-process computational model that simplifies the development of scalable graph mining algorithms: Arabesque explores subgraphs and passes them to the application, which must simply compute outputs and decide whether the subgraph should be further extended. The Arabesque API has been used to produce distributed solutions to three fundamental graph mining problems: frequent subgraph mining, counting motifs, and finding cliques. These implementations require a handful of lines of code, scale to trillions of subgraphs, and represent in some cases the first available distributed solutions.
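The filter-process idea can be illustrated with a small single-machine sketch. This is not Arabesque's actual API: the names `explore`, `is_clique`, and `find_cliques` are hypothetical, the graph is a plain adjacency dictionary, and nothing here is distributed. It only shows the division of labour described in the abstract, where the system enumerates subgraphs and the application supplies a filter (should this subgraph be kept and extended?) and a process step (what output does it produce?).

```python
from typing import Callable, Dict, FrozenSet, List, Set, Tuple

Graph = Dict[int, Set[int]]        # undirected graph as adjacency sets
Subgraph = FrozenSet[int]          # a subgraph identified by its vertex set


def explore(
    graph: Graph,
    filter_fn: Callable[[Graph, Subgraph], bool],
    process_fn: Callable[[Graph, Subgraph], None],
    max_size: int,
) -> None:
    """Hypothetical exploration loop: grow connected vertex sets level by level."""
    frontier: Set[Subgraph] = {frozenset([v]) for v in graph}
    while frontier:
        next_frontier: Set[Subgraph] = set()
        for sub in frontier:
            if not filter_fn(graph, sub):
                continue               # pruned: neither processed nor extended
            process_fn(graph, sub)     # application computes its outputs
            if len(sub) < max_size:
                # Extend by one neighbouring vertex; the set removes duplicates.
                for v in sub:
                    for u in graph[v]:
                        if u not in sub:
                            next_frontier.add(sub | {u})
        frontier = next_frontier


def is_clique(graph: Graph, sub: Subgraph) -> bool:
    """Filter: keep a vertex set only if every pair of its vertices is adjacent."""
    return all(u in graph[v] for v in sub for u in sub if u != v)


def find_cliques(graph: Graph, size: int) -> List[Tuple[int, ...]]:
    """Application code: collect all cliques with exactly `size` vertices."""
    found: List[Tuple[int, ...]] = []

    def process(_graph: Graph, sub: Subgraph) -> None:
        if len(sub) == size:
            found.append(tuple(sorted(sub)))

    explore(graph, is_clique, process, max_size=size)
    return sorted(found)


# Toy graph: a triangle 1-2-3 with a pendant vertex 4 attached to 3.
toy_graph: Graph = {1: {2, 3}, 2: {1, 3}, 3: {1, 2, 4}, 4: {3}}
triangles = find_cliques(toy_graph, 3)
```

The clique filter is safe to use for pruning because every subset of a clique is itself a clique, so discarding a non-clique never discards a subgraph that could later grow into one.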

Bio: Marco Serafini is a Scientist at the Qatar Computing Research Institute, where he develops programming abstractions and systems for scalable graph search, exploration, and mining. He also works on elasticity and load balancing for real-time distributed data management systems, as well as on distributed coordination. His work has appeared in major conferences such as VLDB, SOSP, NSDI, ICDE, and PODC. He serves or has served as PC member of VLDB, ICDE, Eurosys, ICDCS, and WWW, among others, and he co-chaired the PaPoC workshop, which is co-located with Eurosys. Before QCRI he was with Yahoo! Research, where he worked on the Zookeeper coordination system. Marco got his PhD from TU Darmstadt, Germany.

Analytics for Aircraft Health Maintenance
James Schimert and Rodney Tjoelker - Boeing Commercial Airplanes

Abstract: Large amounts of data are generated in the support and operation of aircraft. Maintenance activities, flight sensor data, and engineering data can be used to monitor and improve the operations and performance of aircraft. This talk describes challenges and solutions for analysis of messy text data from maintenance logs and how this helps create an airplane health monitoring solution. We also present challenges in the analysis of flight data from thousands of sensors on the aircraft to improve airplane health monitoring. These challenges present opportunities for future research and development.

Recommendation in Citation Networks
Mohammed Zaki - Rensselaer Polytechnic Institute

Abstract: Finding a relevant set of publications for a given topic of interest is a challenging problem. Researchers often use several queries on different bibliographic databases, as well as various other means, to collect such a set. We propose a two-stage query-dependent approach for retrieving relevant papers, authors, venues, and related topics given a keyword-based query. In the first stage, we utilize content similarity to select an initial seed set of publications, which we then augment by following citation links that are weighted with information such as citation context relevance and age-based attenuation. In the second stage, we construct a multi-layer graph that expands the publications subgraph by including links to the authors, venues, and keywords. This allows us to return recommendations for entities that are both highly authoritative and textually related to the query. We show that our staged approach gives superior results on three different benchmark query sets.
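The first stage can be sketched roughly as follows. The abstract does not specify the similarity measure or the attenuation function, so this sketch substitutes simple stand-ins: Jaccard overlap of keyword sets for content similarity, and an exponential decay with a hypothetical `half_life` parameter for age-based attenuation. Citation context relevance and the second-stage multi-layer graph are omitted. All names and the toy data are illustrative assumptions, not the authors' method.

```python
from typing import Dict, List, Set


def jaccard(a: Set[str], b: Set[str]) -> float:
    """Stand-in content similarity: overlap of two keyword sets."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def seed_and_expand(
    query: Set[str],
    papers: Dict[str, dict],          # paper id -> {"terms": set, "year": int}
    cites: Dict[str, List[str]],      # paper id -> ids of papers it cites
    current_year: int,
    seed_size: int = 2,
    half_life: float = 5.0,           # assumed decay constant, in years
) -> Dict[str, float]:
    # Select the seed set: the papers most similar to the query.
    sim = {pid: jaccard(query, p["terms"]) for pid, p in papers.items()}
    seeds = sorted(sim, key=lambda pid: (-sim[pid], pid))[:seed_size]
    scores = {pid: sim[pid] for pid in seeds}

    # Augment the seed set by following citation links from each seed.
    # Each link is weighted by the seed's similarity and attenuated by
    # the age of the cited paper.
    for seed in seeds:
        for cited in cites.get(seed, []):
            age = current_year - papers[cited]["year"]
            weight = 0.5 ** (age / half_life)
            scores[cited] = scores.get(cited, 0.0) + sim[seed] * weight
    return scores


# Toy data (entirely made up for illustration).
papers = {
    "A": {"terms": {"graph", "mining"}, "year": 2015},
    "B": {"terms": {"graph", "search"}, "year": 2014},
    "C": {"terms": {"databases"}, "year": 2010},
    "D": {"terms": {"vision"}, "year": 2005},
}
cites = {"A": ["C"], "B": ["D"]}
scores = seed_and_expand({"graph", "mining"}, papers, cites, current_year=2015)
```

In this toy example, paper C enters the candidate set only through a citation from a strongly matching seed, and it outranks D because it is cited by a more relevant seed and is more recent.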

Bio: Mohammed J. Zaki is a Professor of Computer Science at RPI. He was also a Principal Scientist at QCRI from 2013 to 2015. He received his Ph.D. degree in computer science from the University of Rochester in 1998. His research interests focus on developing novel data mining techniques, especially for applications in bioinformatics and social networks. He has authored over 225 publications in data mining and bioinformatics, including the Data Mining and Analysis textbook published by Cambridge University Press in 2014. He is currently Area Editor for Statistical Analysis and Data Mining, and an Associate Editor for Data Mining and Knowledge Discovery, ACM Transactions on Knowledge Discovery from Data, and Social Networks and Mining. He was the program co-chair for SDM'08, SIGKDD'09, PAKDD'10, BIBM'11, CIKM'12, ICDM'12, and IEEE BigData'15. He is currently serving on the Board of Directors for ACM SIGKDD. He received an NSF CAREER Award in 2001 and a US DOE Career Award in 2002. He is a senior member of the IEEE and an ACM Distinguished Scientist.