Computational problems in mining urban data
Aristides Gionis - Aalto University
Abstract: With the fast growth of smart devices and sensor networks, large amounts of data are collected recording location, activity, and mobility of people living in urban environments. Additionally, data generated on location-aware social media provide rich information about places where people spend their time (shopping malls, cafés, parks, etc.). The availability of this type of data provides novel opportunities for developing methods for extracting interesting patterns, detecting trends, modeling people's behavior, and eventually building intelligent systems that improve the interaction of citizens with their cities and help them make better use of the available resources. In this talk we will review some of our work in the area of mining urban data. We formulate and discuss computational problems motivated by applications in detecting events, mining trajectories, modeling city neighborhoods, and recommending locations for groups.
Bio: Aristides Gionis is a professor in the Department of Computer Science at Aalto University. Previously, he was a senior research scientist at Yahoo! Research. He is currently serving as an action editor of the Data Mining and Knowledge Discovery journal (DMKD), an associate editor of the ACM Transactions on Knowledge Discovery from Data (TKDD), and a managing editor of Internet Mathematics. He has made contributions in several areas of data science, such as graph mining, social-media analysis, web mining, data clustering, and privacy-preserving data mining.
BreathPrint: Breathing Acoustics Based User Authentication
Aruna Prasad Seneviratne - University of New South Wales
Abstract: This talk will present BreathPrint, a new behavioural biometric signature based on audio features derived from an individual’s commonplace breathing gestures. BreathPrint uses the audio signatures associated with three individual gestures (sniff, normal, and deep breathing), which are sufficiently different across individuals. Using these three breathing gestures, we will describe how a processing pipeline can be developed to identify users via the microphone on a smartphone or wearable device. We show that users can be authenticated reliably, with an accuracy of over 94% for all three breathing gestures in intra-session settings, and that the deep breathing gesture provides the best overall balance between true positives (successful authentication) and false positives (resiliency to directed impersonation and replay attacks). Moreover, we show that this breathing-sound-based biometric is robust to some typical changes in both physiological and environmental context, and that it can be applied on multiple smartphone platforms. In addition, we will show the feasibility of using RNNs for an end-to-end authentication system based on breathing acoustics.
Bio: Aruna Seneviratne is the Foundation Chair of Telecommunications at the University of New South Wales (Australia), where he holds the Mahanakorn Chair of Telecommunications. Prior to that, he was the Director of the Cyber Physical Systems research program at Data61, CSIRO, and the Director of the Australian Technology Park Laboratory of NICTA, Australia’s Information and Communications Technology (ICT) Centre of Excellence.
He has also worked at a number of other universities in Australia, the UK, and France, and at industrial organizations including Avaya Labs and Telstra. In addition, he has held visiting appointments at INRIA (France) and has been awarded a number of fellowships, including one at British Telecom and one at Telecom Australia Research Labs.
His area of research is physical analytics, which focuses on the analysis of the physical actions of humans and machines. He is particularly interested in how physical analytics can be used to guarantee the security and privacy of users. He has published over 200 refereed technical papers and book chapters, and has supervised 40 Ph.D. dissertations. In his research, he has made fundamental contributions to the understanding of mobility in IP networks and to the design and deployment of the next generation of the internet. He has a PhD from the University of Bath, UK, and is a Fellow of the Australian Institute of Engineers.
Some Directions for Exploration
Benjamin Van Roy - Stanford University
Abstract: Most machine learning work focusses on extracting value from data. The topic of exploration addresses data acquisition. Deliberately gathering "smart data" rather than relying on passively accumulated "big data" can make an enormous difference. This talk will discuss a few directions of ongoing research:
+ Information-Theoretic Foundations for Exploration
+ Active Learning with Neural Networks
+ Deep Exploration in Reinforcement Learning
+ Coordinated Exploration in Concurrent Reinforcement Learning
Bio: Benjamin Van Roy is a Professor at Stanford University, where he has served on the faculty since 1998. He is an INFORMS Fellow, serves as the Learning Theory Area Editor of Mathematics of Operations Research, and has served as the Financial Engineering Area Editor for Operations Research. He has also served on the editorial boards of Machine Learning and the INFORMS Journal on Optimization. He has led research programs at several technology companies, including Unica (acquired by IBM), Enuvis (acquired by SiRF), and Morgan Stanley. He received his SB, SM, and PhD from MIT.
Wavelet-based Scaling Indices for Machine Learning
Branislav Vidakovic - Georgia Institute of Technology
Abstract: Massive data sets, functional data, and high-frequency sampled processes intrinsically invariant to changes in scale are routinely observed and stored. The catchphrase "scaling is omnipresent" certainly holds for many high-frequency time series and high-resolution images. General multiscale domains provide an environment for analyzing, describing, and modeling data that scale, and for unifying several related attributes describing regularity, fractality, multi-fractality, self-similarity, and long memory.
In this talk we focus on the wavelet-based estimation of scaling indices. In particular, we focus on nondecimated, scale-mixing, complex-valued decompositions. They result in a hierarchy of embedded multiresolution subspaces that lead to multiscale spectra. As with Fourier transforms, where the rate of linear decay of the log-power spectra over the frequencies characterizes the regularity/smoothness of a time series/image, the slopes in the regression of the log-averages of squared wavelet coefficients on the scale index lead to alternative and arguably more local and stable descriptors of signal/image regularity. Such descriptors, used as inputs to ML procedures, provide additional discriminatory power.
We discuss examples from medical diagnostics, genetics, finance, and geosciences in which the scaling indices turn out to be useful in tasks of supervised learning. In the talk we also give an overview of some interesting results from the ongoing research of the speaker and his team. We will point out several avenues for possible future research. The talk is aimed at nonspecialists in wavelets and a broader scientific audience.
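As a rough illustration of the slope-based descriptor mentioned in the abstract, the sketch below regresses the log2 average of squared wavelet coefficients on the scale index. It is a minimal example under stated assumptions, not the speaker's method: it swaps in an ordinary decimated Haar transform for the nondecimated, complex-valued decompositions of the talk, and the function names (`haar_log_spectrum`, `scaling_slope`) and the `min_coeffs` cutoff are hypothetical choices made only for this example.

```python
import numpy as np


def haar_log_spectrum(signal, min_coeffs=32):
    """Return (levels, log2 of the mean squared Haar detail coefficient).

    Level 1 is the finest scale and higher levels are coarser. Levels with
    fewer than `min_coeffs` coefficients are skipped (an assumption made
    here because averages over very few coefficients are noisy).
    """
    approx = np.asarray(signal, dtype=float)
    n = approx.size
    if n < 2 or n & (n - 1):
        raise ValueError("signal length must be a power of two")
    levels, log_energy = [], []
    level = 1
    while approx.size // 2 >= min_coeffs:
        even, odd = approx[0::2], approx[1::2]
        detail = (even - odd) / np.sqrt(2.0)   # Haar detail coefficients
        approx = (even + odd) / np.sqrt(2.0)   # Haar smooth coefficients
        levels.append(level)
        log_energy.append(np.log2(np.mean(detail ** 2)))
        level += 1
    return np.array(levels), np.array(log_energy)


def scaling_slope(signal, min_coeffs=32):
    """Least-squares slope of the log2 energies against the level index."""
    levels, log_energy = haar_log_spectrum(signal, min_coeffs)
    slope, _intercept = np.polyfit(levels, log_energy, 1)
    return slope


rng = np.random.default_rng(0)
white_noise = rng.standard_normal(2 ** 14)
random_walk = np.cumsum(white_noise)
print("white noise slope:", round(scaling_slope(white_noise), 2))
print("random walk slope:", round(scaling_slope(random_walk), 2))
```

With levels indexed from fine to coarse as above, white noise should give a slope close to 0 and a random walk a slope close to 2. Under this indexing, a self-similar process with Hurst exponent H is expected to give a slope of roughly 2H + 1, so the slope can be passed to a classifier as a single regularity feature.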
Bio: Brani Vidakovic is Professor of Statistics at Georgia Institute of Technology, Atlanta, GA. At Georgia Tech he is affiliated with the School of Industrial and Systems Engineering and the Department of Biomedical Engineering. He is also a jointly appointed professor with the School of Public Health at Emory University.
Dr. Vidakovic holds BS and MS degrees in mathematics from Belgrade University, Serbia, and a Ph.D. degree in statistics from Purdue University. He was an Assistant and Associate Professor of Statistics at Duke University prior to joining Georgia Tech.
Dr. Vidakovic's research interests include Bayesian statistics, statistical modeling in wavelet domains, statistical analysis of signals/images, as well as geoscientific and biomedical statistical applications.
Dr. Vidakovic has authored numerous journal articles and published a monograph on the statistical use of wavelets, as well as several textbooks and edited volumes. He served as an editor-in-chief for the second edition of Wiley's Encyclopedia of Statistical Sciences.
Dr. Vidakovic is a member of several professional societies, a fellow of the American Statistical Association, and an elected member of the International Statistical Institute.
Learning from Time Series Sensor Data
David Hallac - Stanford University
Abstract: Many applications, ranging from automobiles to financial markets and wearable sensors, generate large amounts of time series data. In most cases, this data is multivariate and heterogeneous, where the readings come from various types of entities, or sensors. These time series datasets are often sparse, unlabeled, dynamic, and difficult to interpret. Therefore, there is a need for methods that learn interpretable structure from such data, especially for methods that can apply across many different domains. In this talk, I will discuss several approaches for analyzing time series data, as well as future directions of research in this field, incorporating different research areas ranging from distributed convex optimization to deep learning.
Bio: David is a 5th-year PhD student in the Electrical Engineering department at Stanford University, working with Jure Leskovec and Stephen Boyd. His research is focused on scalable optimization methods. He has worked with datasets ranging from automobile sensors to protein-protein interaction networks, developing mathematical models for finding patterns in complex data and implementing these methods in high-performance solvers. Prior to Stanford, he received his B.S. in Electrical Engineering from the University of Pennsylvania.
David Betz - Boeing Global Services
Machine Learning that Works
Francisco Martin - BigML, Inc.
Abstract: In the last few years, we have seen Machine Learning quickly moving from academia to industry. However, most Machine Learning tools have been created by scientists for scientists, limiting their use to highly qualified experts. These tools are not only very complicated to use but also neglect the fundamental pillars necessary to create and operate end-to-end Machine Learning workflows: traceability, repeatability, data transformations, feature engineering, scaling, real-time scoring, monitoring, and retraining. In sharp contrast, BigML, founded in 2011, has been methodically building a platform that abstracts away the complexities of Machine Learning, making Machine Learning beautifully simple for everyone. In this talk, I will describe the three fundamental design principles that have been driving BigML's evolution over the past 7 years: consumability, programmability, and scalability.
Bio: Francisco is the CEO at BigML, Inc where he helps conceptualize, design, architect, and implement BigML's distributed Machine Learning platform. Formerly, Francisco founded and led Strands, Inc, a company that pioneered Behavior-based Recommender Systems. Previously, he founded and led Intelligent Software Components, SA (iSOCO), the first spin-off of the Spanish National Research Council (CSIC). He holds a 5-year degree in Computer Science, a Ph.D. in Artificial Intelligence, and a post-doc in Machine Learning. He is the holder of 20+ patents in the areas of Recommender Systems and Distributed Machine Learning.
Opportunities and Limitations of Face Detection Technology
Jisun An - Qatar Computing Research Institute (QCRI) - HBKU
Abstract: In recent years face detection technology has been rapidly applied to many areas, from practical applications to social science research. In particular, face detection technologies are considered key elements in computational social science, as they allow the demographic attributes of online users to be inferred easily and quickly. Demographics are one of the key predictors of human behavior. The life of a 50-year-old African American woman is probably very different from that of a 16-year-old White boy. Hence, the 190-billion-dollar US advertising industry uses demographics to help define consumer segments that can then be targeted through dedicated campaigns.
In this talk, I introduce a series of my computational social science studies using face detection technologies. First, I present the first large-scale study on how different hashtags are used by different demographic groups on Twitter. This work shows that a population-level analysis of hashtags and trends on Twitter is likely to miss the complexities induced by demographic-specific behavior. Second, I provide the first in-depth characterization of news spreaders in social media. Among our main findings, we show that males and white users tend to be more active in terms of sharing news, biasing the news audience to the interests of these demographic groups. Our results also quantify differences in interests of news sharing across demographics, which has implications for personalized news digests. Third, I investigate advertisements collected from social media to see how men and women of different races are depicted regarding simple appearances, cross-sex interactions, and positive portrayals. Our analysis covers 363,613 posts published by 73 international brands on Facebook and Instagram and their 17M comments. I illustrate a diversity map locating brands based on their gender and racial diversity in a single figure. As it is fully generated without additional human effort, it can work as a watchdog to show the current practice of the brands' advertising on social media. We also show that there is evidence for resonance between the demographics depicted in a particular post and those of the engaging users. Finally, I provide a comprehensive measurement study of four widely used face detection tools (Face++, IBM Bluemix Visual Recognition, AWS Rekognition, and Microsoft Azure Face API), using multiple datasets to assess their accuracy and bias.
Bio: Dr. Jisun An is a scientist at Qatar Computing Research Institute, HBKU. She received her Ph.D. in Computer Science from the University of Cambridge, UK in 2015. She conducts interdisciplinary research connecting Computer Science and Journalism and Public opinion. She works on applying statistical methods and machine learning techniques in capturing public opinion from social media and recently she has been focused on bias and diversity in social media data. She has been a member of the PC of major computer science and computational social science conferences, including ICWSM 2012-18, WWW 2016-18, and SocInfo 2014-17.
Making Data Meaningful to People through Stories
Larry Birnbaum - Northwestern University
Abstract: The astounding growth in data gathering, processing, storage, and networking capabilities over the past decades has opened the prospect of revolutionary advances in everything from medicine to media - if these data can actually be exploited properly. A key bottleneck is providing insight and understanding based on these data to people who need to make decisions using them. This talk will describe our work on the automatic generation of natural language stories from data, aimed at conveying key insights about those data to people. I will also outline our work on contextual search aimed at helping people find information they need in the moment; on finding important and interesting patterns in data (especially social media); and on the automation of editorial judgment more generally.
Bio: Larry Birnbaum is Professor of Computer Science at Northwestern University, where he is Head of the Computer Science Division and Co-Director of the Intelligent Information Laboratory, with a research focus on applied AI. He and his students build and research projects in natural language processing, intelligent information systems, social media data analytics, machine learning, computational journalism and media, conversational interfaces, and automatic content generation. Together with colleagues and students, Larry has published more than 140 papers on these topics, and holds 39 U.S. patents.
Larry received his BS and PhD degrees in Computer Science from Yale, and was on the faculty there before joining Northwestern.
Larry is also Co-Founder and Chief Scientific Advisor of Narrative Science, an AI start-up that builds and markets technology to automatically generate narratives from data, at scale. The company's goal is simply this: To make the world's data meaningful to people through stories.
Mohamed Mokbel - Qatar Computing Research Institute (QCRI) - HBKU
Abstract: The need to manage and analyze spatial data is hampered by the lack of specialized systems to support such data. System builders mostly build general-purpose systems that are generic enough to handle any kind of attributes. Whenever there is a pressing need for spatial data support, it is treated as an afterthought that can be addressed by adding new data types, extensions, or spatial cartridges to existing systems. This talk advocates for dealing with spatial data as first-class citizens, and for always thinking spatially whenever it comes to system design. This is well justified by the proliferation of location-based applications that rely mainly on spatial data. The talk will go through various system designs and show how they would be different if we had designed them while thinking spatially. Examples of these systems include database systems, big data systems, recommender systems, and crowdsourcing.
Bio: Mohamed Mokbel (Ph.D., Purdue University, MS, B.Sc., Alexandria University) is Chief Scientist at Qatar Computing Research Institute (QCRI). Before joining QCRI, he was a Professor in the Department of Computer Science and Engineering, University of Minnesota. His research interests include the interaction of GIS and location-based services with database systems and cloud computing. His research work has been recognized by the VLDB 10-Years Best Paper Award, five Best Paper Awards, and the NSF CAREER award. Mohamed has held prior visiting positions at Microsoft Research and Hong Kong Polytechnic U., and is a co-founder of the GIS Technology Innovation center in Saudi Arabia. Mohamed is/was the program co-chair for ACM SIGMOD 2018, ACM SIGSPATIAL GIS 2008-2010, and IEEE MDM 2011 and 2014. He is Editor-in-Chief of the Springer Distributed and Parallel Databases journal, and Associate Editor for ACM Books, ACM TODS, ACM TSAS, VLDB journal, and GeoInformatica. Mohamed was an elected Chair of ACM SIGSPATIAL 2014-2017.
Applied Machine Learning
Poul Petersen - BigML, Inc.
Abstract: Most sciences have a schism between the theoretical and the applied, and Machine Learning is no different. While the theoretical side is absolutely necessary to drive innovation of the science in its purest, most intellectual form, the applied side cannot be ignored, as it forms the bridge between the theoretical and the real world, between a great idea and a usable tool.
And while the AI hype shows us all the amazing things being achieved with AI/ML, few bother to pull back the curtain to show all the prior failures, complexity, and nasty details of the actual implementation. The truth is, there is no ML algorithm which goes from publication to the real world without being modified, specialized, and most likely added as part of a much more complicated workflow. A workflow which in aggregate we might call a Predictive Application.
And so, the path forward, that is, to the wide adoption of Machine Learning to solve real-world problems, requires tools which commoditize the theoretical into easy-to-consume resources that can be connected into these more complicated workflows. And ideally, these tools make the process easy, repeatable, consumable, and able to handle the nuances of ugly real-world data.
In this talk, we will take a look at some real-world predictive applications and pull back the curtain on the ugly details of how they work. And then we will build a predictive application, step-by-step, showing how tools like BigML make this process easy.
Bio: Poul is Chief Infrastructure Officer at BigML. He has an MS degree in Mathematics as well as BS degrees in Mathematics, Physics, and Engineering Physics. With more than 20 years of experience building scalable and fault-tolerant systems in data centers, Poul currently enjoys the benefits of programmatic infrastructure, hacking in Python to run BigML with only a laptop and a cloud.
Regularized Gradient Boosting Machines for Inferring Gene Regulatory Networks and Applications in Cancer
Raghvendra Mall - Qatar Computing Research Institute (QCRI) - HBKU
Abstract: Transcription factors (TFs) that regulate gene expression are key determinants of cellular phenotypes. Reconstructing large-scale genome-wide networks capturing the influence of TFs on target genes is essential for understanding and accurately modeling living cells. In this talk, we present a generic framework in which the gene regulatory network (GRN) inference problem is approached as a supervised learning (feature selection) problem. GRNs obtained using Machine Learning techniques are often dense, whereas real GRNs are rather sparse. We use a Tikhonov-regularization-inspired optimal L-curve criterion that utilizes the edge-weight distribution for a given target gene to determine the optimal set of TFs associated with it. Our proposed framework allows incorporating a priori information in the form of a mechanistic active binding network based on cis-regulatory motif analysis.
We evaluate our regularization framework in conjunction with gradient boosting machines (GBM) resulting in a regularized feature selection based method specifically called RGBM. RGBM has been used to identify the main transcription factors that are causally involved as master regulators of the gene expression signature activated in the FGFR3-TACC3-positive glioblastoma cancer. In another application, we show that RGBM identifies the main regulators of the discrete molecular subtypes of glioblastoma tumors. Our analysis reveals the identity and corresponding biological activities of the master regulators characterizing the difference between G-CIMP-high and G-CIMP-low subtypes and between PA-like and LGm6-GBM, thus providing a clue to the yet undetermined nature of the transcriptional events among these subtypes.
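The idea of trimming a dense set of candidate regulators by looking for a corner in the sorted edge-weight curve can be sketched as follows. This is only a generic elbow heuristic (largest gap below the chord joining the first and last sorted scores), assumed here for illustration; it is not the L-curve criterion used in RGBM, and the function name, the scores, and the `TF_*` labels are all hypothetical.

```python
import numpy as np


def select_before_corner(scores, names):
    """Keep the candidates ranked before the corner of the sorted score curve.

    `scores` are per-regulator edge weights for one target gene (for example,
    feature importances from a boosted model). The corner is taken as the
    point lying farthest below the straight line that joins the largest and
    smallest sorted scores (a generic elbow rule, assumed for this sketch).
    """
    scores = np.asarray(scores, dtype=float)
    order = np.argsort(-scores, kind="stable")    # indices, largest first
    ranked = scores[order]
    spread = ranked[0] - ranked[-1]
    if ranked.size < 3 or spread == 0.0:
        return [names[i] for i in order]          # no corner to find
    x = np.linspace(0.0, 1.0, ranked.size)        # normalized rank
    y = (ranked - ranked[-1]) / spread            # normalized score
    gap = 1.0 - x - y                             # depth below the chord
    corner = max(int(np.argmax(gap)), 1)          # always keep at least one
    return [names[i] for i in order[:corner]]


# Made-up importances for eight hypothetical transcription factors.
importances = [0.02, 0.5, 0.01, 0.3, 0.02, 0.1, 0.02, 0.01]
tf_names = ["TF_a", "TF_b", "TF_c", "TF_d", "TF_e", "TF_f", "TF_g", "TF_h"]
print(select_before_corner(importances, tf_names))
```

On the made-up scores above, the three clearly dominant candidates are kept and the flat tail is dropped, which mirrors the goal of turning a dense importance vector into a sparse regulator set.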
Bio: Dr. Raghvendra obtained his doctorate (Summa Cum Laude) from KU Leuven, Belgium in 2015. His dissertation was on exploring the role of sparsity in large scale machine learning. He is currently working as a Postdoctoral Researcher at QCRI, HBKU, Doha. Lately, his research is focused on developing and utilizing data driven modeling techniques for computational biology with a primary focus on network biology and structural bioinformatics. Specifically, he is interested in problems like differential network analysis, gene regulatory network inference, master regulator analysis and disease module identification in biological networks. Several of his works are published in premier bioinformatics journals such as Nucleic Acids Research, Bioinformatics, BMC Systems Biology and he has co-authored papers in Nature and Cell (in press).
Five Pillars for Advancing Applied AI Research
Ramsis Adam - Director of Analytics and Simulation Technology, Boeing
Quantum information from a physics perspective
Sahel Ashhab - Qatar Environment and Energy Research Institute (QEERI) - HBKU
Abstract: In the past few decades, scientists have been finding various ways in which the laws of quantum physics can be used to enhance computational tasks. Although quantum mechanics predicts that a computer should be able to manipulate a large number of inputs in parallel, it also says that only one output can be obtained. As a result, clever algorithms are needed to harness any computational benefit from quantum parallelism. I will talk about the basic laws of quantum mechanics and how they apply to the manipulation of digital information. I will also talk about some of the recent progress on the implementation of quantum computing devices.
Bio: Sahel Ashhab is a Senior Scientist at the Qatar Environment and Energy Research Institute (QEERI). He obtained his Ph.D. in physics at the University of Illinois in 2002 working on the theory of Bose-Einstein condensation in cold atomic gases. He continued this research as a postdoctoral researcher at the Ohio State University. From 2004 until 2013 he was a research scientist at the Institute of Physical and Chemical Research (RIKEN), Japan, where his research focused mainly on quantum phenomena in electric circuits, including some of their applications for quantum information processing. Since joining QEERI in 2013, his research has focused on quantum phenomena in energy applications.
Human Centric Machine Learning
Seema Chopra - Boeing Research and Technology, India
Abstract: Machine learning (ML) is widely used to solve complex business problems by building algorithms that are useful in predicting future events. Owing to the empirical nature of most ML models, their results are sometimes difficult to interpret, and the models are hence considered to be a “black box”. This black box nature prevents users from fully understanding how the algorithms arrived at their results and may therefore reduce end customers’ confidence in using ML algorithms developed by data scientists. One way of gaining insight into what is happening inside the black box of ML models is to provide a framework that enables subject-matter experts (SMEs) to look inside the ML models as they are being developed and refined. This framework should be dynamic and interactive, and should have an active feedback mechanism that enables humans to interact with the machine learning models in order to understand and refine them to meet their needs. Such a mechanism will create a real-time interactive ML model that incorporates domain knowledge into the model’s development process as it is being built. This methodology of interactively adding human intelligence, in this case domain knowledge, while building “white box” ML models is termed Human Centric Machine Learning (HCML). It has been observed that ML models that take human intelligence based on domain knowledge as an input perform better and are more reliable for decision-making. This talk will highlight the role of humans in the field of machine learning, which spans from the data exploration phase to model development, and how HCML will address the long-standing need for both human-guided machine learning and human-interpretable results of machine learning.
Bio: Seema works as Technical Lead - Data Analytics in the System and Analytics group at Boeing Research and Technology, India. Her current work includes developing next-generation advanced health management technologies using real-time streaming airline data and big data platforms. Prior to this role, she was with GE as a PHM (Prognostic Health Management) Technical Leader and was involved in designing and developing prognostic health management technologies to enable strategic growth for Condition Based Maintenance for gas turbines. Seema earned her doctorate in Control Engineering from IIT Roorkee, India, where her focus area was the design of fuzzy controllers using intelligent design approaches with a reduced rule set.
Seema has 15+ years of research experience in the area of advanced analytics solutions for different applications. Her research covers areas such as Continuous Analytics, Fault Diagnosis and Prognosis, Real-time Streaming, Big Data platforms, Data Mining, Machine Learning, IVHM, and Control Systems.
Seema is a certified Black Belt (DFSS Lean Six Sigma) and has 30+ publications in various international and national journals and conferences, 8 technical reports, 2 filed patents, and 2 submitted disclosures. She has received several awards for leadership and technical expertise, including the PHM Expertise award from the President & CEO, GE Power Gen Services, and the GE Impact award 2012 from the CEO of GE for volunteering on Mid-Day meal.
Her recent talks as an invited speaker include Airplane Health Management (AHM), Spark and Future of Advanced Analytics, Big Data Analytics, and Continuous Analytics in Flight Data Processing, at organizations like IISC, NAL, HAL, Unicom, UTC Aerospace, Shell, and SAP. She also has additional responsibilities as SWE (Society of Women Engineers) International Ambassador and BWIL (Boeing Women in Leadership) Vice President, helping to grow young women engineers.