MLDAS 2016 - Machine Learning and Data Analytics - Doha, Qatar

MLDAS Contributed Submissions

Scalable and Unified Linear Algebra Algorithms for Mining Large and Complex Attributed Graphs and Networks - Application to Social and Biological Networks. Authors: Abdelkader Baggag*, Qatar Computing Research Institute

Advances in technology have led to a proliferation of large scale datasets, originating from many real world domains that can often be aptly represented in the form of complex relationship or interaction networks in a concise and meaningful fashion. We propose to design and implement a number of fundamental matrix-mining and graph-mining algorithms that lie at the core tasks of analyzing, visualizing, and extracting information from massive matrices and graphs. Our main focus is to formulate a unified linear algebra framework to tackle data mining and machine learning problems in complex enriched networks with rich attributes. The main novelty of our research is to incorporate rich label information on the nodes and edges. Motivating applications will come from large-scale real-world graphs from social networks (e.g., Twitter decahose), and biological networks (e.g., gene expression network, protein interaction).

From Classification to Quantification in Tweet Sentiment Analysis. Authors: Wei Gao, Qatar Computing Research Institute; Fabrizio Sebastiani*, Qatar Computing Research Institute

Classifying tweets according to the sentiment they convey towards a given entity (e.g., a product) has many applications in political science, social science, market research, and many others. In this paper we contend that most previous studies dealing with tweet sentiment classification (TSC) use a suboptimal approach. The reason is that the final goal of most such studies is not estimating the class label (e.g., Positive, Negative, or Neutral) of individual tweets, but estimating the relative frequency (a.k.a. "prevalence", or "prior") of the different classes in the dataset. When approached in a supervised way, the latter task is called "quantification". In this paper we show (by carrying out experiments using seven different quantification-specific algorithms and eleven different TSC datasets) that using quantification-specific algorithms produces substantially better class frequency estimates than a state-of-the-art classification-oriented algorithm routinely used in TSC.
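
The difference between classifying and quantifying can be illustrated with a minimal sketch. The abstract does not name the seven quantification-specific algorithms that were tested, so the two textbook estimators below ("classify and count" and its adjusted variant) are assumptions chosen for illustration, as are the function names and the toy numbers. The adjustment follows from the identity cc = tpr * p + fpr * (1 - p), solved for the true prevalence p.

```python
def classify_and_count(predictions):
    """Naive prevalence estimate: fraction of items predicted positive (labels 0/1)."""
    return sum(predictions) / len(predictions)


def adjusted_classify_and_count(predictions, tpr, fpr):
    """Correct the naive estimate using the classifier's true and false
    positive rates, which are assumed to come from held-out labelled data."""
    cc = classify_and_count(predictions)
    if tpr == fpr:
        # The classifier carries no information; fall back to the raw count.
        return cc
    adjusted = (cc - fpr) / (tpr - fpr)
    # Clip to the valid range of a prevalence.
    return min(1.0, max(0.0, adjusted))


# Hypothetical example: 100 tweets, 40 of them predicted Positive.
predictions = [1] * 40 + [0] * 60
cc_estimate = classify_and_count(predictions)
acc_estimate = adjusted_classify_and_count(predictions, tpr=0.8, fpr=0.2)
```

With these assumed error rates, the raw count of 40% is corrected down to roughly 33%, which shows how a classifier with known biases can misestimate class frequencies even when its per-tweet accuracy is reasonable.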

How Data Analytics will Drive Future Road Safety Applications. Authors: Muhammad Awais Javed*, Qatar Mobility Innovations Center (QMIC); Elyes Ben Hamida, QMIC

Cooperative Intelligent Transportation Systems (C-ITS) are a key component of the future road traffic management system. The ITS stations, including vehicles, roadside units and traffic command centers, generate a large amount of traffic and mobility related data. To analyse this large amount of data and extract useful information from it, data analytics will play a critical role. In this paper, we present the C-ITS architecture and lay out various technical challenges of using data analytics in C-ITS. We also highlight various applications where data analytics can play a vital role and help improve the working of C-ITS. Finally, we present practical simulation results in the NS-3 network simulator on using data analytics to improve the safety awareness of a secure C-ITS.

Analyzing Teaching Effectiveness using Machine Learning Techniques. Authors: Anwar Yahya*, Najran University

Teaching effectiveness is a multidimensional construct in which teacher questioning skill is one of the key indicators. This paper explores the feasibility of applying Machine Learning (ML) to analyze teaching effectiveness using a data set of teachers’ questions. More specifically, the performance of nine ML techniques is investigated for the classification of teachers’ classroom questions into the six Bloom's cognitive levels (BCLs). To do so, the data set has been annotated with BCLs and transformed into a suitable representation, and the ML techniques have been applied under different dimensions. The results confirm the feasibility of using ML for analyzing teaching effectiveness. Moreover, due to the sensitivity of each technique to the curse of dimensionality problem, the performance varies. Most remarkably, the Support Vector Machine and Random Forest techniques show a striking performance, whereas AdaBoost and J48 show a sharp performance deterioration as the dimensionality increases.

Application of Recommender Systems in Emerging Domains. Authors: Manoj Reddy*, University of California, Los Angeles; Junghoo Cho, University of California, Los Angeles

Recommender systems have gained tremendous popularity on consumer platforms in recent years. They are widely employed to recommend movies, news, music, products, etc. These systems enhance user experience by filtering enormous amounts of information based on user interest. Despite the advancements made, there is room for improvement in providing better recommendations. In this paper, we discuss some of the fundamental algorithms and concepts that underlie most recommendation systems. We also discuss upcoming trends that present new challenges and approaches to tackle them. In addition, we present the opportunities that arise by using recommender systems in emerging domains such as sports, education, healthcare and tourism. We focus on these domains since they produce vast amounts of data and recommender systems have the potential to positively impact them. We lay out a roadmap for the implementation of recommender systems for real-world applications in unconventional domains.

Securing the Internet of Vehicles: Machine Learning to the Rescue. Authors: Elyes Ben Hamida*, QMIC; Muhammad Awais Javed, Qatar Mobility Innovations Center (QMIC)

The Internet of Vehicles (IoV) is an emerging concept that consists in the convergence of traditional Vehicular Ad Hoc Networks (VANETs), the mobile Internet and the Internet of Things (IoT). IoV aims at integrating humans, vehicles, things and the environment into a smart global network that enables new urban mobility services. In this context, IoV can collect large-scale and dynamic data to improve the safety of roads, vehicles and passengers. However, managing the security and authenticity of IoV data is a challenging task, especially in dense urban environments. This paper proposes a new machine learning based framework that prioritizes the authentication of the collected IoV data and enhances the safety awareness of IoV vehicles and users. Simulations are then performed to verify the performance enhancements.

Restricted Multi-Pruning of Decision Trees. Authors: Mohammad Azad*, KAUST; Shahid Hussain, KAUST; Igor Chikalov, KAUST; Mikhail Moshkov, KAUST

Decision trees are extensively used as classifiers. However, the trade-off between decision tree size and good classification accuracy is a research challenge. A good trade-off can be achieved if we create multiple pruned trees from the set of Pareto-optimal points using a dynamic programming approach (the multi-pruning process). However, this process can be extremely slow. We consider a modification of the multi-pruning process (restricted multi-pruning) that requires less memory and time but usually keeps the accuracy of the constructed classifiers. This refinement outperforms CART (in 10 cases out of 15 benchmark decision tables from the UCI ML Repository) and also significantly reduces the time (by a factor of at least 5) compared to the old approach.
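
The notion of Pareto-optimal points can be sketched as follows. This is not the authors' dynamic programming procedure, only a brute-force illustration of the underlying idea; the function name and the (tree size, misclassified examples) pairs are hypothetical. A candidate pruned tree is kept only if no other candidate is at least as good on both criteria and different from it.

```python
def pareto_front(points):
    """Return the (size, errors) pairs not dominated by any other pair.

    A pair q dominates p when q is no larger in both coordinates and
    differs from p, hence strictly smaller in at least one coordinate.
    """
    front = []
    for p in points:
        dominated = any(
            q[0] <= p[0] and q[1] <= p[1] and q != p
            for q in points
        )
        if not dominated:
            front.append(p)
    # Remove duplicates and order by tree size.
    return sorted(set(front))


# Hypothetical pruned trees: (number of nodes, misclassified examples).
candidates = [(15, 2), (9, 5), (9, 7), (5, 9), (12, 5), (3, 14), (7, 9)]
front = pareto_front(candidates)
```

Each point on the resulting front corresponds to one candidate size/accuracy trade-off, from which a pruned tree could then be selected.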

Classification for Inconsistent Decision Tables. Authors: Mohammad Azad*, KAUST; Mikhail Moshkov, KAUST

Decision trees have been used widely to discover patterns from consistent data sets. But if the data set is inconsistent, that is, if there are groups of examples with equal values of conditional attributes but different labels, then discovering the essential patterns or knowledge from the data set is challenging. Three approaches (generalized, most common and many-valued decision) have been considered to handle such inconsistency. The decision tree model has been used to compare the classification results among the three approaches. The many-valued decision approach outperforms the other approaches, and the Mult_ws_entML heuristic gives faster and better prediction accuracy.
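
A minimal sketch of how the three approaches might relabel an inconsistent table is given below. The abstract only names the approaches, so the interpretation here is an assumption: "generalized" replaces conflicting labels with one composite label, "most common" keeps the most frequent label, and "many-valued" attaches the whole set of labels, any of which counts as correct. The function name and the toy rows are hypothetical.

```python
from collections import Counter, defaultdict


def resolve_inconsistency(rows):
    """rows: list of (attribute_tuple, decision) pairs with string decisions.

    Returns three dicts mapping each attribute tuple to its new label
    under the generalized, most common and many-valued interpretations.
    """
    groups = defaultdict(list)
    for attrs, decision in rows:
        groups[attrs].append(decision)

    generalized, most_common, many_valued = {}, {}, {}
    for attrs, decisions in groups.items():
        decision_set = frozenset(decisions)
        # Generalized: the set of decisions becomes one composite label.
        generalized[attrs] = "|".join(sorted(decision_set))
        # Most common: the most frequent decision (ties go to the smallest label).
        counts = Counter(decisions)
        most_common[attrs] = min(counts, key=lambda d: (-counts[d], d))
        # Many-valued: keep the set; predicting any member is acceptable.
        many_valued[attrs] = decision_set
    return generalized, most_common, many_valued


# Hypothetical table: the first three rows agree on attributes but not on labels.
rows = [((0, 1), "a"), ((0, 1), "b"), ((0, 1), "b"), ((1, 1), "c")]
generalized, most_common, many_valued = resolve_inconsistency(rows)
```

After this step each attribute tuple appears once, so a standard decision tree learner can be applied to any of the three relabelled tables.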

Ensemble Late Fusion Classification of Schizophrenia using Multimodal Features. Authors: Hazrat Ali*, COMSATS Institute of Information Technology Abbottabad; Khalid Iqbal, COMSATS Institute of Information Technology Attock

Schizophrenia classification is a challenging task. In this work, we propose the use of an ensemble late fusion approach, which combines probability scores from three different classifiers to classify schizophrenia on the basis of multimodal features. The multimodal features comprise the Functional Network Connectivity features and the Source-Based Morphometry features, concatenated together. Once the prediction probabilities of the proposed approach are obtained, they are submitted to the online evaluation platform of the Kaggle Schizophrenia classification challenge for overall evaluation. Results show that the evaluation score for the ensemble late fusion approach is better than that of the baseline approach.
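
Late fusion of probability scores can be sketched as a weighted average over the classifiers' outputs. The abstract does not state which three classifiers were used or how their scores were combined, so the equal-weight average, the function name and the toy probabilities below are assumptions for illustration only.

```python
def late_fusion(prob_lists, weights=None):
    """Combine per-sample probabilities from several classifiers.

    prob_lists: one list of positive-class probabilities per classifier,
    all of the same length. weights: optional per-classifier weights that
    should sum to 1; equal weights are used when omitted.
    """
    n_models = len(prob_lists)
    if weights is None:
        weights = [1.0 / n_models] * n_models
    n_samples = len(prob_lists[0])
    return [
        sum(w * probs[i] for w, probs in zip(weights, prob_lists))
        for i in range(n_samples)
    ]


# Hypothetical outputs of three classifiers on two subjects.
fused = late_fusion([[0.9, 0.2], [0.6, 0.4], [0.3, 0.3]])
```

The fused scores can be submitted directly as prediction probabilities, which is the form an evaluation platform scoring ranked predictions would typically expect.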

SIDRA: a blind algorithm for signal detection in photometric surveys. Authors: Dimitris Mislis*, Qatar Foundation - QEERI

We present the Signal Detection using Random-Forest Algorithm (SIDRA). SIDRA is a detection and classification algorithm based on the Machine Learning technique (Random Forest). The goal of this paper is to show the power of SIDRA for quick and accurate signal detection and classification. We first diagnose the power of the method with simulated light curves and try it on a subset of the Kepler space mission catalogue. The algorithm uses four features in order to classify the light curves. The training sample contains 5000 light curves and 50000 random light curves for testing. The total SIDRA success ratio is 90%. As a result, our algorithm detects 7.5% more planets than a classic detection algorithm, with better results for lower signal-to-noise light curves. SIDRA promises to be useful for developing a detection algorithm and/or classifier for large photometric surveys such as TESS and PLATO exoplanet future space missions.

Sentiment Features for Leading Bloggers in Virtual Community. Authors: Khalid Iqbal*, COMSATS Institute of Information Technology Attock

A blog is a regularly updated website, run by an individual or a small group, for informal discussion. The blogosphere is a distinct online network in which some bloggers, known as influential bloggers, influence or convince people. Recognizing influential bloggers therefore has extensive use in online marketing, sales prediction and electronic commerce. In this paper, we propose the MIBSF (Model for Influential Bloggers using Sentiment Features) model, based on activity, recognition and sentiment features. Experiments are performed on a real-world social media dataset, i.e., Techcrunch. The performance of MIBSF is compared with existing methods, demonstrating the significant contribution of sentiment features in finding top influential bloggers.

Qurb: Qatar Urban Analytics. Authors: Sofiane Abbar, QCRI, HBKU; Laure Berti*, Qatar Computing Research Institute; Javier Borge-Holthoefer, QCRI, HBKU; Sanjay Chawla, QCRI; Hossam Hammady, QCRI, HBKU; Jaideep Srivastava, QCRI, HBKU

Doha is one of the fastest growing cities of the world with a population that has increased by nearly 40% in the last five years. QCRI has initiated several research projects related to urban computing to better understand and predict traffic mobility patterns in the city of Doha. A key element of our vision is to integrate data from physical and social sensing into what we call socio-physical sensing, and to develop novel analytics approaches to mine urban data from various modalities (i.e., physical sensor data, images, microtexts and texts, structured data from the Web and social media). The overall goal is to help citizens in their everyday life in urban spaces, and also help transportation experts and policy specialists to take a real time data-driven approach towards urban planning and real time traffic planning in the city.

O2PLS mining of multiple metabolomics data from the date fruit reveals a highly dynamic ripening process accounting for major variation in fruit composition. Authors: Ilhame Diboun*, WCMC-Q

Metabolomics techniques may reveal the level of hundreds of small molecules in biological samples. The aim of this study was to characterize the major determinants of date fruit metabolome. To this end, one large cohort of date fruits from the Gulf/Middle East region and a similarly sized collection of North African date varieties were measured with metabolomics techniques in separate batches. A subset of fruits from the second batch was immature and featured varying fruit development stages. In this paper, we highlight the efficiency of OPLS and O2PLS in mining and integrating this heterogeneous collection of metabolomics data to reach the main conclusion that it is the ripening process that causes most variability in the metabolome of dates, also underpinning the difference between the fruit's major phenotypic types being the soft and the dry type.

Topic Detection In Latest Islamic Forum Conversations Using Conceptual Methods. Authors: Wafa Waheeda Syed*, Qatar University; Zeineb Omar Safi, Qatar University; Kalthoum Yousuf Adam, Qatar University; Dr. Ali Jaoua, Qatar University; Dr. Abdelaali Hassaine, Qatar University

With the large amounts of constantly changing data in Islamic forums, there is an impelling need for a tool that extracts discussion topics from the data streams and displays them in a summarized way to the end users. In our work, we build on the idea of conceptual text summarization as a method of information extraction from data, to analyze the conversation text extracted from the forums and display the recent, most discussed topics to the end users using data visualization techniques. A system that crawls an Islamic forum, extracts recent conversation text, detects the most trending topics being discussed using conceptual text summarization and then displays them to the user in the form of a word cloud has been implemented. In most cases the results of our experiments were acceptable.

SepMe: 2002 New Visual Separation Measures. Authors: Michael Aupetit*, QCRI

Our goal is to accurately model human class separation judgements in color-coded scatterplots. Towards this goal, we propose a set of 2002 visual separation measures, by systematically combining 17 neighborhood graphs and 14 class purity functions, with different parameterizations. Using a Machine Learning framework, we evaluate these measures based on how well they predict human separation judgements. We found that more than 58% of the 2002 new measures outperform the best state-of-the-art Distance Consistency (DSC) measure. Among the 2002, the best measure is the average proportion of same-class neighbors among the 0.35-Observable Neighbors of each point of the target class (short GONG 0.35 DIR CPT), with a prediction accuracy of 92.9%, which is 11.7% better than DSC. We also discuss alternative, well-performing measures and give guidelines when to use which. This paper has been accepted for presentation at IEEE PacificVis 2016 conference in Taipei, Taiwan, mid-April 2016.
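
The structure of such a measure (a neighborhood graph combined with a class purity function) can be sketched as below. The 0.35-Observable Neighbors graph is not defined in the abstract, so a plain k-nearest-neighbor graph is substituted here as a stand-in; the function names, the choice of k and the toy scatterplot are all assumptions.

```python
import math


def knn_neighbors(points, k):
    """For each point, return the indices of its k nearest other points."""
    neighbors = []
    for i, p in enumerate(points):
        others = sorted(
            (j for j in range(len(points)) if j != i),
            key=lambda j: math.dist(p, points[j]),
        )
        neighbors.append(others[:k])
    return neighbors


def class_purity(points, labels, target, k=2):
    """Average proportion of same-class neighbors over the points of the
    target class, using a k-NN graph in place of the paper's graph."""
    neighbors = knn_neighbors(points, k)
    proportions = [
        sum(labels[j] == target for j in neighbors[i]) / len(neighbors[i])
        for i in range(len(points))
        if labels[i] == target
    ]
    return sum(proportions) / len(proportions)


# Hypothetical scatterplot with two well-separated classes.
points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
labels = ["a", "a", "a", "b", "b", "b"]
purity_a = class_purity(points, labels, target="a", k=2)
```

A value near 1 indicates that the target class is visually separated from the others, while values near the class's overall proportion suggest heavy overlap.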

Pessimistic Uplift Modeling. Authors: Atef Shaar*, Telecom ParisTech

Uplift modeling is a data mining technique that seeks to model the heterogeneity in treatment effects in order to predict the differences in class variable behavior between different environments. Uplift modeling helps solve problems in many fields, such as marketing, finance and the health sector. Uplift models tend to be very sensitive to noise in the data, which leads to unreliable results. We present the various attempts to model heterogeneous treatment effects. We introduce a new approach that is based on noise minimization, using regular predictive modeling algorithms.

Semantic and Visual Cues for Humanitarian Computing of Natural Disaster Damage Images. Authors: Hadi Jomaa*, American University of Beirut; Mariette Awad, American University of Beirut

Identifying different types of damage is essential in times of natural disasters, where first responders are flooding the internet with often annotated images and texts, and rescue teams are overwhelmed when prioritizing often scarce resources. While most of the efforts in such humanitarian situations rely heavily on human labor and input, we propose in this paper a novel hybrid approach to help automate more humanitarian computing. Our framework merges low-level visual features that extract color, shape and texture with a semantic attribute that is obtained after comparing the picture annotation to a bag of words. These visual and textual features were trained and tested on a dataset gathered from the SUN database and some Google Images. The best accuracy obtained using low-level features alone is 91.3%, while appending the semantic attributes raised the accuracy to 95.5% using a linear SVM and 5-fold cross-validation, which motivates the claim that an annotated image is worth a thousand words.

FC-Sweeper: Extracting and Navigating within the top-k formal concepts. Authors: Amira Mouakher*, University of Tunis El Manar; Sadok Ben Yahia, University of Tunis El Manar

Concept lattices have been shown to be of benefit within the task of knowledge discovery in databases. However, their effective use on large datasets has always been hampered by the overwhelming number of formal concepts drawn. To filter such hazy covering graphs of formal concepts, quality metrics could be of valuable help. In this paper, we introduce a new approach that relies on a multi-criteria aggregation algorithm for extracting the top-k formal concepts. To the best of our knowledge, the FC-Sweeper approach is the one that tackled such an issue. Snapshots of the developed prototype indicate that it could be of worthwhile help for large dataset exploration.

Deep Neural Network for the Prediction of Response to Therapy based on Viral Genome. Authors: Sabri Boughorbel*, Sidra; Jana Blazkova, Sidra; Rawan AlSaad, Sidra Medical & Research Center; Nagarajan Kathiresan, Sidra; Rashid Al-Ali, Sidra

The use of antiretroviral treatment is the gold standard for HIV-1 therapy. However, treatment failures can occur and patients are still at risk of developing drug resistance. Therefore, developing advanced computational techniques for the analysis of the virus genome is crucial for identifying the sequences that are responsible for improvement in patient condition, and hence for helping to discover new therapies. We present our results on the application of a deep neural network pre-trained with a stacked auto-encoder for the prediction of HIV-1 treatment progression. We extracted a set of features from the virus genome, namely known mutations and epitopes. We evaluated the method on a publicly available dataset of 1000 patients who contracted the HIV-1 virus and were under treatment at the start of the measurement. The obtained prediction performance is 76.1% in terms of AU-ROC. The analysis of relative variable importance revealed key predictors of treatment progression from the virus genome.
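
For readers unfamiliar with the AU-ROC metric reported above, the sketch below computes it through its pairwise interpretation: the probability that a randomly chosen positive case receives a higher score than a randomly chosen negative one, with ties counted as one half. The function name, scores and labels are hypothetical and unrelated to the study's data.

```python
def auroc(scores, labels):
    """Area under the ROC curve via pairwise comparison of positives and negatives.

    scores: predicted scores; labels: 1 for positive, 0 for negative.
    """
    positives = [s for s, y in zip(scores, labels) if y == 1]
    negatives = [s for s, y in zip(scores, labels) if y == 0]
    # A positive/negative pair scores 1 if ranked correctly, 0.5 on a tie.
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in positives
        for n in negatives
    )
    return wins / (len(positives) * len(negatives))


# Hypothetical predictions for four patients.
example_auc = auroc([0.9, 0.8, 0.4, 0.3], [1, 0, 1, 0])
```

In this toy case three of the four positive/negative pairs are ranked correctly, giving 0.75; a value of 0.5 would correspond to random ranking.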

Mobile App Conceptual Browser Utilization for Hidden Data Analytics of Online Marketplaces. Authors: Aboubakr Aqle*, Qatar University; Eman Rezk, Qatar University; Fahad Islam, Qatar University; Ali Jaoua, Qatar University

In this paper, an extensive analysis and evaluation of existing e-marketplaces is performed to improve the end-user experience through a mobile app that can integrate multiple heterogeneous hidden data sources and unify the received responses into one single, structured and homogeneous source. The user can reformulate the query through the interface. The proposed Android mobile app is based on the multi-level conceptual analysis and modeling discipline, in which the data is analyzed in a way that helps discover the main concepts of any unknown domain captured from the hidden web. These concepts are structured as a tree-based interface for easy navigation and query reformulation. The application has been evaluated through substantial experiments that rely on data analytics for discovering concepts. The results showed that analyzing the query results and restructuring the output in a conceptual multilevel mechanism before displaying it to the end user are reasonably effective.

Sentiment Analysis using Hyper Rectangular Decomposition: Application to Comments on News Articles. Authors: Khalid Al-Kubaisi, Qatar University; Dr. Abdelaali Hassaine*, Qatar University; Dr. Ali Jaoua, Qatar University

Sentiment Analysis aims at understanding opinions expressed by the crowd on the Internet. Typical applications include understanding general opinions expressed in comments on News articles, learning more about movie reviews, and understanding opinions expressed about a certain political candidate, a recently released product, etc. Sentiments generally fall into three main categories: positive, negative and neutral. In this article we present a novel method to predict the sentiment associated with comments on News articles. In order to classify sentiments, we used the hyper concept algorithm, which represents the corpus of comments by a binary relation and then extracts the most relevant representative keywords for each category of sentiment. The extracted keywords are used as predictors and combined through a Random Forest classifier in order to predict the sentiment of each comment. The method achieves 89.77% accuracy on a database of more than 5000 comments from the Al Jazeera website.

Person Pose Detection and Segmentation in Dense Spectator Crowds. Authors: Muhammad Shaban*, Qatar University; Arif Mahmood, Qatar University; Nasir Rajpoot, University of Warwick

Person detection and segmentation is a key step for person tracking, action recognition and activity classification in dense crowds. It is very challenging due to the high level of inter-person occlusion and low resolution. We propose a segmentation algorithm for dense spectator crowd images that segments each person holistically by inferring the relative positions of surrounding persons, as opposed to previous methods, which segment each person independently of their neighborhood. Our approach uses articulated human poses and their relative spatial positions to resolve horizontal and vertical person-person occlusions and generates a segmentation mask for the whole image. Our technique works equally well for non-crowded images. Experimental results show that our approach achieves good performance on images with a high degree of occlusion.

Automatizing Job Offers Recommendation with Time Series Forecasting and Semantic Classification. Authors: Sidahmed Benabderrahmane*, Paris 8 University

Nowadays, the most prominent way to attract job candidates is through dedicated web-based portals, whose related data are then processed automatically using several optimized algorithms. In this environment, with the goal of sharing job offers as effectively as possible, many online job boards have been created, and choosing among them can sometimes be very hard for recruiters who aim to attract the best possible candidates in the shortest amount of time. We propose a novel job board recommendation system that aims at estimating the best potential job boards for a given textual job offer. Our recommendation system is based on a hybrid representation that combines semantic knowledge and time series forecasting. The semantic classification of job boards requires a textual analysis using domain knowledge. The time series analysis module predicts the best job board for a given offer. The proposed system has been evaluated on real data, and preliminary results seem very promising.

Pattern recognition in hyperspectral data acquired during surgical procedures: Differentiation between nerve and adipose tissue. Authors: Rutger Schols, Maastricht University Medical Center & NUTRIM School for Nutrition, Toxicology and Metabolism; Mark ter Laan, Radboud University Nijmegen Medical Center & Canisius Wilhelmina-Hospital; Laurents Stassen, Maastricht University Medical Center & NUTRIM School for Nutrition, Toxicology and Metabolism; Nicole Bouvy, Maastricht University Medical Center & NUTRIM School for Nutrition, Toxicology and Metabolism; Fokko Wieringa, TNO; Lejla Alic*, TNO

Intraoperative nerve localization is extremely important during surgery, especially laparoscopy. This is particularly challenging when nerves show visual resemblance to surrounding tissue. An example of such a delicate procedure is thyroid and parathyroid surgery, where iatrogenic injury of the recurrent laryngeal nerve can result in transient or permanent vocal problems. A camera system, enabling nerve-specific image enhancement, would be useful in preventing such complications. Hyperspectral camera technology has a potential to provide a nerve-specific image enhancement. As a first step towards such a dedicated camera system, we evaluated the availability of useful spectral tissue signatures by diffuse reflectance spectroscopy using silicon (Si) and indium gallium arsenide (InGaAs) sensors. The spectral signatures from the combined Si & InGaAs bandwidth ranges 350–1,830 nm (1 nm spectral resolution) were used to develop a classifier.