MLDAS 2018 Contributed Submissions

Capsule-Net for Urdu Digits Recognition. Authors: Talha Iqbal and Hazrat Ali, COMSATS Institute of Information Technology

A capsule is formed when a group of additional neurons is added to an existing convolutional layer in a typical convolutional neural network (CNN). A capsule has an activity vector that represents the instantiation parameters of an object or part of an object. Capsule networks were recently introduced by Hinton to overcome the shortcomings of the typical CNN model trained with back-propagation. In this work, we investigate the use of capsule networks for the recognition of handwritten Urdu digits. Our results show that a multi-layer capsule network achieves better results (98.5% accuracy) than a deep auto-encoder (97.3% accuracy), especially when the digits are highly overlapped.

Fraud Detection in Financial Transactions with Deep Learning. Authors: Muhammad Muneeb Saad and Hazrat Ali, COMSATS Institute of Information Technology Abbottabad, Pakistan

This paper presents the detection of fraudulent activity in financial transactions using a Bidirectional Long Short-Term Memory (BLSTM) network. Fraudulent transactions cause billions of dollars in losses to financial institutions around the globe, and detecting them is a challenging task. Various machine learning techniques are applied to protect this environment from such criminal activities. In this paper, we apply a BLSTM network to time-series financial data to classify fraudulent transactions. Experimental results on data obtained from [6] show an overall accuracy of 95.3% for the given task.

Next utterance ranking based on context response similarity. Authors: Basma El Amel BOUSSAHA, Nicolas Hernandez, Christine Jacquin, and Emmanuel Morin, University of Nantes

Building dialogue systems that converse with humans in order to help them in their daily tasks has become a priority. Some systems converse by generating dialogues word by word, whereas others retrieve the best utterance from a set of candidate responses. These retrieval systems rank the candidate responses by their relevance to the history of the conversation (the context); the best response is then chosen. Approaches based on deep neural networks have performed well on this task. In this work, we improve a state-of-the-art approach based on an LSTM dual encoder and propose a new response retrieval dialogue system. Based on syntactic and semantic similarities between the context and the response, extracted from word embeddings, our approach learns to match the context with the best response. Experimental results on the Ubuntu Dialogue Corpus show an improvement of about 7%, 6% and 2% on Recall@1, Recall@2 and Recall@5, respectively, compared to the best state-of-the-art system.
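
For readers unfamiliar with the metric, Recall@k in next-utterance ranking can be sketched as below. This is a minimal illustration, assuming each context has exactly one correct response among its candidates; the function name `recall_at_k` and the toy scores are hypothetical and are not taken from the paper.

```python
from typing import Sequence


def recall_at_k(scores: Sequence[Sequence[float]],
                correct_index: Sequence[int],
                k: int) -> float:
    """Fraction of contexts whose correct response is ranked in the top k.

    scores[i][j] is a model's score for candidate j of context i
    (higher means more relevant); correct_index[i] is the position of
    the ground-truth response among the candidates of context i.
    """
    hits = 0
    for cand_scores, gold in zip(scores, correct_index):
        # Rank candidate indices from highest to lowest score.
        ranked = sorted(range(len(cand_scores)),
                        key=lambda j: cand_scores[j],
                        reverse=True)
        if gold in ranked[:k]:
            hits += 1
    return hits / len(scores)


# Hypothetical toy data: three contexts, four candidate responses each.
toy_scores = [
    [0.9, 0.1, 0.3, 0.2],  # gold candidate (index 0) is ranked 1st
    [0.2, 0.8, 0.5, 0.1],  # gold candidate (index 2) is ranked 2nd
    [0.4, 0.3, 0.2, 0.1],  # gold candidate (index 3) is ranked 4th
]
toy_gold = [0, 2, 3]

for k in (1, 2, 4):
    print(f"Recall@{k} = {recall_at_k(toy_scores, toy_gold, k):.3f}")
```

Larger values of k are more forgiving, which is why a gain at Recall@1 is usually harder to obtain than the same gain at Recall@5.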

On the Predictive Analysis of Behavioral Massive Job Data Using Embedded Clustering and Deep Recurrent Neural Networks. Authors: Anwar Yahya*, Najran University

Teaching effectiveness is a multidimensional construct in which teacher questioning skill is one of the key indicators. This paper explores the feasibility of applying Machine Learning (ML) to analyze teaching effectiveness using a data set of teachers’ questions. More specifically, the performance of nine ML techniques is investigated for the classification of teachers’ classroom questions into the six Bloom's cognitive levels (BCLs). To do so, the data set has been annotated with BCLs and transformed into a suitable representation, and the ML techniques have been applied under different dimensionalities. The results confirm the feasibility of using ML for analyzing teaching effectiveness. Moreover, because each technique differs in its sensitivity to the curse of dimensionality, the performance varies. Most remarkably, the Support Vector Machine and Random Forest techniques show a striking performance, whereas AdaBoost and J48 show a sharp performance deterioration as the dimensionality increases.

Application of Recommender Systems in Emerging Domains. Authors: Sidahmed Benabderrahmane, The University of Edinburgh

This paper presents a new job board recommender system that intends to guide recruiters while they are posting a job on the Internet. First, a Doc2Vec embedded representation is used to analyze the textual content of the offers; the job applicants' click history on various job boards is then stored in a large database and represented as time series. Second, a deep neural network is used to predict future values of the clicks on the job boards. Third, and in parallel, dimensionality reduction techniques are used to transform the numerical click series into temporal symbolic sequences. Forecasting algorithms are then used to predict future symbols for each sequence. Finally, a list of top-ranked job boards is kept by maximizing the click forecasts in both representations. Our experiments were conducted on a real dataset, and the promising results show that, using deep learning, the recommender system outperforms standard models.
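
One common way to turn a numerical series into a symbolic sequence is piecewise aggregate approximation followed by discretization, the idea behind SAX. The abstract does not say which technique the authors use, so the sketch below is only an assumed illustration; the function name `to_symbols`, the three-letter alphabet, and the breakpoints (approximate Gaussian tertiles) are choices made here, not details from the paper.

```python
from statistics import mean, pstdev
from typing import Sequence


def to_symbols(series: Sequence[float],
               n_segments: int,
               breakpoints: Sequence[float] = (-0.43, 0.43),
               alphabet: str = "abc") -> str:
    """Convert a numeric series into a short symbolic string."""
    if not 0 < n_segments <= len(series):
        raise ValueError("n_segments must be between 1 and len(series)")
    if len(alphabet) != len(breakpoints) + 1:
        raise ValueError("need exactly one more symbol than breakpoints")

    # 1. Z-normalize so the breakpoints are comparable across series.
    mu = mean(series)
    sigma = pstdev(series) or 1.0  # avoid division by zero for flat series
    z = [(x - mu) / sigma for x in series]

    # 2. Piecewise aggregate approximation: mean of equal-length segments.
    #    Trailing points that do not fill a whole segment are dropped.
    seg_len = len(z) // n_segments
    paa = [mean(z[i * seg_len:(i + 1) * seg_len]) for i in range(n_segments)]

    # 3. Discretize: the symbol index is the number of breakpoints exceeded.
    return "".join(alphabet[sum(v > b for b in breakpoints)] for v in paa)


# Hypothetical click counts for one job board over six periods.
print(to_symbols([1, 1, 5, 5, 9, 9], n_segments=3))
```

A forecasting model can then be trained on the resulting strings instead of the raw values, which trades numerical precision for a much smaller state space.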

UNIVERSITY IMAGE AS A FUNCTION OF STUDENT SATISFACTION (Estimation of a Structural Model for Student Satisfaction Survey (2015 – 16), Qatar University). Authors: Zainab Siddiqui and Abdel-Salam Gomaa Abdel-Salam, Qatar University

The aim of this study is to investigate how well the unobserved ‘student satisfaction’ works as a predictor of Qatar University's ‘image’. Using data from the student satisfaction survey (2015–2016), retrieved from the Office of Institutional Planning & Development at QU, a structural model was tested for undergraduate student satisfaction in relation to the image of Qatar University. Student satisfaction was assumed to be a function of academic services, student services, IT & admin services, and admin feedback. Correlation and regression analysis, reliability analysis, confirmatory factor analysis (CFA), and finally structural equation modeling (SEM) were used to examine the proposed model using AMOS and SPSS. In general, undergraduate students are found to be satisfied with the services provided by Qatar University. Significant direct and mediated effects on the QU image are observed from student satisfaction with the services and facilities provided.

Zero-day Intrusion Detection based on Machine Learning Techniques of Hardware Performance Counter Signatures. Authors: Ansam Khraisat, Federation University

Researchers have proposed different Intrusion Detection System (IDS) techniques that rely on program behavior or operating-system-level information to detect malware. Most of these techniques use high-level semantic features such as function calls and system calls. However, these high-level semantic features are still susceptible to malicious attack at a higher privilege level. For instance, zero-day malware may bypass both of these levels. In this paper, a low-level intrusion detection system is proposed; the central idea is to use Hardware Performance Counters (HPCs) to build IDSs that detect malware as its execution begins. HPC features are extracted from both malware and normal applications for use in the IDS. The effectiveness of the proposed approach has been validated by testing it on the Zeus malware. The outcome is a detection rate of 99%, which is acceptable for an intrusion detection system.

An Anomaly Intrusion Detection System Using C5 Decision Tree Classifier. Authors: Ansam Khraisat, Federation University

Due to the increase in intrusion activities over the Internet, many intrusion detection systems have been proposed to detect abnormal activities, but most of them suffer from a common problem: they produce a high number of alerts and a huge number of false positives. As a result, normal activities may be classified as intrusion activities. This paper examines different data mining techniques that could minimize both the number of false negatives and the number of false positives. The effectiveness of the C5 classifier is examined and compared with that of other classifiers. Results show that false negatives are reduced and intrusion detection is improved significantly. Minimizing the false positives has also reduced the number of false alerts. In this study, multiple classifiers have been compared with the C5 decision tree classifier using the NSL-KDD dataset, and the results show that C5 achieves high accuracy and a low false alarm rate.
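
The two error types discussed above can be made concrete with a small sketch. It assumes binary labels where 1 means intrusion and 0 means normal traffic; the function name `alarm_rates` and the toy labels are hypothetical and unrelated to the NSL-KDD results reported in the paper.

```python
from typing import Sequence, Tuple


def alarm_rates(y_true: Sequence[int],
                y_pred: Sequence[int]) -> Tuple[float, float]:
    """Return (false_positive_rate, false_negative_rate).

    Labels: 1 = intrusion, 0 = normal activity.
    """
    tp = fp = tn = fn = 0
    for truth, pred in zip(y_true, y_pred):
        if truth == 1 and pred == 1:
            tp += 1
        elif truth == 0 and pred == 1:
            fp += 1  # normal activity flagged as an intrusion (false alarm)
        elif truth == 0 and pred == 0:
            tn += 1
        else:
            fn += 1  # intrusion that went undetected
    fpr = fp / (fp + tn) if (fp + tn) else 0.0
    fnr = fn / (fn + tp) if (fn + tp) else 0.0
    return fpr, fnr


# Hypothetical labels: three intrusions and five normal connections.
truth = [1, 1, 1, 0, 0, 0, 0, 0]
preds = [1, 1, 0, 1, 0, 0, 0, 0]
print(alarm_rates(truth, preds))
```

Reporting both rates matters because a detector can trivially drive one of them to zero by always predicting the same class.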

Clustered Deep Learning Algorithm for Traffic Analytics. Authors: Abdulaziz Al-Homaid, HBKU, and Abdelkader Baggag, QCRI - HBKU

Real-time and reliable traffic prediction is a critical problem for intelligent transportation systems in fast-growing cities like Doha. However, large-scale network traffic prediction is challenging due to the complex topological dependencies between neighboring and global road segments. The prediction model is required to forecast longer-term futures to reflect congestion propagation at a macroscopic level, and to support what-if analysis in case of road removal or maintenance. This would aid the design of peripheral control and make it possible to reallocate traffic police resources intelligently. In this paper, we address these challenges via a spatio-temporal matrix formulation for time series prediction. We experiment with different training techniques on two cities: Anaheim and Oakland. Our proposed deep neural network training technique demonstrates better performance in terms of speed and accuracy than common training techniques. The code and latent weights are publicly available on GitHub.

Temporal Regularized Tensor Factorization for Monitoring and Forecasting of Traffic Congestion. Authors: Abdelkader Baggag, QCRI - HBKU; Sofiane Abbar, QCRI - HBKU; Ankit Sharma, Department of Computer Science, University of Minnesota; Tahar Zanouda, QCRI - HBKU; Jaideep Srivastava, Department of Computer Science, University of Minnesota; and Fethi Filali, QMIC - HBKU

In this paper, we investigate the problem of missing data in the context of real-time monitoring and forecasting of traffic congestion for road networks. We assume that the city has deployed sensors for speed reading in a subset of edges. Our objective is to infer speed readings for the remaining edges in the network as well as missing values. We propose a tensor representation for the series of road network snapshots, and develop a regularized factorization method to estimate the missing values, while learning the latent factors of the network. The regularizer, which incorporates spatial properties of the road network, improves the quality of the results. The learned factors are used in an autoregressive algorithm to predict the future state of the road network with a long horizon. Extensive numerical experiments with real traffic data from the cities of Doha and Aarhus demonstrate that the proposed approach is appropriate for imputing missing data and predicting traffic states.

Research Roadmap for Automatic Persona Generation: Principles and Open Questions. Authors: Joni Salminen, Bernard J. Jansen, Jisun An, Haewoon Kwak, and Soon-Gyo Jung, QCRI - HBKU

As the quantity of online analytics data has dramatically increased, computational techniques are deployed to make sense of this data. In this perspective manuscript, we propose employing personas as a form of making large amounts of customer analytics information useful to decision makers in software development, business, and other domains where understanding customer behavior is important. Toward this end, we develop a system capable of handling hundreds of millions of customer interactions from tens of thousands of pieces of online content. Our approach identifies customer segments by their online behavior, associates the segments with demographic data, and creates rich persona profiles by dynamically adding characteristics, such as name, photo, and descriptive quotes. This manuscript characterizes the open research questions in automatic persona generation, outlining a research agenda that aims at making data analytics more useful for human decision makers.

DFBICA: A New Distributed Approach For Sentiment Analysis of Bibliographic Citations. Authors: Sadok Ben Yahia, Faculty of Sciences of Tunis

Sentiment analysis of citations in scientific papers is a new and interesting research area. In this paper, we focus on the problem of automatically identifying the positive and negative sentiment polarity of citations in scientific papers. We conducted empirical research to investigate the classification of positive and negative citations. The classification is based on word vectors as a feature space, onto which the examined citation context is mapped. In order to handle the huge amount of data, we have implemented our proposed approach in a distributed manner, following the MapReduce paradigm, on the Hadoop framework.

MORECA: A new contextual recommendation approach based on the Analytic Hierarchy Process (AHP). Authors: Amira Mouakher, University of Tunis El Manar

With the overwhelming growth and complexity of online information, recommender systems have become an effective tool for dealing with information overload [1]. Such a system can estimate a user’s interest in a given resource from information collected about the preferences of similar users. In this paper, we introduce a new contextual recommendation approach based on the Analytic Hierarchy Process (AHP). This work includes a bibliographic review of previous studies that proposed context-based recommendation systems in the film domain. The goal is to present a new approach for recommending movies based on the user context. To this end, we rely on multi-criteria decision-making methods, and more specifically on the Analytic Hierarchy Process, to integrate context into the recommendation process. The experiments carried out show that our approach obtains very encouraging results in terms of precision and recall.
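
The core AHP step, deriving priority weights for criteria from a matrix of pairwise comparisons, can be sketched as follows. This uses the row geometric mean approximation, a common alternative to the principal eigenvector; the abstract does not state which variant the authors use. The function name `ahp_weights`, the three context criteria, and the comparison values are all hypothetical.

```python
import math
from typing import List


def ahp_weights(pairwise: List[List[float]]) -> List[float]:
    """Approximate AHP priority weights by the row geometric mean method.

    pairwise[i][j] states how much more important criterion i is than
    criterion j; the matrix is expected to be reciprocal, that is,
    pairwise[j][i] == 1 / pairwise[i][j].
    """
    n = len(pairwise)
    # Geometric mean of each row (math.prod needs Python 3.8 or later).
    geo_means = [math.prod(row) ** (1.0 / n) for row in pairwise]
    total = sum(geo_means)
    # Normalize so that the weights sum to 1.
    return [g / total for g in geo_means]


# Hypothetical context criteria for a movie recommendation:
# companion, time of day, location.
comparisons = [
    [1.0,   2.0,   6.0],  # companion vs. (companion, time, location)
    [1 / 2, 1.0,   3.0],  # time of day
    [1 / 6, 1 / 3, 1.0],  # location
]
print([round(w, 3) for w in ahp_weights(comparisons)])
```

For a perfectly consistent matrix such as this one, the geometric mean and eigenvector methods give the same weights; for inconsistent judgments they can differ slightly, and AHP practice usually also checks a consistency ratio, which is omitted here.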

Internet Traffic Classification using Support Vector Machine and Artificial Neural Network. Authors: Fatima Haouari, Tooba Salahuddin, Nesreen Jboor, Deval Bhamare, and Aiman Erbad, Qatar University

Accurate and timely traffic classification is critical in network security monitoring and traffic engineering. Current trends include using machine learning (ML) techniques for this classification. In this paper, we use a recent publicly available dataset to build a variety of traffic classifier models. The purpose of the experiments is to compare the performance of support vector machines (SVMs) and artificial neural networks (ANNs). Multilayer Perceptron (MLP) and Library for Support Vector Machines (LIBSVM) were used to implement the ANN and the SVM, respectively, using the Weka tool. We conducted supervised learning using reduced-feature datasets generated by two different feature selection methods. For SVM-based classifiers, we observe the impact of the kernel function and the penalty parameter C on the results. Furthermore, the effect of varying the number of hidden layers, the learning rate and the momentum parameter on the MLP-based classifiers is evaluated.

“Carried away by daily work, I, for a while had considered reporting as something that would involve pulling metrics and displaying it.”: Use Cases and Outlooks for Automatic Analytics. Authors: Joni Salminen and Bernard J. Jansen, QCRI - HBKU

The landscape of analytics is changing rapidly. Much of online user analytics, however, is based on collection of various user analytics numbers. Understanding these numbers, and then relating them to higher numerical analysis for the evaluation of key performance indicators (KPIs), can be quite challenging, especially with large volumes of data. There is a plethora of tools and software packages that one can employ. However, these tools and packages require a quantitative competence and analytical sophistication that average end users often do not possess. Additionally, they often do little to reduce the complexity of numerical data in a manner that allows ease of use in decision making and communication. Dealing with numbers poses cognitive challenges for individuals, who often cannot recall many numbers at a time. Here, we explore the concept of automatic analytics by presenting use case examples and discussing the current state and future of automated insights.

The Use of Behavioural Data in Dynamic Credit Risk Assessment: A Case Study of the Qatari Banking Market. Authors: Ahmad Abd Rabuh, University of Portsmouth

This research focuses on the use of psychometrics and personality traits to predict financial decisions, particularly those of people applying for loans from a lending institution within the Qatari banking sector. The abundance of data and its five-V characteristics have allowed many lending institutions to better judge one’s ability and willingness to repay a loan. Our research will introduce current credit modelling and the methods used to rate individuals within the traditional and analytical frameworks. We will propose a dynamic modelling approach which scores individuals continuously as they change their behaviour. A review of the literature will also highlight the work of researchers who were able to eliminate financial risks using streamlined large-scale data. Our method included interviewing five bankers from the State of Qatar who work in credit-related departments in both types of banks, commercial and Islamic, in order to understand the status quo, opportunities, and challenges.


Cell deformation is regulated by complex underlying biological mechanisms associated with spatial and temporal morphological changes in the nucleus. Quantitative analysis of changes in the size and shape of nuclear structures in 3D microscopic images is important not only for investigating nuclear organization, but also for detecting and treating pathological conditions such as cancer. Multiple methods have been proposed to classify cell and nuclear morphological phenotypes in 3D; however, there is a lack of publicly available 3D data for the evaluation and comparison of such algorithms. To address this problem, we present a dataset containing a total of 1,433 segmented nuclear and 3,282 nucleolar binary masks. We also provide a baseline evaluation of a number of popular classification algorithms using voxel-based morphometric measures. Original and derived imaging data are made publicly available.

Estimating Classifier Accuracy Under Limited Resources. Authors: Sabit Hassan, Shaden Shaar, Bhiksha Raj, and Saquib Razak, Carnegie Mellon University in Qatar

In this paper, we propose strategies to estimate the accuracy of classifiers when we cannot obtain true labels for the whole dataset because of limited resources. We want to select a subset of the dataset for which to obtain true labels such that it provides a good estimate of classifier accuracy. We use techniques based on stratified sampling to address this problem. However, stratified sampling poses two challenges: i) how to stratify the data so that sampling from the strata results in a good estimate of accuracy, and ii) how to allocate samples among the strata. In this paper, we propose a method for stratifying the data and then present two novel algorithms to approximate the optimal allocation. Our algorithm for stratification can result in up to a 30% reduction in variance, and our methods for approximating the optimal allocation can result in up to a 50% reduction in MAE compared to existing methods. Our methods can also be used to evaluate the accuracy of human labeling.
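
The general idea of a stratified accuracy estimate can be sketched as follows. This is a textbook estimator with proportional allocation, shown only for orientation; it is not the stratification method or either of the allocation algorithms proposed in the paper. The function name `stratified_accuracy`, the confidence-based strata, and the toy data are hypothetical.

```python
import random
from collections import defaultdict
from typing import Callable, Hashable, Sequence, TypeVar

T = TypeVar("T")


def stratified_accuracy(items: Sequence[T],
                        stratum_of: Callable[[T], Hashable],
                        is_correct: Callable[[T], bool],
                        budget: int,
                        seed: int = 0) -> float:
    """Estimate classifier accuracy from roughly `budget` labeled items."""
    rng = random.Random(seed)

    # Group the items into strata.
    strata = defaultdict(list)
    for item in items:
        strata[stratum_of(item)].append(item)

    total = len(items)
    estimate = 0.0
    for members in strata.values():
        weight = len(members) / total
        # Proportional allocation with at least one label per stratum,
        # so the number of labels used can slightly exceed the budget.
        n_h = min(len(members), max(1, round(budget * weight)))
        sample = rng.sample(members, n_h)
        # is_correct is the expensive step: it requires a true label.
        acc_h = sum(is_correct(x) for x in sample) / n_h
        # Weight each stratum's accuracy by its share of the data.
        estimate += weight * acc_h
    return estimate


# Hypothetical data: (confidence bucket, whether the prediction is right).
toy = [("high", True)] * 60 + [("low", False)] * 40
print(stratified_accuracy(toy, lambda x: x[0], lambda x: x[1], budget=10))
```

When the strata are internally homogeneous, as in this toy case, the estimate has no sampling variance at all, which illustrates why the choice of stratification matters so much.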