Saeed Moghaddam
Ph.D. Candidate 

Department of Computer & Information Science & Engineering, 
University of Florida
 
 

Saeed Moghaddam is a Ph.D. candidate in Computer Engineering at the University of Florida (UF) and a research assistant at the Mobile Networking Lab (NOMADS group) under the supervision of Dr. Ahmed Helmy. He also collaborates closely with the data mining group under the supervision of Prof. Sanjay Ranka and, to a lesser extent, works with the database group under the supervision of Dr. Alin Dobra (on the DataPath project) as part of his research. Saeed received his M.S. in Information Technology (Software Development & Design) from Iran University of Science & Technology (IUST) in 2006 and his B.S. in Computer Engineering (Software) from Amirkabir University of Technology/Tehran Polytechnic (AUT) in 2003. He was the director of the Advanced Information Technology Lab (AITL) at AUT (2003-2007) and Principal Founder/CEO of Rayneel Inc. (2000-2006) before joining UF in 2008.

In 1996, he received the Elite Student Award at the 6th National Olympiad of Computer and Informatics. In 1998, he was ranked 200th among over 400,000 participants in the National University Entrance Exam. In 2000, he was selected as a Distinguished Student in the ACM Programming Competition at AUT, and in the same year he founded Rayneel Inc. In 2001, he received the Best Software Award for the best-selling software of the month, Recommender System for Academic Studies. In 2003, he finished his B.S. with the 2nd ranking and was ranked 25th in the National University Entrance Exam for graduate studies. In the same year, his B.S. thesis work on e-learning systems received a $130,000 ITRC Research & Development Award. He was the leader of the ITRC-funded project, E-learning Platform, during 2003-2004, in which 15 researchers/developers were involved. In 2005, as the director of AITD, his proposal for a National E-learning Platform was one of three ITRC nominees for an $800,000 R&D Award. In 2006, he finished his M.S. with the first ranking and was selected as a Distinguished Graduate Student in the Nation. In 2007, he was awarded the Membership of the National Elites Foundation, the most prestigious scientific and professional foundation in the nation, with fewer than 1000 members. In 2008, he joined UF, where he was awarded full departmental assistantship support. In 2009, he ranked 1st in the Ph.D. qualifying exam with a grade of 81/100 when the passing requirement was 45-50/100. Aside from works under submission or in progress, his research has led to 6 published/accepted papers in top-quality conferences/journals including ACM MSWiM, ACM SIGMOBILE MC2R, IEEE MASS/SCENES, IEEE IWCMC and IEEE INFOCOM in 2010/2011. In late 2011, Saeed received the University of Florida Outstanding International Student Award, which was given to only 15 students among all students of the College of Engineering.

His research interests include mobile social networks, data mining, data-intensive computing, pervasive and ubiquitous computing, human-centered computing, software engineering and e-learning. Currently, he is working on making sense of tera-scale netflow logs for designing new human-centered networking protocols. Saeed has been a program committee member of IC3’10, PSC'06, EEE'06, SERP'06, ICWN’6 and IKT'05, and the conference manager of IKT'03. He is a member of ACM & IEEE.

Biography
Contact Information

Office: E401 CSE Building

Phone: 352 871 8198

Email: saeed at cise dot ufl dot edu

Education

Ph.D. in Computer Engineering   

University of Florida, Gainesville, FL

Supervisor: Dr. Ahmed Helmy

Thesis: Data-driven Multi-dimensional Analysis and Modeling of Mobile Online Activities and Traffic with Applications in Simulation and Networking Services Design

GPA: 3.83/4

Jan 2008 - Dec 2011 (expected)

 

M.S. in Information Technology (Software)

Iran University of Science and Technology, Tehran, Iran

Supervisor: Dr. Saeed Parsa

Thesis: Context-Aware Information Architecture in Pervasive Environment.

GPA: 17.56/20

Mar. 2006.

 

B.S. in Computer Engineering (Software)

Amirkabir University of Technology, Tehran, Iran

Thesis: Analysis, Design and Development of an E-learning Platform

Supervisor: Dr. Kazem Akbari

GPA: 16.37/20

Sep. 2003

CV

Activity-Based Mobile Social Networking (2008-present)

Mobile social networks are going to create new models of interaction among people. Currently, many people use online communities to connect with their friends or make new friends, and this trend is expected to grow with the emergence of mobile social networks. Social network analysis views social relationships in terms of nodes and ties. Nodes are the individual actors within the networks, and ties are the relationships between the actors. There can be many kinds of ties between the nodes, such as values, visions, ideas, financial exchange, friendship, kinship, dislike, conflict or trade. Among different ties, we are mostly interested in those which can be inferred from network traces. Using NetFlow and WLAN traces, we are able to extract information on people's locations and their requests over the network (i.e., all their interactions with different websites or other resources). The raw information on users' actions can be directly analyzed to find similarities, patterns and trends, and applied as indicators of social ties based on low-level user actions. However, it can also be used to create a higher-level model of social ties based on people's activities. According to Activity Theory, an activity can be modeled as a set of human actions. Thus, we should be able to characterize people's activities over the network based on their request actions. Activities might also be investigated further to create another model of social ties rooted in people's identities. By establishing new types of ties, we aim to design new methods and protocols for sharing, searching, advertising and accessing different information over a mobile social network.

Spatio-Temporal Modeling of Wireless Users Internet Access Patterns Using Self-Organizing Maps

Saeed Moghaddam, Ahmed Helmy

Computer and Information Science and Engineering (CISE) Department, University of Florida, Gainesville, FL

{saeed, helmy}@cise.ufl.edu

Abstract—User online behavior and interests will play a central role in future mobile networks. We introduce a systematic method for large-scale multi-dimensional analysis of online activity for thousands of mobile users across 79 buildings over a variety of web domains. We propose a modeling approach based on self-organizing maps (SOM) for discovering, organizing and visualizing different mobile users’ trends from billions of WLAN records. We find surprisingly that users’ trends based on domains and locations can be accurately modeled using a self-organizing map with clearly distinct characteristics. We also find many non-trivial correlations between different types of web domains and locations. Based on our analysis, we introduce a mixture model as an initial step towards realistic simulation of wireless network usage.

Keywords- self-organizing map; data-driven; trend; wireless; simulation.

I. INTRODUCTION

Wireless mobile networks are evolving and integrating with every aspect of our lives. Laptops, handhelds and smart phones are becoming ubiquitous, providing almost continuous Internet access. This creates a tight coupling between users and mobile networks, where various characteristics of user online behavior affect network performance significantly. Hence, there is a compelling need for analysis and modeling of mobile users' Internet access patterns. Such behavioral modeling will aid in the understanding of users' trends and the load distribution on the network, and thus inform the design of important classes of applications, including modeling and scenario generation for network simulations, network capacity planning, web caching and behavior-aware networking protocols [HDH08], to name a few. However, such behavioral analysis on extensive traces of Internet access is difficult, as it is large-scale, computationally costly and time-consuming. Such traces usually contain billions of records, and the analysis may even become infeasible when the dataset exceeds certain limits of size and complexity. Moreover, accommodating different aspects of users' behavioral patterns (e.g., mobility, website visitation, time, application) is not a trivial task. As a result, it is imperative to establish systematic and scalable methods for the analysis and modeling of massive multi-aspect/multi-dimensional datasets of mobile users’ online activities.

Much of the previous mobility or web usage modeling focused on individual behavior and a single aspect. While individual behavior is important, investigating group behaviors and trends is more challenging and involved. In this paper, we focus on group behavioral modeling, and study behavioral correlations based on groups of mobile users and their trends according to their website and location visitation patterns. We show how the modeling and analysis can be accomplished based on different aspects (i.e., web domains or locations) separately or all together in a single structure. While existing models (e.g., random, uniform) do not accommodate any of the discovered correlations, our approach provides a solid foundation to model multiple important aspects of users' behavioral patterns for the realistic design of future mobile networks.

On the other hand, the conventional approach to group data analysis, i.e., direct clustering of data items based on some similarity measure, is not applicable to massive multi-aspect analysis. The common clustering algorithms are either computationally intensive, e.g., hierarchical clustering [JD88], or very unstable, e.g., k-means [JD88], for huge amounts of data. Moreover, although the resulting clusters provide some insight into similarities between data items (and among features, if co-clustering [DG03] is used), they do not reveal intuitively how correlated they are. Therefore, they are not effective for discovering correlations when there are many data items, features and aspects.

Our approach to addressing the above limitations is to apply self-organizing maps (SOM) coupled with our proposed feature clustering technique for uni-aspect modeling, and our suggested extension on top of the SOM method for multi-aspect modeling. A self-organizing map is an artificial neural network which is trained using unsupervised learning to generate a discretized low-dimensional representation (a map) of the input space of the data samples. Unlike other artificial neural networks, self-organizing maps use a neighborhood function to preserve the topological properties of the input space. The topology-preserving mapping keeps the more similar data groups closer together in the final map, which makes SOM useful for providing low-dimensional views of high-dimensional data. These views can reveal the semantics behind major user trends and the correlation between different features.

In this study, we apply SOM in a novel way on a dataset provided by the processing of extensive netflow, DHCP and MAC trap traces for more than 22 thousand mobile users in a Wireless LAN spanning over 79 buildings and including over 700 APs, that we have collected. This dataset, including billions of records, represents by far the largest set of traces analyzed in any study of mobile networks to date.  Using this technique, we extract minor and major trends in mobile users’ website/location visitation patterns and important correlations between different web domains and locations. We show how to apply this technique methodically to the collected large-scale multi-dimensional dataset with minimal computational complexity to facilitate its meaningful analysis. Our method is systematic and can be generally applied to discover important features of Internet behavior from other similar traces. It can also be applied for any other aspects of Internet usage, e.g., time, application in addition to the web domains and locations.

We report three major findings in this paper. First, mobile users’ access patterns based on web domains and locations can be accurately modeled by a small set of neurons, which can be further clustered into a smaller number of major trends with clearly distinct characteristics. For example, one major trend represents Mac users who frequently visit ‘mac’ and ‘apple’ websites and also have a strong interest in ‘washingtonpost’ and ‘cnet’. Second, web domains/locations in similar categories tend to be modeled by an adjacent set of neurons. For example, most of the advertisement/marketing domains or fraternities are modeled together by neighboring neurons. Third, many nontrivial correlations exist between different kinds of web domains and locations. For example, we found that the Music Practice Center is highly correlated with the Health Science Telephone Vault, although they are located on two different campuses of USC. Considering these findings, we need a new model for mobile user behavioral patterns to accommodate future mobile applications, as we discuss in our applications section.

Our work has the following key contributions:

1. We propose an effective approach for multi-dimensional analysis of one of the largest sets of mobile network usage traces (billions of records) and show how a self-organizing map can be applied to model minor and major trends in mobile users’ access patterns based on web domains or locations.

2. We conduct domain-specific and location-based analysis of mobile users’ behaviors using the feature maps extracted from the SOM, and show how this method can be effectively applied to discover correlations among different domains/locations.

3. We suggest a feature clustering technique on top of the SOM as a quantitative way of discovering feature correlations.

4. We propose an extension on top of the SOM for multi-aspect modeling and analysis of users' behaviors.

5. We show how our modeling approach can be effectively applied to determine the parameters of a Gaussian Mixture Model which is used for data simulation.

The rest of the paper is organized as follows. In Section 2, we review the related work. In Section 3, we briefly address challenges associated with the collection and processing of large-scale wireless traces and then explain our modeling approach in detail. Sections 4 and 5 present our data analysis and data simulation techniques, and Section 6 provides our case study using campus traces. Section 7 discusses modeling and applications. Section 8 concludes.

II. RELATED WORK

The rapid growth of wireless communication technologies has led to widespread interest in analyzing traces to understand user behavior. The scope of analysis includes WLAN usage and its evolution across time [KE02], [HKA04], traffic flow statistics [MWYL04], user mobility [BC03], user association patterns [PSS05] and encounter patterns [HH06]. Some previous works [HH06] attempt to understand user behaviors empirically from data traces. The two main trace libraries for the networking communities can be found in the archives at [ML09] and [CR09]. None of the available traces provides large-scale netflow information coupled with DHCP and WLAN sessions, which is needed to map IP addresses to MAC addresses and detect locations. Therefore, to the best of our knowledge, our work is the first to address large-scale multi-dimensional modeling of wireless networks. We analyze wireless data around three orders of magnitude larger than any existing study, providing richer semantics, finer granularity and potentially more accurate models. In addition, our work includes novel data analysis techniques to address the challenges posed by this large-scale multi-dimensional data.

There are several notable examples of utilizing the data sets for context-specific study. Mobility modeling is a fundamentally important issue, and several works focus on using the observed user behavior characteristics to design realistic mobility models [HSPH09], [KKK06]. They have shown that the most widely used existing mobility models (mostly random mobility models, e.g., random waypoint, random walk; see [BH06] for a survey) fail to generate the realistic mobility characteristics observed in the traces. Realistic mobility modeling is essential for evaluating protocol performance. It has been shown that a user mobility preference matrix representation leads to meaningful user clustering [HDH07]. Several other works focus on classifying users based on their mobility periodicity [KK07], time-location information [EP06], or a combination of mobility statistics. The work on the TVC model [HSPH09] provides a realistic mobility model for protocol and service performance analysis. Our work is complementary to TVC and can extend it to incorporate dimensions of load, interest and website visitation preferences. In [MWYL04] it is shown that the performance of resource scheduling and TCP varies widely between data-driven and non-data-driven model analysis. Using multi-dimensional modeling, our method can develop new mobility-aware Internet-usage models, and utilize the realistic profiles to enhance the performance of networking protocols. Our new application of the self-organizing map technique may be extended to incorporate online activity, location and mobility, and provides user profiles that may be used in a myriad of networking applications.

One network application for multi-dimensional modeling is profile-based services. Profile-cast [HDH08] provides a one-to-many communication technique to send profile-aware messages to those who match a behavioral profile. Behavioral profiles in [HDH08] use location visitation preference and are not aware of online activity. Other previous works also rely on movement patterns. Our multi-dimensional modeling of mobile users, however, provides an enriched set of user attributes that relate to social behavior (e.g., interest, community as identified by web access, application) that has been largely ignored before.

III. MODELING APPROACH

Realistic modeling of large wireless networks requires three main phases to collect, process and model multi-dimensional large datasets with fine granularity. In the first phase, extensive datasets are collected using the network infrastructure, which may be augmented using online directories (e.g., buildings directory, maps) and web services (e.g., whois lookup service). Data processing is the second phase, in which multiple datasets are manipulated, integrated and aggregated to cross-correlate information obtained from different resources (e.g., IP and MAC addresses). The final phase is data modeling, which includes uni-aspect and multi-aspect modeling of users’ behaviors based on their web domain and location visitation patterns.

A. Data Collection

We collect different types of extensive traces via network switches (on the USC campus), including netflows, DHCP logs and wireless session logs. An IP flow is defined as a unidirectional sequence of packets with some common properties (e.g., source IP address) that pass through a network device (e.g., a router) which can be used for flow collection. Network flows are highly granular; flow records include the start and finish times (or duration), source and destination IP addresses, port numbers, protocol numbers, and flow sizes (in packets and bytes) (see Table 1). The source IP addresses can be used to identify user device MAC addresses using the DHCP log, and the destination IP addresses can be used to identify the websites accessed. The DHCP log contains the dynamic IP assignments to MAC addresses and includes the date and time of each event. This information is needed to get a consistent mapping of dynamically assigned IP addresses to the device MAC addresses. The wireless session log collected by each wireless access point (AP) includes the ‘start’ and ‘end’ events for device associations (when they visited or left that specific AP), which can be used to derive the location of users at any time.

Table 1 – Netflow Sample

Start Timestamp | Finish Timestamp | Source IP | Source Port | Dest IP | Dest Port | Protocol Num | ToS | Packet Count | Flow Size
0618.00:00:07.184 | 0618.00:00:07.184 | 128.125.253.143 | 53 | 207.151.245.121 | 64209 | 17 | 0 | 1 | 469
0618.00:00:07.184 | 0618.00:00:07.472 | 207.151.241.60 | 52759 | 74.125.19.17 | 80 | 6 | 0 | 4 | 1789
0618.00:00:07.188 | 0618.00:00:07.188 | 193.19.82.93 | 1676 | 207.151.238.90 | 43798 | 17 | 0 | 1 | 103
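The mapping from a flow's source IP address to a device MAC address through the DHCP log can be sketched as a time-indexed lookup. This is a minimal illustration under stated assumptions, not the processing used in this study: each DHCP event is assumed to be reduced to a (timestamp, IP, MAC) tuple, lease expiry is ignored, and the function names and sample values are hypothetical.

```python
from bisect import bisect_right
from collections import defaultdict

def build_lease_index(dhcp_events):
    """Group DHCP assignments by IP address, ordered by time.

    dhcp_events: iterable of (timestamp, ip, mac) tuples (hypothetical layout).
    """
    index = defaultdict(list)
    for ts, ip, mac in sorted(dhcp_events):
        index[ip].append((ts, mac))
    return index

def mac_for_flow(index, ip, flow_start):
    """Return the MAC that most recently received `ip` at or before `flow_start`.

    Returns None when no assignment precedes the flow. Lease expiry is not
    modeled here (simplifying assumption).
    """
    leases = index.get(ip, [])
    times = [ts for ts, _ in leases]
    pos = bisect_right(times, flow_start)
    return leases[pos - 1][1] if pos else None

# Hypothetical events: the same IP is reassigned to another device at t=50.
events = [(10, "10.0.0.5", "aa:aa"), (50, "10.0.0.5", "bb:bb"), (20, "10.0.0.6", "cc:cc")]
index = build_lease_index(events)
```

A flow starting at time 30 from 10.0.0.5 resolves to the first device, while a flow starting at time 50 or later resolves to the second; this is why the date and time of each DHCP event are needed for a consistent mapping.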

B. Data Processing

The variety and scale of the collected traces introduce one of the main challenges with respect to data processing. The size of the underlying data is very large and therefore, with a naïve approach, the required time for this task would be on the order of months. For example, the netflow dataset gathered from the USC campus includes around 2 billion flow records for each month in 2008, which amounts to 2.5 terabytes of data per year. To circumvent the problem, we first compress the data by substituting similar patterns with binary codes and creating mapping headers to be used in the analysis step. We then export the data into a database management system (MySQL) and design customized stored procedures for data integration (mapping source IPs to MAC addresses (user IDs) and destination IPs to domain names). In the last step, we aggregate the integrated data based on user ID, domain name, location and month, and calculate the total online time for each resulting record.
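The final aggregation step can be sketched in a few lines. This is only an illustration: the study performs this step with MySQL stored procedures, whereas the sketch below uses plain Python, assumes that online time is the sum of flow durations, and uses hypothetical field names and sample values.

```python
from collections import defaultdict

def aggregate_online_time(records):
    """Sum flow durations per (user, domain, location, month).

    records: iterable of dicts with keys 'user', 'domain', 'location',
    'month' and 'duration' (hypothetical field names; duration in seconds).
    """
    totals = defaultdict(float)
    for r in records:
        key = (r["user"], r["domain"], r["location"], r["month"])
        totals[key] += r["duration"]
    return dict(totals)

# Hypothetical integrated records (after IP-to-MAC and IP-to-domain mapping).
records = [
    {"user": "aa:aa", "domain": "apple", "location": "Library", "month": "2008-03", "duration": 1.5},
    {"user": "aa:aa", "domain": "apple", "location": "Library", "month": "2008-03", "duration": 2.5},
    {"user": "bb:bb", "domain": "cnet", "location": "Library", "month": "2008-03", "duration": 3.0},
]
totals = aggregate_online_time(records)
```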

C. Data Modeling

The data modeling phase includes two major parts. In the first part, we employ the self-organizing map to learn minor trends of users within the wireless society. The user trends may be learned based on website or location visitation preferences separately (using a common approach) or together (based on our proposed multi-aspect extension of the method). In either case, in the second part, we apply a clustering technique on the map nodes to discover major trends inside the community.

1) Trend Modeling

The SOM technique [K82] provides a powerful yet intuitively understandable tool for unsupervised learning and data visualization. The SOM is defined as a set of nodes which develop a mapping of high-dimensional input vectors (which may represent website or location visitation preferences) onto a discrete output space (the “map”) such that each region on the map represents an area of the input space. This mapping preserves the topology of the input space in a way that local similarity of input patterns is reflected by proximity on the map. Therefore, it can be effectively applied in capturing the properties of the input space of users’ behaviors and organizing their trends in an ordered fashion. In a self-organizing map, each node (or neuron, in neural network terms) is associated with a weight vector of the same dimension as the input data vectors and a position in the 2-D map space. The usual arrangement of nodes is in the form of a hexagonal or rectangular grid. SOM training, i.e., the iterative adjustment of the weight vectors to acquire a desired mapping, is performed by successive presentation of all input data, where each presentation leads to the adjustment of weights toward the presented data. The training is based on two principles:

a) Competitive learning: the weight vector most similar to a data vector is modified so that it is even more similar to it (the corresponding node is called Best Matching Unit or BMU). This way the map learns the position of the input data.

b) Cooperative learning: not only the most similar weight vector, but also its neighbors are moved towards the data vector. This way the map self-organizes.

The neighborhood function h regulates the weight changes based on the map distance between the BMU and the neuron being adapted. In the case of a Gaussian-shaped neighborhood function, the expression of h is given by:

$$h_{ij}(n) = \exp\left(-\frac{\mathrm{dist}_{map}(i,j)^{2}}{2\,r(n)^{2}}\right)$$

where dist_map(i,j) measures the distance on the map between two neurons and r(n) is a global parameter that controls the “width” of the neighborhood function. According to this expression, the amount of change is maximum for the BMU and decreases for nodes that are far from it. The value of r(n) decreases with the number of iterations; a relatively large radius during the initial iterations allows the map to quickly organize the neurons, while a smaller value toward the end produces localized changes, so that different parts of the map become sensitive to different input features. The learning rate of the map decreases monotonically with the number of iterations to ensure convergence.

In this way, each neuron can learn a minor trend that represents a set of similar input data vectors. This is one of the major advantages of SOMs with respect to clustering techniques. While a clustering technique attempts to partition the input space (e.g., users’ behaviors) by assigning each sample (e.g., a user) to a cluster, the SOM technique attempts to learn trends inside the input space from the samples. Note that each input data vector (e.g., a user) affects a set of neighboring neurons (trends), and therefore the input space is not distinctly partitioned by the neurons (unlike cluster assignment in conventional clustering techniques). This is much more consistent with natural human behaviors, which have no clear-cut distinctions.
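The two training principles and the Gaussian neighborhood can be sketched as a single sequential update step. This is a minimal illustration in plain Python, not the implementation used in this study: the function names are hypothetical, the neighborhood is assumed to take the common form exp(-d²/(2r²)) with d the map distance to the BMU, and the radius and learning rate are passed in rather than decayed over iterations.

```python
import math

def best_matching_unit(weights, x):
    """Index of the node whose weight vector is closest (squared Euclidean) to x."""
    return min(range(len(weights)),
               key=lambda i: sum((w - v) ** 2 for w, v in zip(weights[i], x)))

def som_update(weights, positions, x, radius, learning_rate):
    """One sequential SOM step: move the BMU and its map neighbors toward x.

    weights:   list of weight vectors, one per node (modified in place)
    positions: list of (row, col) map coordinates, one per node
    Returns the index of the BMU.
    """
    bmu = best_matching_unit(weights, x)          # competitive learning
    for j, w in enumerate(weights):               # cooperative learning
        d2 = sum((a - b) ** 2 for a, b in zip(positions[bmu], positions[j]))
        h = math.exp(-d2 / (2.0 * radius ** 2))   # assumed Gaussian neighborhood
        for k in range(len(w)):
            w[k] += learning_rate * h * (x[k] - w[k])
    return bmu

# Toy 1-by-3 map with 2-D weight vectors (hypothetical values).
positions = [(0, 0), (0, 1), (0, 2)]
weights = [[0.0, 0.0], [0.5, 0.5], [1.0, 1.0]]
bmu = som_update(weights, positions, [1.0, 1.0], radius=1.0, learning_rate=0.5)
```

In this toy run the third node is the BMU, its immediate neighbor moves noticeably toward the input, and the farthest node moves least. This shows how one input vector influences several neighboring trends rather than a single cluster.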

Map Creation - The side lengths of the map grid are determined based on the ratio of the two largest eigenvalues of the training data. To initialize the SOM, linear initialization along the two principal eigenvectors is attempted first; if the eigenvectors cannot be calculated, random initialization is used instead. After the initialization, the SOM is trained with normalized input data. The normalization of the input features is very important in determining what the map will look like. If the value ranges of some features are much larger than those of the others, those features will probably dominate the map organization completely and the resulting map will not be useful. The computational complexity of the SOM algorithm scales linearly with the number of data samples and quadratically with the number of map units.

2) Trend Clustering

One way to visualize the resulting map after the training phase is to create a U-matrix (unified distance matrix). The U-matrix shows the distance between the weight vector of each node and the weight vectors of its neighbors after the learning process. Fig. 1(a) shows an instance of a U-matrix with interpolated color shading. Small U-values (blue areas in the figure) indicate homogeneous neighborhoods and large ones (red areas) depict heterogeneous neighborhoods. As large U-values mean large distances between neighboring nodes, they can be interpreted as borders between clusters of neurons, i.e., trends. In order to find these borders (clusters), the k-means clustering algorithm can be applied. Because the k-means result depends on the initial choice of cluster centroids, the algorithm is run multiple times for a given k and then the best result is selected based on the sum of squared errors. Because the captured minor trends are already very well organized on the map, each resulting cluster maps onto a contiguous area of neurons, representing a major trend (Fig. 1(b)). Clustering trends instead of the original data reduces the computational time required by any kind of clustering technique, as the size of the input is decreased. This is very important when dealing with massive amounts of data. In addition, as the weight vectors are local averages of the data, the clustering result is less sensitive to random variations in the input data.
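The restart strategy for k-means over the trained node weights can be sketched as follows. This is a minimal plain-Python illustration with hypothetical function names and toy weight vectors; it uses Lloyd's algorithm with random initial centroids drawn from the data and keeps the run with the lowest sum of squared errors.

```python
import random

def sq_dist(a, b):
    """Squared Euclidean distance between two vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def kmeans(points, k, rng, iters=50):
    """Plain Lloyd's k-means; returns (labels, centroids, sse)."""
    centroids = [list(p) for p in rng.sample(points, k)]
    labels = [0] * len(points)
    for _ in range(iters):
        labels = [min(range(k), key=lambda c: sq_dist(p, centroids[c])) for p in points]
        for c in range(k):
            members = [p for p, l in zip(points, labels) if l == c]
            if members:  # keep the old centroid if a cluster becomes empty
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    sse = sum(sq_dist(p, centroids[l]) for p, l in zip(points, labels))
    return labels, centroids, sse

def best_of_restarts(points, k, restarts=20, seed=0):
    """Run k-means several times and keep the run with the smallest SSE."""
    rng = random.Random(seed)
    return min((kmeans(points, k, rng) for _ in range(restarts)), key=lambda run: run[2])

# Toy node weight vectors forming two well-separated groups (hypothetical values).
node_weights = [[0.0, 0.1], [0.1, 0.0], [0.05, 0.05], [1.0, 0.9], [0.9, 1.0], [0.95, 0.95]]
labels, centroids, sse = best_of_restarts(node_weights, k=2)
```

On this toy input a single run can settle in a poor local optimum when both initial centroids fall in the same group, which is the reason for selecting the best of several runs.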

3) Multi-Aspect Modeling

The SOM technique in its original form is suitable for uni-aspect modeling of trends, i.e., based on either the web domain or the location aspect. In multi-aspect modeling, however, the goal is to model trends based on all aspects concurrently rather than separately. While uni-aspect modeling is good for intra-aspect analysis, multi-aspect modeling provides an opportunity for inter-aspect analysis. In multi-aspect modeling, instead of one general usage vector per user, a set of localized vectors exists (i.e., a set of website usage vectors for different locations, or vice versa). Therefore, the regular SOM learning method is not applicable. Our proposed approach to accommodate this situation is to consider a usage matrix for each user (representing website usage at different locations) and a weight matrix for each map unit, and then train the map. This way, each map unit can capture a multi-aspect trend. This approach is extensible and can be applied to more than two aspects.
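One way to picture the matrix-based extension is to treat each map unit's weight as a domain-by-location matrix and to compare it with a user's usage matrix entry by entry. The sketch below is an assumption-laden illustration, not the method as implemented in this study: the squared Frobenius distance for BMU selection, the `neighborhood` callback and all names and values are hypothetical.

```python
def frobenius_sq(a, b):
    """Squared Frobenius distance between two equally sized matrices."""
    return sum((x - y) ** 2 for ra, rb in zip(a, b) for x, y in zip(ra, rb))

def matrix_som_step(weight_matrices, neighborhood, usage, learning_rate):
    """One update of a matrix-valued SOM (hypothetical formulation).

    weight_matrices: one domain-by-location matrix per map unit (modified in place)
    neighborhood:    function (bmu, j) -> strength in [0, 1]
    usage:           one user's domain-by-location usage matrix
    Returns the index of the best matching unit.
    """
    bmu = min(range(len(weight_matrices)),
              key=lambda i: frobenius_sq(weight_matrices[i], usage))
    for j, w in enumerate(weight_matrices):
        h = neighborhood(bmu, j)
        for r, row in enumerate(w):
            for c in range(len(row)):
                row[c] += learning_rate * h * (usage[r][c] - row[c])
    return bmu

# Two map units, two domains (rows) by two locations (columns); toy values.
units = [[[0.0, 0.0], [0.0, 0.0]], [[1.0, 1.0], [1.0, 1.0]]]
usage = [[1.0, 0.8], [0.9, 1.0]]
bmu = matrix_som_step(units, lambda b, j: 1.0 if b == j else 0.5, usage, learning_rate=0.5)
```

Under this formulation, adding a third aspect would only change the shape of the per-unit weight array and the distance computation.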

IV. DATA ANALYSIS

We conduct uni-aspect and multi-aspect analysis considering the web domain and location visitation aspects. For each analysis, we propose two approaches, one qualitative and one quantitative. The qualitative approach relies on visual inspection of the extracted feature maps and is useful for discovering correlations, anti-correlations and anomalies among the features. The quantitative approach is based on our proposed feature clustering technique, which applies a mathematical correlation function.

1) Feature Map Analysis

The feature maps are extracted from the SOM and show what values the weight vectors of the map units have for each feature. In other words, a feature map shows the projection of the SOM onto the corresponding feature (which can be a web domain, a location, or a web domain at a specific location in the multi-aspect case). The value of each unit for the feature is represented by a color. Fig. 2 shows a group of resulting feature maps in our study. By visual inspection of the feature maps, we can find many interesting facts about the trends and features, as follows:

a) Comparing the feature maps with the clustered SOM reveals the semantics behind each cluster of trends representing a major trend. For a cluster area, the features whose maps look red in the same area disclose the main trends captured by the cluster.

b) Similar feature maps reveal correlations between the corresponding features. The correlation can be partial or complete. If the maps are highly similar, the correlation is nearly complete; if they are only partially similar, the correlation among the features is also partial. In our case, correlation between a set of features means that they have the same visitation pattern.

c) Anomalies in a set of feature maps uncover anomalies regarding the corresponding features. In our case, for example, if all but one of the feature maps for a category of web domains (e.g., marketing domains) look similar, the different one indicates an anomaly.

d) Feature maps which look inverted (i.e., red areas in one are blue in the other) disclose anti-correlations. Again, anti-correlations can be nearly complete or partial.

2) Feature Clustering

Taking the projection of all weight vectors (or weight matrices in the multi-aspect case) on each feature, we propose to construct a description vector for the corresponding feature, referred to as its feature vector. Applying hierarchical clustering on the feature vectors, we can cluster features based on their correlation using the following correlation distance function:

$$d(v_i, v_j) = 1 - \mathrm{corr}(v_i, v_j)$$

where v_i and v_j are feature vectors. This procedure can also be interpreted as a quantitative way of comparing the feature maps.
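Feature clustering over the feature vectors can be sketched as follows. This is a minimal illustration, assuming that the correlation distance is one minus the Pearson correlation and that average linkage is used; neither choice is confirmed by the text, and the function names and toy feature vectors are hypothetical.

```python
import math

def correlation_distance(u, v):
    """1 - Pearson correlation of u and v (assumed form of the distance).

    Assumes neither vector is constant, otherwise the denominator is zero.
    """
    mu, mv = sum(u) / len(u), sum(v) / len(v)
    du, dv = [a - mu for a in u], [b - mv for b in v]
    num = sum(a * b for a, b in zip(du, dv))
    den = math.sqrt(sum(a * a for a in du) * sum(b * b for b in dv))
    return 1.0 - num / den

def cluster_features(vectors, n_clusters):
    """Average-linkage agglomerative clustering; returns lists of vector indices."""
    clusters = [[i] for i in range(len(vectors))]

    def linkage(a, b):
        total = sum(correlation_distance(vectors[i], vectors[j]) for i in a for j in b)
        return total / (len(a) * len(b))

    while len(clusters) > n_clusters:
        pairs = [(p, q) for p in range(len(clusters)) for q in range(p + 1, len(clusters))]
        p, q = min(pairs, key=lambda pq: linkage(clusters[pq[0]], clusters[pq[1]]))
        merged = clusters.pop(q)      # q > p, so index p stays valid
        clusters[p].extend(merged)
    return clusters

# Toy feature vectors: each entry is one map unit's weight for that feature.
feature_vectors = [
    [0.9, 0.8, 0.1, 0.0],   # hypothetical feature A
    [0.8, 0.9, 0.0, 0.1],   # hypothetical feature B, similar pattern to A
    [0.1, 0.0, 0.9, 0.8],   # hypothetical feature C, inverted pattern
]
clusters = cluster_features(feature_vectors, n_clusters=2)
```

Features with similar maps end up in the same cluster, while a feature whose map looks inverted has a distance close to 2 and stays apart.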

V. DATA SIMULATION

Once the SOM model has been established, in addition to its use in analysis, we utilize it to generate accurate simulation data. The trends captured by the SOM can be considered as the generative components of a Gaussian Mixture Model (GMM) [MP00]. Mixture models are a type of density model comprising a number of component functions, usually Gaussian. A mixture of K Gaussians is:

$$p(x) = \sum_{k=1}^{K} \alpha_k \, G(x, \mu_k, \Sigma_k)$$

where α_k is the mixing parameter satisfying ∑_k α_k = 1 and G(x, μ_k, Σ_k) is the probability density function (pdf) of the k-th Gaussian component. The Gaussian mixture model contains the following adjustable parameters: α_k, μ_k and Σ_k. A simulated data point can be generated by first choosing one of these multivariate Gaussians (with probability α_k) and then sampling based on the parameters of the chosen distribution (μ_k and Σ_k). However, these parameters must first be estimated appropriately before data can be simulated. [AHV99] proposes a technique on top of the SOM and shows that it outperforms others for the estimation of GMM parameters. In this technique, each map node is considered as the center of a Gaussian kernel, the parameters of which are estimated from the data assigned to the node and its neighbors. The parameter α_k is also determined using a weighted count of the data assigned to the node and its neighbors. The advantage of this technique lies in the topological ordering of the SOM and the available neighborhood function, which can be applied to obtain a weighted contribution of data from the neighboring nodes, as well as from the node itself, for the estimation of the parameters.
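The two-stage sampling procedure (choose a component, then sample from it) can be sketched as follows. This is a simplified illustration rather than the simulation code of this study: it assumes diagonal covariance matrices so that each dimension can be sampled independently, and the function name and parameter values are hypothetical.

```python
import random

def sample_gmm(alphas, means, stds, n, seed=0):
    """Draw n points from a Gaussian mixture with diagonal covariances.

    alphas: mixing parameters, one per component (should sum to 1)
    means:  mean vector of each component
    stds:   per-dimension standard deviations of each component
            (diagonal covariance is a simplifying assumption)
    """
    rng = random.Random(seed)
    samples = []
    for _ in range(n):
        # Step 1: choose component k with probability alpha_k.
        k = rng.choices(range(len(alphas)), weights=alphas)[0]
        # Step 2: sample from the chosen Gaussian.
        samples.append([rng.gauss(m, s) for m, s in zip(means[k], stds[k])])
    return samples

# Two well-separated toy components with mixing parameters 0.7 and 0.3.
samples = sample_gmm(alphas=[0.7, 0.3],
                     means=[[0.0, 0.0], [10.0, 10.0]],
                     stds=[[0.5, 0.5], [0.5, 0.5]],
                     n=2000)
share_first = sum(1 for s in samples if s[0] < 5.0) / len(samples)
```

With well-separated components, the share of samples near the first mean approaches its mixing parameter as n grows. Full covariance matrices would require a multivariate normal sampler instead of per-dimension draws.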

VI. CASE STUDY AND EXPERIMENTAL RESULTS

We conduct a campus-wide case study on data we collected from the University of Southern California (USC) in 2008, using the approach and techniques explained in the previous sections.

A. Data Processing Details

The netflow and DHCP traces from the USC campus (over 700 access points covering 79 buildings) were processed to identify mobile user IDs (MAC addresses) and destinations, or ‘peers’ (usually web servers), using IP address prefixes. Over a billion records (for March 2008) were considered. The IP prefixes (first 24 bits) were then filtered using a threshold of 100,000 flows; we use 24-bit prefixes because popular websites usually use an IP range instead of a single IP address. For the filtered IP prefixes, the domains were resolved. Among the resolvable domains, the 100 most active ones were identified, and all users interacting with those domains were considered for the analysis.
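The prefix filtering step can be sketched as follows. This is a simplified in-memory illustration, not the procedure used on the actual traces; the function names are hypothetical, and treating the threshold as "at least this many flows" is an assumption.

```python
import ipaddress
from collections import Counter


def prefix24(ip):
    """Return the /24 network (first 24 bits) containing an IPv4 address."""
    return str(ipaddress.ip_network(f"{ip}/24", strict=False))


def popular_prefixes(dest_ips, min_flows):
    """Count flows per /24 destination prefix and keep the busy ones.

    dest_ips:  one destination IP string per flow record
    min_flows: threshold (assumption: a prefix is kept when its flow
               count is at least this value)
    """
    counts = Counter(prefix24(ip) for ip in dest_ips)
    return {prefix: c for prefix, c in counts.items() if c >= min_flows}
```

For example, `popular_prefixes(ips, 100_000)` would correspond to the threshold used in our case study, with `ips` holding one destination address per flow.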

B. Modeling Results

For domain-specific and location-based modeling, two separate matrices were created, associating the user IDs with web domains and the user IDs with locations, using the corresponding total online time (in minutes). Our analysis covered 22,816 users, 100 domains and 79 buildings. The data in each matrix is scaled by row-normalizing the logarithm of the online time values. The two input matrices were used to train two separate SOMs of 32 by 24 nodes. Fig. 1 (a) shows the U-matrix for domains and Fig. 1 (b) shows the corresponding SOM clustered into 20 clusters. For multi-aspect modeling, we chose the 40 most active domains and the 20 most active locations and created a 3D matrix associating the user IDs, web domains and locations, again using the corresponding online time, and trained a 32 by 24 matrix-based SOM (see Appendix A for the location and multi-aspect maps).
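The scaling step can be sketched as follows. The exact variant is an assumption here: the sketch uses log(1 + t) so that zero online time stays at zero, and normalizes each row to sum to one. The function name is hypothetical.

```python
import math


def scale_rows(matrix):
    """Log-scale online times, then normalize each user row.

    matrix: list of rows, one per user, holding online time per feature
    Assumptions: log(1 + t) is used so zero entries remain zero, and
    each row is divided by its sum; all-zero rows are left unchanged.
    """
    scaled = []
    for row in matrix:
        logged = [math.log1p(t) for t in row]
        total = sum(logged)
        if total > 0:
            scaled.append([x / total for x in logged])
        else:
            scaled.append(logged)
    return scaled
```

The logarithm dampens the effect of a few very heavy users or domains, and the row normalization makes users with different total online time comparable in terms of their relative preferences.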


(a) U-matrix   (b) Clustered SOM

Fig. 1. U-matrix and clustered SOM maps for WLAN Internet usage in USC campus (for domains).

C. Analysis Results

1) Domain Specific Analysis

The feature maps were created for all the domains. Fig. 2 and Fig. 3 show several examples of the resulting maps for different types of web domains. Inspection of the feature maps reveals many interesting facts. The following are some examples based on the feature maps presented here.

a) Fig. 2 shows feature maps for advertisement and marketing domains. All of these maps (except the rightmost one) show a red area in almost the same neighborhood (bottom-right corner). This shows that the major trend captured by the cluster depicted in orange in the same area of Fig. 1(b) is toward this kind of web domain.


fastclick   tribalfusion   coremetrics   doubleclick

Fig. 2. Feature maps for advertisement and marketing domains

(important notice: maps need to be viewed in color)

b) The high similarity between the feature maps in Fig. 2 shows that the corresponding advertisement and marketing domains are highly correlated. We can also observe high correlations between the following groups of domains in Fig. 3: i) security-related domains, i.e., ‘mcafee’ and ‘hackerwatch’; ii) ‘itunes’ and ‘netflix’ (online media); iii) ‘mac’, ‘apple’, ‘washingtonpost’ and ‘cnet’ (showing a strong trend of Mac users toward ‘cnet’ and ‘washingtonpost’); iv) Windows-related domains, i.e., ‘microsoft’, ‘windowsmedia’ and ‘microsoftoffice2007’. In the figure, we can see that ‘itunes’ is on the one hand partly correlated with ‘netflix’ and on the other hand partly correlated with ‘mac’ and ‘apple’. This may indicate that i) Mac users predominantly use iTunes for online media and ii) Netflix customers shop in the iTunes store too.

c) Different patterns of maps for ‘doubleclick’ among advertisement and marketing domains show an example of anomalies within a category of web domains. These anomalies might disclose different advertisement and marketing approaches taken by ‘doubleclick’. 

d) Fig. 3 reveals anti-correlation between Mac-related and Windows-related domains. As can be noticed, the bright (red) area for ‘apple’ and ‘mac’ is almost dark (blue) for ‘windowsmedia’, ‘microsoftoffice2007’ and ‘microsoft’. We can also find anti-correlation between security-related domains (i.e., ‘mcafee’ and ‘hackerwatch’) and ‘mac’ and ‘apple’, but partial correlation between them and Windows-related domains.

mcafee   hackerwatch   netflix   itunes

apple   mac   washingtonpost   cnet

microsoft   windowsmedia   microsoftoffice2007

Fig. 3. Feature maps for various types of domains

We also applied our proposed feature clustering technique on top of the SOM for web domains and created 20 clusters. Table 2 shows some of the resulting clusters. As can be seen in the table, the two discussed categories of Apple and Microsoft correlated domains are clustered into two distinct clusters (Clusters A and B). 

2) Location-Based Analysis

Similar to the domain-specific analysis, we can find the semantics behind each major trend for location visitations (see Appendix A). Inspection of the feature maps for the locations reveals many interesting correlations too. Fig. 4 shows high correlations between social and professional fraternities. As can be seen, the fraternities in the first row are highly correlated. We can also observe high correlation between ATO and ARC. The feature map for PGD (Playground Duplex) shows that both groups are partially correlated with the playground duplex too.

Similar to the domain-specific analysis, the discovered correlations among locations are not just between buildings of the same type. Fig. 5 shows four pairs of highly correlated buildings which are not in the same category. As can be seen in the figure, the Music Practice Center (PIC) is highly correlated with the Health Science Telephone Vault (HSV). The interesting point about these two buildings is that they are located on two different campuses of USC and so are relatively far from each other. However, this is not the case for the Women’s Association (YWC), which is next to the Hall Building (HSH) and probably uses the hall for its gatherings very frequently. We can also see that the residents of the housing complex TRH frequently go to the Healthcare Consultation Center (HCT). Also, fraternity PKT and sorority KAT are highly correlated, which may indicate that many of their members are in a relationship.

Table 2 – Feature clustering result on web domains

Cluster   Domains
A         apple, mac, cnet, washingtonpost, itunes, earthlink
B         microsoft, windowsmedia, microsoftoffice2007, mcafee, hackerwatch, quiettouch
C         google, mozilla, nih
D         live, hotmail, net, hamachi
E         veoh, secureserver
F         comcast, fastwebnet
G         torrentbox, rr
H         smartbro, aster, fastres, opendns

We also employed feature clustering on the location SOM and created 20 clusters. Fig. 6 shows the clustered heatmap of the pair-wise correlation matrix for all the buildings. The darker blocks along the main diagonal show that buildings within each cluster are highly correlated with each other but not with the rest. To analyze the clusters, we studied all the buildings and, based on their actual context, categorized them into 10 categories: housing, auditorium, (outdoor) activity, sorority, fraternity, school, health, music, cinema and service. These categories are shown to the left of each abbreviation in the figure. As can be seen, many of the buildings in the same category are clustered together. For example, many of the fraternities and all of the sororities are placed in cluster 1 (cluster IDs are shown on the right side of the heatmap). We can also observe that 4 buildings in the “activity” category and 7 in the “health” category are clustered into clusters 5 and 8, respectively (the “activity” category includes buildings with different activity contexts, including sports, religion, social and shopping). We can also see that all of the correlated buildings discussed in Fig. 4 and Fig. 5 are also clustered into the same clusters.

3) Multi-Aspect Analysis

Fig. 7 shows some examples of highly correlated domains at specific locations. The maps in the first row reveal a non-trivial correlation between visitation of ‘yahoo’ at the ANH housing complex and ‘live’ at the KAT sorority. We can also observe partial correlation between these patterns and visitation of ‘mozilla’ and ‘google’ at the ATO fraternity. The second row shows two other examples of multi-aspect correlations: i) visitation of ‘youtube’ at LUC (cinema) and ‘live’ at ASC (Communication & Journalism school); ii) visitation of ‘usc’ at ACO (sorority) and ‘yahoo’ at PKF (fraternity).

One issue in multi-aspect analysis is that inspecting the feature maps for all combinations of aspects (in our case 800 combinations of 40 domains and 20 locations) is rather difficult. This was one of our motivations for designing the feature clustering technique. By employing this technique, we can easily cluster all the maps and then use visual inspection of the feature maps for detailed analysis. Fig. 8 shows the top-left quarter of the resulting clustered heatmap for multi-aspect analysis (80 clusters in total) (see Appendix B for the complete map).

AKP (business)   DSP   SPD (engineering)   CPF

ARC (architecture)   ATO   PGD (Playground Duplex)

Fig. 4. Feature maps for social & professional fraternities

PIC (Music Practice)   HSV (Telephone Vault)   YWC (Women's Association)   HSH (Hall)

HCT (Healthcare Consultation)   TRH (Housing)   PKT (Fraternity)   KAT (Sorority)

Fig. 5. Feature maps for various types of locations

Fig. 6. Clustered heatmap for all the buildings as features. Rows and columns represent building IDs and lines indicate cluster borders.  Numbers at the right show cluster IDs and descriptions at the left include building abbreviation and category for each row. Darker colors show more correlation.

yahoo@ANH (Housing)   live@KAT (Sorority)   mozilla@ATO (Fraternity)   google@ATO (Fraternity)

youtube@LUC (Cinema)   live@ASC (School)   usc@ACO (Sorority)   yahoo@PKF (Fraternity)

Fig. 7. Multi-aspect feature maps for domain-location


Fig. 8. Clustered heatmap for multi-aspect feature analysis (zoomed on top-left quarter of the map)

D. Simulation Parameter Estimation

We applied the technique explained in Section V to estimate the GMM parameters required for data simulation. Fig. 9 shows the estimated probability density (α parameter) for the nodes of the domain SOM as the generative components of the GMM. This parameter, along with the estimated distribution parameters for each component, is used for data simulation.


Fig. 9. Probability density estimation for domain SOM nodes as generative components of Gaussian mixture model

VII. DISCUSSION: MODELING AND APPLICATIONS

The systematic realistic mining method proposed in this paper can be applied with any set of wireless data to reveal significant facts that may be used in several important applications in mobile networking research. Here, we briefly address three such major applications:

1- Modeling and simulating spatio-temporal web usage for mobile users: Network simulations are imperative for the design and evaluation of mobile networks (e.g., ns-2). To provide realistic input to the simulations, realistic models of users’ behaviors are required, along with scenarios of events and dynamics of mobility, traffic and Internet access. While earlier work has focused on mobility and traffic modeling, we provide the first work on modeling of mobile Internet usage. The parameters of online activity along with trend characteristics and correlations in the simulation can be easily derived from our model in this paper. None of the existing models captures such characteristics across website access. Recreating network usage more accurately will result in significantly different mobile node density, load, and similarity distributions from those created by today’s models. Developing and releasing the code for the mobile Internet access model is part of our future work. Similarly, we plan to conduct an extensive study on the spatio-temporal parameters for mobile traffic modeling in future.

2- Interest-based protocols and services: A new class of protocols and services center around user-interest and similarity, including profile-cast, participatory sensing, trust establishment, location-based services, crowd sourcing, alert notification and targeted announcements and ads. So far, mobility patterns (e.g., in profile-cast) have been used to infer interest. Website access patterns can remarkably enhance the accuracy of interest inference and provide much needed granularity for these protocols and services. The interest models developed based on our analysis can help both the informed design of such efficient protocols and the realistic evaluation thereof.

3- Network planning and web caching: Load distribution on the network is imperative for network capacity planning and ongoing configuration and management, and is clearly related to web access patterns and their characteristics. Also, the caching of web objects for mobile users can only be efficient if informed by the history of access patterns. These applications for mobile networks are becoming more compelling with the significant growth in usage of smartphones, iPhones, iPads, and the like.

VIII. CONCLUSION

(a) Locations U-matrix   (b) Locations Clustered SOM

(c) Multi-aspect U-matrix   (d) Multi-aspect Clustered SOM

Fig. 10. U-matrix and clustered SOM for WLAN Internet access in USC campus.

This study is motivated by the need for developing realistic models and efficient protocols for the future mobile Internet. We provided a systematic method to analyze the largest wireless trace to date, with billions of records of Internet usage from a campus network, including thousands of users. Novel modeling and analysis were conducted utilizing self-organizing maps and our proposed extensions to the technique for multi-aspect trend modeling based on web domains and locations. We have shown that mobile Internet usage can be modeled with an organized map of trends which can be effectively used to find correlations and to simulate data. The details of our study enable the parameterization of new and realistic models for mobile Internet usage with applications in several areas of networking, including mobile web caching, simulation and evaluation of protocols, interest-aware services and network planning, to name a few. We hope for our modeling method to provide an example for large-scale data-driven modeling of mobile networks in the future. With more measurements from mobile and sensor networks becoming available, we expect our method to facilitate analysis of many other large datasets in future studies.

ACKNOWLEDGEMENT

This work was partially funded by NSF award number 0832043.

APPENDIX

A. Locations and Multi-Aspect SOM

Fig. 10 shows U-matrix and clustered SOM for location based and multi-aspect analysis.

B. Multi-Aspect Clustered Heatmap

Fig. 11 shows the complete clustered heatmap for multi-aspect feature analysis on 40 domains and 20 locations. The black area depicts clusters with very small number of members.

REFERENCES

[AHV99] Alhoniemi, E., Himberg, J., Vesanto, J., “Probabilistic measures for responses of Self-Organizing Maps”, Computational Intelligence Methods and Applications, 1999, Rochester, N.Y., USA, 286-289.

[BC03] Balazinska, M., and Castro, P., "Characterizing Mobility and Network Usage in a Corporate Wireless Local-Area Network," ACM MobiSys, May 2003.

[BH06] Bai, F., and Helmy, A., "A Survey of Mobility Modeling and Analysis in Wireless Adhoc Networks", Book chapter, Springer, Oct 06, ISBN: 978-0-387-25483-8.

[CR09]  http://crawdad.cs.dartmouth.edu/index.php.

[DG03] Dhillon, I. S. and Guan, Y., “Information theoretic clustering of sparse co-occurrence data”, IEEE Intl Conf on Data Mining, Washington, DC, USA, 517, 2003.

[EP06] Eagle, N.,  and Pentland, A., "Reality mining: sensing complex social systems," in Journal of Personal and Ubiquitous Computing, vol.10, no. 4, May 2006.

[HDH07] Hsu, W., Dutta, D., and Helmy, A., "Mining Behavioral Groups in Large Wireless LANs", ACM MOBICOM, pp. 338-341, September 2007.

[HDH08] Hsu, W., Dutta, D., and Helmy, A., "Profile-Cast: Behavior-Aware Mobile Networking", IEEE WCNC, pp. 3033-3038, March 2008.

[HH06] Hsu, W., and Helmy, A., "On Nodal Encounter Patterns in Wireless LAN Traces", IEEE WiNMee, 2006.

[HKA04] Henderson, T., Kotz, D., and Abyzov, I., "The Changing Usage of a Mature Campus-wide Wireless Network," ACM MobiCom, September 2004.

[HSPH09] Hsu, W., Spyropoulos, T., Psounis, K., and Helmy, A., "TVC: Modeling Spatial and Temporal Dependencies of User Mobility in Wireless Mobile Networks", IEEE/ACM Transactions on Networking, pp. 1564-1577, October 2009.

[JD88] Jain, A. K., and Dubes, R. C., Algorithms for Clustering Data, Prentice-Hall, Englewood Cliffs, NJ, 1988.

[K82] Kohonen, T., "Self-organized formation of topologically correct feature maps", Biological Cybernetics, 43(1):59–69, 1982.

[KE02] Kotz, D., Essien, K.,  "Analysis of a Campus-wide Wireless Network," ACM MobiCom, September, 2002.

[KK07] Kim, M.,  and Kotz, D., "Periodic properties of user mobility and access-point popularity," Journal of Personal and Ubiquitous Computing, 11(6), Aug. 2007.

[KKK06]  Kim, M., Kotz, D., and Kim, S., "Extracting a mobility model from real user traces," IEEE INFOCOM, Apr. 2006.

[ML09] http://nile.cise.ufl.edu/MobiLib/

[MP00] McLachlan, G.J. and Peel, D. Finite Mixture Models, Wiley (2000).

[MWYL04] Meng, X., Wong, S., Yuan, Y., and Lu, S., "Characterizing Flows in Large Wireless Data Networks," ACM MobiCom, September 2004.

[PSS05] Papadopouli, M.,  Shen, H., and Spanakis, M.,  "Characterizing the Duration and Association Patterns of Wireless Access in a Campus," 11th European Wireless Conference 2005, Nicosia, Cyprus, April 10-13, 2005.

Fig. 11. Clustered heatmap for multi-aspect feature analysis on 40 domains and 20 locations (complete map). The black area depicts clusters with very small number of members.




Data-driven Co-clustering Model of Internet Usage in Large Mobile Societies

Saeed Moghaddam, Ahmed Helmy, Sanjay Ranka, Manas Somaiya

Computer and Information Science and Engineering (CISE) Department, University of Florida, Gainesville, FL

{saeed, helmy, ranka, mhs}@cise.ufl.edu

Abstract

Design and simulation of future mobile networks will center around human interests and behavior. We propose a design paradigm for mobile networks driven by realistic models of users' on-line behavior, based on mining of billions of wireless-LAN records. We introduce a systematic method for large-scale multi-dimensional co-clustering of web activity for thousands of mobile users at 79 locations. We find surprisingly that users can be consistently modeled using ten clusters with disjoint profiles. Access patterns from multiple locations show differential user behavior. This is the first study to obtain such detailed results for mobile Internet usage.

Categories and Subject Descriptors

C. Computer Systems Organization; C.2 Computer-Communication Networks; C.2.1 Network Architecture and Design; Subjects: Wireless communication; Nouns: Internet.
I. Computing Methodologies; I.6 Simulation and Modeling; I.6.5 Model Development; Subjects: Modeling methodologies.

General Terms

Measurement, Experimentation, Design, Performance

Keywords

Data-driven, Co-clustering, Wireless networks, Internet usage.

1. INTRODUCTION

Wireless mobile networks are growing significantly in every aspect of our lives. Laptops, handhelds and smartphones are becoming ubiquitous, providing (almost) continuous Internet access and placing ever-increasing demand and load on supporting networks. This provides new challenges and opportunities for the modeling and design of future mobile networks. By developing realistic behavioral models for mobile users’ Internet access and website visitation patterns, novel behavior-aware network protocols can be developed and parameterized. Such behavioral models are essential and are established based on a deep understanding of the Internet usage of mobile users, obtained through large-scale analysis of extensive wireless network measurements. We refer to this approach as the data-driven modeling and design paradigm.


MSWIM’10, October 17–21, 2010, Bodrum, Turkey.
Copyright 2010 ACM 978-1-4503-0274-6/10/10...$10.00.

The eventual goal of the data-driven paradigm is to utilize analysis of the users’ behavior to drive the design of efficient context-aware protocols and services. This is in sharp contrast to the general-purpose design paradigm conventionally used in the wired Internet and in much of the mobile (e.g., ad hoc) network design of the past decades. The general-purpose paradigm focuses on the design elements first, and then on evaluation using generic (usually random, uniform) models for traffic, Internet usage and mobility. Often, these models deviate dramatically from reality, which leads to sub-optimal performance or outright failure of the protocols during deployment. By contrast, the data-driven approach, shown in Figure 1, starts with the analysis and realistic modeling of the target context and users, which then drive the design process.

This paper focuses on the fundamental first step in this paradigm: the data-driven realistic modeling of user Internet behavior. In particular, we study models of correlated mobile user website access patterns, using clustering. Such behavioral clustering will aid in the understanding of the spatio-temporal load distribution on the network and the similarity of users’ interests, and thus inform the design of important classes of applications, including modeling and scenario generation for network simulations, network capacity planning, web caching and interest-aware networking protocols [1, 2], to name a few.

Figure 1. The data-driven modeling and design paradigm

To obtain the behavioral model, we process extensive netflow, DHCP and MAC trap traces, which we have collected, for thousands of mobile users in a WLAN spanning 79 buildings and including over 700 APs. This represents by far the largest set of traces processed in any study of mobile networks to date. We provide a systematic method to process billions of records to integrate and aggregate the multi-dimensional data. The scale of the data poses a great challenge to employing advanced data mining techniques. We propose to use an information-theoretic co-clustering technique in a novel way to extract important relations between clusters of mobile users and clusters of accessed websites. We show that this method can provide accurate and efficient clustering with minimal information loss. The traces are then broken down by building category, including educational, housing, health, cinema, etc. A location-based clustering is then carried out based on website visitation pattern similarity. Our method is systematic and can be generally applied to discover important spatio-temporal features of Internet behavior from other similar traces.

We report two major findings in this paper: 1- Mobile users cluster with respect to website visitation patterns into a small set of clusters with clearly distinct profiles. For example, Mac users consistently visit ‘washingtonpost’, ‘cnet’ and ‘apple’ websites and not ‘microsoft’, whereas PC users visit ‘yahoo’, ‘google’ and ‘microsoft’ websites, but not ‘apple’. 2- Locations in similar categories tend to cluster together, with a few exceptions, in terms of mobile website access. We establish the stability of these findings for three month-long samples. These findings provide the basis for mobile user behavioral models both qualitatively and quantitatively, as we discuss in our applications section.

Our work has the following key contributions:

1. We collect and process the largest set of mobile network usage traces (billions of records) and provide practical techniques for integrating and aggregating the data.

2. We propose an effective approach for multi-dimensional analysis of the dataset, and show how information theoretic co-clustering can be applied to create and correlate clusters of users and web domains to build group-specific profiles.

3. We conduct context-specific (location-based) analysis of mobile users’ behavior, using two different methods (hierarchical clustering and graph clique detection) to effectively discover groups of locations with similar contexts.

4. We obtain consistent results for clustering of mobile users and locations behavioral similarity that provide the basis for future models of mobile Internet usage.

The rest of the paper is organized as follows. In Section 2, we review the related work.  In Section 3, we address challenges associated with the collection, processing and analysis of large-scale wireless traces. Section 4 provides our case study using campus traces and co-clustering to develop realistic models of mobile societies. Section 5 discusses modeling and applications. Section 6 concludes.

2. RELATED WORK

The rapid adoption of wireless communication technologies and devices has led to a widespread interest in analyzing the traces to understand user behavior. The scope of analysis includes WLAN usage and its evolution across time [3-5], user mobility [6-8], traffic flow statistics [9], user association patterns [10] and encounter patterns [11, 12]. Some previous works [6, 11] explore the space of understanding realistic user behaviors empirically from data traces. The two main trace libraries for the networking communities can be found in the archives at [13] and [14]. None of the available traces provides large-scale netflow information coupled with DHCP and WLAN sessions to be able to map IP addresses to MAC addresses to AP to location and eventually to a context (e.g., history department). Therefore, (to the best of our knowledge) our work represents the first one to address large-scale multi-dimensional modeling of wireless and mobile societies.   We analyze wireless mobile data around three orders of magnitude above any existing or ongoing study, providing finer granularity, richer semantics and potentially more accurate models. Our work also includes novel data processing techniques to address the challenges provided by this large-scale multi-dimensional data.

There are several prominent examples of utilizing the data sets for context-specific study. Mobility modeling is a fundamentally important issue, and several works focus on using the observed user behavior characteristics to design realistic and practical mobility models [15-18]. They have shown that the most widely used existing mobility models (mostly random mobility models, e.g., random walk, random waypoint; see [19] for a survey) fail to generate the realistic mobility characteristics observed in the traces. Realistic mobility modeling is essential for protocol performance [20]. It has been shown that a user mobility preference matrix representation leads to meaningful user clustering [21]. Several other works focus on classifying users based on their mobility periodicity [22], time-location information [23, 24], or a combination of mobility statistics [25]. The work on the TVC model [15] provides a data-driven mobility model for protocol and service performance analysis. Our work is complementary to TVC and can extend TVC dramatically to incorporate dimensions of load, interest and website visitation preferences. In addition, the netflow traces are over three orders of magnitude larger than the WLAN traces, and the techniques used for their analysis and clustering are quite different. In [9] it was shown that the performance of resource scheduling [26] and TCP varies widely between trace-driven analysis and non-trace-driven model analysis. Using multi-dimensional modeling, our methods can develop new mobility-aware Internet-usage models, and utilize the realistic profiles to enhance the performance of networking protocols. Our new application of co-clustering techniques incorporates web activity, location and mobility, and provides user profiles that may be used in a myriad of networking applications.

One network application for multi-dimensional modeling is profile-based services. Profile-cast [1, 2] provides a new one-to-many communication paradigm targeted at behavioral groups. In the profile-cast paradigm, profile-aware messages are sent to those who match a behavioral profile. Behavioral profiles in [1, 2] use location visitation preference and are not aware of Internet activity. Other previous works also rely on movement patterns. Our multi-dimensional modeling of mobile users, however, provides an enriched set of user attributes that relate to social behavior (e.g., interest, community as identified by web access, application, etc.) that have been largely ignored before.

3. MODELING APPROACH

Data-driven modeling of large mobile societies requires three main phases to collect, process and analyze multi-dimensional large datasets with fine granularity (Figure 2). In the first phase, extensive datasets are collected using the network infrastructure (or the mobile devices), plus augmenting information from online directories (e.g., buildings directory, maps) and the web services (e.g., whois lookup service). Data processing is the second phase to cross-correlate acquired information from different resources (e.g., access points, IP and MAC addresses), in which multiple datasets are manipulated, integrated and aggregated. The final phase is data analysis which includes global and location-based (context-specific) study of human behaviors based on their website access preferences and also the stability analysis of the findings for each case.

3.1 Data Collection

For the campus-wide modeling of wireless users, we collect different types of traces via network switches, including netflows, DHCP logs and wireless AP session logs (MAC traps). An IP flow is defined as a unidirectional sequence of packets with some common properties (e.g., source and destination IP addresses and port numbers) that pass through a network device (e.g., a router). This device can be used for flow collection. The collected data provides fine-grained metering for detailed usage analysis. Network flows are highly granular; flow records include the start and finish times (or duration), source and destination IP addresses, port numbers, protocol numbers, and flow sizes (in packets and bytes) (see Table 1). The destination IP address can be used to identify the websites accessed, while the port and protocol numbers can identify the application used. The wireless session log is collected by each wireless access point (AP) or switch port (i.e., an aggregate of APs in a building). The trace includes the ‘start’ and ‘end’ events for device associations (when a device visited or left that specific AP), the device’s MAC address, the date and time of those events, and the AP (or switch) IP and port numbers. From the above, we can derive the association history (i.e., the location and time of user association) for all MAC addresses. The DHCP log contains the dynamic IP assignments to MAC addresses: the listed IP is given to the MAC address at the indicated date and time.

Figure 2. Phases of modeling: collection, processing, and analysis.

3.2 Data Processing

Data processing includes three steps of data manipulation, data integration and data aggregation to cross correlate the collected data before data analysis.

3.2.1  Data Manipulation

The variety and scale of the collected traces make data manipulation one of the main challenges. The underlying data is very large, so with a naïve approach each manipulation would take on the order of a month, and tens of manipulations are needed. For example, the netflow dataset gathered from the USC campus includes around 2 billion flow records for each month in 2008, which amounts to 2.5 terabytes of data per year. Thus, appropriate methods for data manipulation are needed. Our approach to mitigating the problem is to first compress the data by substituting similar patterns with binary codes and creating mapping headers to be used in future manipulation; then export the data into a database management system (MySQL); and finally design customized stored procedures that manipulate the data in a reasonable time.

Table 1. Netflow sample

Start Timestamp | Finish Timestamp | Source IP | Source Port | Dest IP | Dest Port | Protocol Num | ToS | Packet Count | Flow Size

3.2.2  Data Integration

The second requirement of multi-dimensional modeling is data integration. Data from different sources are not gathered in the same format and therefore a semantic link is required to be created in between them. For example, in our case study, users are represented by MAC addresses in wireless session logs and by IP addresses in netflow traces. However, when the data scale for one of the traces (in this case netflows) is very large, the cost of such integration using regular SQL commands increases dramatically. Thus, we designed customized stored procedures for this purpose.
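The link between the two user representations can be illustrated with a small in-memory sketch. This is not the authors' stored-procedure implementation; it only shows the lookup logic, under the assumption that an IP address belongs to the MAC address that most recently received it in the DHCP log. The function names and the `(timestamp, ip, mac)` record layout are hypothetical.

```python
from bisect import bisect_right
from collections import defaultdict


def build_lease_index(dhcp_log):
    """Group DHCP assignments by IP and sort each group by assignment time."""
    index = defaultdict(list)
    for ts, ip, mac in dhcp_log:
        index[ip].append((ts, mac))
    for leases in index.values():
        leases.sort()
    return index


def mac_for_flow(index, ip, flow_start):
    """Return the MAC that most recently received `ip` at or before flow_start.

    Assumption: a lease stays valid until the same IP is assigned again.
    Returns None when no earlier assignment is known.
    """
    leases = index.get(ip, [])
    times = [ts for ts, _ in leases]
    pos = bisect_right(times, flow_start)
    return leases[pos - 1][1] if pos else None


# Toy DHCP log: (timestamp, ip, mac); values are made up for illustration.
dhcp_log = [
    (100, "10.0.0.5", "aa:aa"),
    (500, "10.0.0.5", "bb:bb"),
    (200, "10.0.0.9", "cc:cc"),
]
index = build_lease_index(dhcp_log)
print(mac_for_flow(index, "10.0.0.5", 300))
```

A binary search per flow keeps the lookup logarithmic in the number of leases per IP, which matters when the flow trace is much larger than the DHCP log.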

3.2.3  Data Aggregation

Since the output of the integration process includes billions of records, we cannot directly feed the result to the analysis phase. Running almost any data mining method on such a large dataset would take years to complete. Therefore, we need an intermediate aggregation process for building design-specific views of the dataset. We can aggregate the records based on one field or a set of fields, e.g., time, user, location, domain name and application. The choice of an appropriate aggregation scheme depends on the final design and modeling goals. If we are interested in studying usage patterns for different domains at different locations without considering single users or the type of application, an aggregation on domain name, location and time for the number of bytes, packets or flows will be the best choice. If the goal is to study the time users spend at different websites in different months, which is the case in our case study, we need to aggregate based on user ID, domain name and month.
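As a minimal sketch of the second aggregation scheme (user, domain, month), the following groups integrated records and sums their durations. The record layout `(user, domain, month, duration)` and the function name are assumptions made for illustration; the original work performed this step inside the database.

```python
from collections import defaultdict


def aggregate_online_time(records):
    """Sum flow durations per (user, domain, month) key."""
    totals = defaultdict(float)
    for user, domain, month, duration in records:
        totals[(user, domain, month)] += duration
    return dict(totals)


# Toy integrated records; durations are in arbitrary time units.
records = [
    ("u1", "google", "2008-03", 30),
    ("u1", "google", "2008-03", 15),
    ("u1", "apple", "2008-03", 5),
    ("u2", "google", "2008-04", 20),
]
totals = aggregate_online_time(records)
print(totals[("u1", "google", "2008-03")])
```

Changing the key tuple (for example to domain, location and time) yields the other aggregation views described above.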

3.3 Data Analysis

The data analysis phase is performed at three levels. The first level is to create a global model of dynamics within the network. This model is needed to provide a big picture of the dataset from the desired point of view. The next level is to build and analyze location and context-specific models, e.g., website access models in different types of buildings. The third level is to analyze the stability of learned models and get them revised if needed to get sufficient stability required for a solid model-based design.

3.3.1  Global Analysis and Co-clustering 

The main goal of global analysis is to provide a big picture of dynamics within the network. For this purpose, a very well known approach is to cluster entities (e.g., users, websites) with similar characteristics. However, a major challenge in modeling of multi-dimensional datasets is the fact that ordinary one-sided clustering algorithms like hierarchical clustering or k-means can only cluster data along different dimensions separately [27], i.e., either we get clusters of websites or clusters of users in our case. The proposed approach to resolve this problem is to apply co-clustering techniques, which cluster the input dataset along multiple dimensions simultaneously. In this way, we can correlate different dimensions in a unique model.

In our campus-wide case study, we first investigated bipartite graph co-clustering [28]. This algorithm uses a graph formulation coupled with a spectral heuristic (using eigenvectors) to co-cluster the 2-dimensional input data. However, it associates each row cluster with one column cluster, a restriction that we found inappropriate for our input dataset given the variety of users’ trends. Therefore, we instead chose information theoretic co-clustering [29] for simultaneous clustering of users and domains to obtain a global model of the mobile society. We feed the users’ online activity matrix, which represents the time spent by users at different websites, into the algorithm. The theoretical formulation of this co-clustering technique treats the (normalized) non-negative contingency table as a joint probability distribution of two discrete random variables, whose values are given in the rows and columns, and poses the co-clustering problem as an optimization problem in information theory. In this technique, co-clustering is performed by defining mappings from rows to row clusters and from columns to column clusters. These mappings produce clustered random variables. The optimal co-clustering is the one that leads to maximum mutual information between the clustered random variables, i.e., the one that minimizes the loss between the mutual information of the original random variables and the mutual information of the clustered random variables. This algorithm monotonically increases the preserved mutual information and optimizes the loss function by intertwining both row and column clustering. Row clustering is performed by calculating the closeness of each row distribution (in relative entropy) to the row cluster prototypes. Column clustering is performed similarly. This iterative process converges to a local minimum.
This algorithm differs from one-sided clustering in that the row cluster prototypes incorporate column clustering information, and vice versa. The algorithm never increases the loss, and so, the quality of co-clustering improves gradually. It also ameliorates the problems of sparsity and high dimensionality. Iteratively, the method performs an adaptive dimensionality reduction and estimates fewer parameters than one-dimensional clustering approaches, resulting in a regularized clustering. In addition, the algorithm is efficient. The computational complexity of the algorithm is given by O(N · τ · (k + l)) where k and l are the desired number of row and column clusters, N is the number of non-zeros in the input joint distribution and τ is the number of iterations; empirically 20 iterations are shown to suffice.
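The objective that the algorithm optimizes can be made concrete with a short sketch. The code below only evaluates the loss in mutual information for a given pair of row and column mappings; it does not implement the iterative algorithm of [29]. The function names and the toy joint distribution are illustrative assumptions.

```python
from math import log2


def mutual_information(p):
    """I(X;Y) in bits for a joint distribution given as a list of rows."""
    row = [sum(r) for r in p]
    col = [sum(c) for c in zip(*p)]
    return sum(
        pxy * log2(pxy / (row[i] * col[j]))
        for i, r in enumerate(p)
        for j, pxy in enumerate(r)
        if pxy > 0
    )


def cluster_joint(p, row_map, col_map, k, l):
    """Joint distribution of the clustered variables (k row, l column clusters)."""
    q = [[0.0] * l for _ in range(k)]
    for i, r in enumerate(p):
        for j, pxy in enumerate(r):
            q[row_map[i]][col_map[j]] += pxy
    return q


def information_loss(p, row_map, col_map, k, l):
    """Mutual information lost by replacing rows/columns with their clusters."""
    return mutual_information(p) - mutual_information(
        cluster_joint(p, row_map, col_map, k, l)
    )


# Toy 4x4 joint distribution with two obvious user/domain blocks.
e = 1 / 8
p = [
    [e, e, 0, 0],
    [e, e, 0, 0],
    [0, 0, e, e],
    [0, 0, e, e],
]
good = information_loss(p, [0, 0, 1, 1], [0, 0, 1, 1], 2, 2)
bad = information_loss(p, [0, 1, 0, 1], [0, 0, 1, 1], 2, 2)
print(good, bad)
```

In this toy case the mapping that follows the block structure loses no information, while the mapping that mixes the two user blocks loses the full 1 bit.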

The number of values along each dimension (represented by the number of mobile users and the number of Internet sites) is very large. Given the size limitations of existing co-clustering algorithms and their implementations, we filter the dataset, limiting ourselves to the most active websites, and aggregate destination IP addresses and websites based on domains.

After executing the algorithm on the filtered dataset and getting the clusters of users and domains, we can create an association level matrix indicating the association level in between clusters along different dimensions.  For each pair of a user cluster and a domain cluster, the association level is calculated by summing up the amount of all joint probabilities between them.
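The association level computation described above can be sketched as follows, assuming the joint probabilities are stored sparsely as a dictionary keyed by (user, domain) and that cluster labels are integers. All names and values here are hypothetical.

```python
def association_levels(joint, user_cluster, domain_cluster, k, l):
    """Sum joint probabilities over every (user cluster, domain cluster) pair."""
    levels = [[0.0] * l for _ in range(k)]
    for (user, domain), prob in joint.items():
        levels[user_cluster[user]][domain_cluster[domain]] += prob
    return levels


# Toy joint distribution over (user, domain) pairs; probabilities sum to 1.
joint = {
    ("u1", "google"): 0.4,
    ("u1", "apple"): 0.1,
    ("u2", "apple"): 0.3,
    ("u3", "google"): 0.2,
}
user_cluster = {"u1": 0, "u2": 1, "u3": 0}
domain_cluster = {"google": 0, "apple": 1}
levels = association_levels(joint, user_cluster, domain_cluster, 2, 2)
print(levels)
```

Each row of the resulting matrix describes one user cluster in terms of its association with every domain cluster.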

3.3.2  Location-based Analysis

The main goal of location-based analysis is to discover different contextual clusters within a large mobile society.  For this purpose, we first define a uniform way for describing the context of a location. Then, we formulate a comparison method between different locations to find levels of context similarities. Finally, we devise an appropriate method to detect contextually similar locations.   

As for the first step, the global acquired clusters of users and domains can be employed to provide a uniform way for context description.  For each location, an association level matrix between groups of users and domains can be created using a uniform ordering of the acquired user and domain clusters. We employ the location specific association level matrix as a context descriptor of the location.

In the second step, we provide a method for comparing the context descriptors of different locations. For this purpose, we treat the corresponding association level matrices as vectors of all their values and employ cosine distance function. Using this method, we can create a dissimilarity matrix for different locations based on their context descriptors.
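A minimal sketch of this comparison step follows. It flattens each association level matrix into a vector and computes the cosine distance (one minus the cosine similarity) for every pair of locations. The location names and descriptor values are made up, and the sketch assumes no descriptor is all zeros.

```python
from math import sqrt


def flatten(matrix):
    """Treat a matrix as a single vector of all its values."""
    return [v for row in matrix for v in row]


def cosine_distance(a, b):
    """1 - cosine similarity of two matrices viewed as vectors."""
    va, vb = flatten(a), flatten(b)
    dot = sum(x * y for x, y in zip(va, vb))
    norm_a = sqrt(sum(x * x for x in va))
    norm_b = sqrt(sum(y * y for y in vb))
    return 1.0 - dot / (norm_a * norm_b)


def dissimilarity_matrix(descriptors):
    """Pairwise cosine distances between location context descriptors."""
    names = sorted(descriptors)
    matrix = [
        [cosine_distance(descriptors[a], descriptors[b]) for b in names]
        for a in names
    ]
    return names, matrix


# Hypothetical 2x2 context descriptors for three locations.
descriptors = {
    "dorm": [[2.0, 0.0], [0.0, 2.0]],
    "gym": [[0.0, 1.0], [1.0, 0.0]],
    "library": [[1.0, 0.0], [0.0, 1.0]],
}
names, dissim = dissimilarity_matrix(descriptors)
print(names)
```

Because cosine distance ignores vector magnitude, two locations with the same usage profile but different traffic volumes (here "dorm" and "library") come out as identical in context.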

For the final step, we propose two different methods for finding groups of contextually similar locations. The first technique is to use hierarchical clustering to create clusters of locations. The second method is to map the dissimilarity matrix to an undirected graph as follows. Considering a node for each location, we draw an edge between two different nodes if their dissimilarity is less than a threshold. Then, we find cliques within the graph to discover groups of locations with similar contexts.
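The second method can be sketched as below. The text does not say which clique-finding algorithm was used; this sketch uses the basic Bron-Kerbosch algorithm (without pivoting) as a stand-in, and the location names and dissimilarity values are invented for illustration.

```python
def threshold_graph(names, dissim, threshold):
    """Connect two locations when their dissimilarity is below the threshold."""
    adj = {n: set() for n in names}
    for i, a in enumerate(names):
        for j in range(i + 1, len(names)):
            if dissim[i][j] < threshold:
                b = names[j]
                adj[a].add(b)
                adj[b].add(a)
    return adj


def maximal_cliques(adj):
    """Enumerate maximal cliques with the basic Bron-Kerbosch recursion."""
    cliques = []

    def expand(r, p, x):
        if not p and not x:
            cliques.append(sorted(r))
            return
        for v in list(p):
            expand(r | {v}, p & adj[v], x & adj[v])
            p = p - {v}
            x = x | {v}

    expand(set(), set(adj), set())
    return sorted(cliques)


# Toy dissimilarity matrix: A, B and C are mutually similar; D is an outlier.
names = ["A", "B", "C", "D"]
dissim = [
    [0.00, 0.02, 0.03, 0.50],
    [0.02, 0.00, 0.05, 0.50],
    [0.03, 0.05, 0.00, 0.50],
    [0.50, 0.50, 0.50, 0.00],
]
adj = threshold_graph(names, dissim, 0.06)
print(maximal_cliques(adj))
```

An isolated location shows up as a single-member clique, which mirrors the uni-member clusters that hierarchical clustering can produce.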

3.3.3  Stability Analysis

An important goal in data modeling is to discover stable models that can accurately describe not only the current state but also its time evolution and dynamics (i.e., its history and future). Such models are valuable in the sense that they can explain major trends during a long period of time (e.g., a semester) and thus can be effectively used for realistic and durable model based designs.

To assess the relative stability of trends captured by global and location based models, we investigate whether the discovered clusters of users, domains and locations are sufficient to describe the history and the future of the mobile Internet access. Our method for measuring this forward, backward stability of the discovered clusters over time is to: 1- take all the interactions for the same sets of users, domains and locations during the previous and the next periods of the analyzed period; then, 2- for each of the periods, recreate the global association matrix and the location dissimilarity matrix using the same acquired clusters; and finally, 3- calculate the distance in between corresponding matrices for the analyzed period and the previous/next one. In this calculation, matrices are again treated as vectors of all their values and their distance is determined by cosine distance function.   

4. CASE STUDY AND EXPERIMENTAL RESULTS

In our case study, we conduct a campus-wide analysis on data we collected from the University of Southern California (USC) in 2008 based on the approach and techniques explained in the previous section.

4.1 Data Processing Details

The netflow and DHCP traces from the USC campus (over 700 access points) were processed to identify mobile user IDs using MAC addresses, and destinations, or ‘peers’ (usually web servers) using IP address prefixes. Over a billion records (for the month of March 2008) were considered initially, then the February and April traces (over two billion records) were considered for the stability analysis. The IP prefixes (first 24 bits) were filtered using a threshold of 100,000 flows (the reason for using 24 bits filter is the fact that popular websites usually use an IP range instead of a single IP address).  For the filtered IP prefixes, their domains were resolved.  Among the resolvable domains, the top 100 active ones were identified and all the users interacting with those domains (e.g., Google, Facebook, etc.) were considered for the analysis.
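The prefix filtering step can be sketched as follows. The helper names are hypothetical, the example addresses are arbitrary, and treating the threshold as inclusive (at least `min_flows` flows) is an assumption, since the text does not say whether the 100,000-flow threshold is strict.

```python
from collections import Counter


def prefix24(ip):
    """Return the first 24 bits (three octets) of a dotted IPv4 address."""
    return ".".join(ip.split(".")[:3])


def popular_prefixes(dest_ips, min_flows):
    """Count flows per /24 prefix and keep prefixes with at least min_flows."""
    counts = Counter(prefix24(ip) for ip in dest_ips)
    return {prefix: n for prefix, n in counts.items() if n >= min_flows}


# One destination IP per flow; a tiny threshold stands in for 100,000.
dest_ips = [
    "64.233.167.99",
    "64.233.167.104",
    "64.233.167.147",
    "17.254.3.183",
]
kept = popular_prefixes(dest_ips, 3)
print(kept)
```

Grouping by prefix before thresholding lets a site served from several addresses in the same range pass the filter even when no single address does.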

4.2 Global Analysis Results

A matrix was created associating the user IDs and domains (i.e., websites) using the corresponding total online time (per minute). For our analysis, we had 22,816 users and 100 domains. The data is scaled using row-normalization of the log of the online time values. This is the input data for our modeling problem, to which we applied information theoretic co-clustering. In this case study, we discuss results for ‘ten’ clusters (i.e., with ‘ten’ as input to the algorithm for the number of output user and domain clusters). Using this method of co-clustering, we produce two collections, one of domain clusters and one of user clusters, which are used to determine an association level between each pair of user and domain clusters. Figure 3 shows the result of applying this method to the scaled data and Figure 4 depicts the association level matrix between the resulting clusters. Each row in Figure 4 identifies a group of users in terms of their association level with different groups of domains. In other words, each row provides a representation of users’ interest in different groups of domains.
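The scaling step is described only briefly, so the following is one possible reading rather than the exact procedure: take log(1 + t) of each online time so that zero entries stay zero, then divide each user's row by its sum. Both the choice of log(1 + t) and the sum-to-one normalization are assumptions.

```python
from math import log1p


def scale_rows(matrix):
    """Apply log(1 + t) to each entry, then normalize each row to sum to 1."""
    scaled = []
    for row in matrix:
        logged = [log1p(t) for t in row]
        total = sum(logged)
        # Rows with no activity are left as all zeros.
        scaled.append([v / total if total else 0.0 for v in logged])
    return scaled


# Toy online-time matrix: rows are users, columns are domains.
online_time = [
    [9, 9, 0],
    [0, 0, 0],
]
scaled = scale_rows(online_time)
print(scaled)
```

The log damps the influence of a few very long sessions, and the row normalization makes users with different overall activity levels comparable.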

As shown, the algorithm is able to group users with similar access patterns into clusters. In a way, users within the same cluster may be characterized by a similar set of favorite wireless online activities. At a high level, we observe four general classes of user clusters with: (a) narrow access (cluster 1): users access two (or fewer) clusters of domains, in this case clusters I and J, which include ‘usc’ and ‘infoave’ (telecom and webhosting); (b) narrow spread access (clusters 2, 3): most user access time is spread over only 3 or 4 domain clusters; (c) medium spread access (clusters 4-7): most user access time is spread over 5 to 8 domain clusters; and (d) wide spread access (clusters 8-10): noticeable user access in all domains. A deeper look into the clusters reveals some interesting trends. Clusters 2 and 3 both contain narrow spread users, but with clearly distinct sets of interests. Cluster 2 shows users who mostly use the Internet just for search or email via ‘yahoo’ and ‘google’, and visit ‘microsoft’, probably to get software updates, and thus are likely Microsoft/PC users. Cluster 3, by contrast, shows users who frequently go to ‘apple’ and ‘mac’ sites but rarely go to ‘microsoft’, and thus are likely Mac users. Note that these users are commonly interested in ‘washingtonpost’ and ‘cnet’ but not interested in ‘facebook’ or ‘yahoo’ at all. Cluster 7 also depicts heavy Mac users. Again, as can be seen in the figure, these users rarely go to ‘microsoft’ but are interested in ‘mac’, ‘apple’, ‘washingtonpost’ and ‘cnet’, and also visit ‘facebook’, ‘yahoo’ and some other websites frequently. Table 2 shows some other domains which are clustered together.

Figure 4. Association level of resulting user and domain clusters by applying information theoretic co-clustering (March 2008)

Table 2. Major related websites clustered together

Cluster | Domains
A | myspace - imeem (social media service) - digg (social news) - typepad (blogging) - ebayrtm - ebayimg - wsj (business news) - bodoglife (online gambling) - ucsb - harward - westlaw
B | cnn - nytimes (new york times)
C | mcafee - hackerwatch - live - hotmail
D | ebay - bankofamerica
E | apple - mac - washingtonpost - cnet
F | facebook - youtube - social media msn - msnbcsports
G | netflix - itunes - orb (media cast) - tmcs (social city search) - virtualearth (online map)
H | google - yahoo - microsoft - windowsmedia - microsoftoffice2007

Stability Analysis: To assess the relative stability of trends, we process the records from Feb. 2008 and Apr. 2008 and recreate the global association matrices for them using the same clusters and ordering of users and domains in Figure 3. The results are depicted in Figure 5 and indicate that the trends hold to a large extent; the association level matrix for February is 92.25 percent and for April is 89.18 percent similar to that of March; plus, the association level matrices of February and April are 98.51 percent similar. This indeed indicates the stability of the results.

a) Co-clustered matrix for February

b) Association level matrix for February

c) Co-clustered matrix for April

d) Association level matrix for April

Figure 5. Using the same column and row ordering of users and domains obtained via co-clustering of the Mar. 2008 trace, these graphs are constructed using Feb. 2008 and Apr. 2008 measurements. The trends are relatively stable from one month to another especially for the narrow spread and wide spread clusters.

4.3 Location-based Analysis Results

In the second phase, we model locations based on their acquired context descriptors and analyze the results based on their actual context. For this purpose, for any interaction between a user and a domain, we first identify the switch port that handles the connection using the WLAN traces. Then, we associate the interaction with its location among 84 buildings across the campus using a mapping table between the switch ports and buildings. Next, all active buildings (handling at least one interaction) in March 2008 are selected (79 buildings) and their context descriptors are created as explained in Section 3.3. Finally, we create a dissimilarity matrix for all the selected buildings using the context descriptors and the metric explained in the same section. This matrix is used by two different techniques, based on hierarchical clustering and graph clique detection, to discover groups of contextually similar buildings. Figure 6 shows the result of applying hierarchical clustering to create 10 clusters of locations. In the figure, all clusters can be identified by their green line borders and all distances (dissimilarities) can be read from the z-axis. For each cluster, Figure 7 shows the average dissimilarity between each corresponding building and all the others. To analyze the resulting clusters, we studied all the buildings and, based on their actual context, categorized them into 10 categories: housing, auditorium, (outdoor) activity, sorority, fraternity, school, health, music, cinema and service (see Table 3). In Figure 6, the category of each building is visualized by the color assigned in Table 3.

As can be seen in Figure 6, most of the buildings in the same category are clustered together into one or two clusters. For example, sororities are all clustered together, and fraternities form two major clusters and two uni-member clusters. The interesting point about the fraternities is that those two uni-member clusters include professional fraternities and the other two contain social ones. We can also see that all auditoria are in the same cluster, as are the cinema-related buildings. Regarding the “activity” category, which includes buildings with different activity contexts including sports, religion, social and shopping, we notice that 6 out of 8 are in the same cluster while 3 of those 6 are sports related. In addition, it can be observed that housing buildings form two major clusters and there is only one separated building in another cluster. The study of that building reveals that it is the only housing complex that also includes a plaza and a bookstore. Health-related buildings are also assigned to two main clusters. However, buildings in the school and service categories are scattered among clusters because they include different types of schools (social work, journalism, humanities, letters and arts, law and leadership) and different kinds of centers (facilities management, financial, communication and computing services).

Stability Analysis: As before, to assess the relative stability of trends, we process the records from Feb. 2008 and Apr. 2008 and recreate the dissimilarity matrices for them using the same acquired clusters for March. The results are depicted in Figure 8 and indicate that the trends hold to a large extent; the dissimilarity matrix for February is 92.72 percent and for April is 95.12 percent similar to that of March; plus, the dissimilarity matrices of February and April are 93.35 percent similar. This indeed indicates the stability of the results.

Figure 7. Average dissimilarity between each building in a cluster and all the other buildings (March 2008). Y-axis shows the dissimilarity between 0 and 0.7.

The second method detects cliques within the corresponding graph for dissimilarity matrix using the threshold of 0.06 as explained in Section 3.3. Figure 9 shows the resulting graph layout for the data from March. As can be inferred from the histogram for the dissimilarity matrix (Figure 10), the resulting graph includes around 10 percent of all possible edges using the mentioned threshold. As can be seen in the graph, a clear relationship exists between identified cliques and the actual categories of buildings.

Table 3. Building categories. The assigned colors to the categories are used in Figure 6 and Figure 9.

5. DISCUSSION: MODELING AND APPLICATIONS

The systematic data-driven mining method proposed in this paper can be used with any set of wireless data to discover significant features that may be used as a similarity metric for mobile users. The method, and our findings from the global and location-based analyses can be used in several important applications in mobile networking research. Here, we specifically address (albeit briefly for lack of space) three such major applications:

1- Modeling and simulating spatio-temporal web usage for mobile users: Network simulations are essential for the design and evaluation of mobile networks (e.g., ns-2). To provide realistic input to the simulations, realistic models of node behavior are needed, along with scenarios of events and dynamics of mobility, traffic and Internet access. While earlier work has focused on mobility and traffic modeling, we provide the first work on modeling of mobile Internet website access. The spatio-temporal parameters of online activity (in terms of time of access, duration and location context), along with correlation and clustering between nodes in the simulation can be easily derived from our analysis in this paper. None of the existing models captures such spatio-temporal correlation across website access, nor does it capture correlations between nodes in that aspect. Recreating network usage more accurately will result in significantly different node density, load, and similarity distributions from those created by today’s models.

One model that can benefit directly from our analysis is a mobile Internet usage (website access) model, with inputs including number of user-website clusters, distribution of user and website cluster sizes along with access pattern characterization (i.e., wide or narrow spread). Similarly, for location categories and clusters of buildings, the spatial distribution of website access patterns can borrow from our findings in the location-based analysis. Developing and releasing the code for the mobile Internet access model is part of our future work. Similarly, we plan to conduct an extensive study on the spatio-temporal parameters for mobile traffic modeling in the future.

a) February b) April

Figure 8. Clusters of buildings for Feb. and Apr. 2008

2- Interest-based protocols and services: A new class of protocols and services center around user-interest and similarity, including profile-cast, trust establishment [30], participatory sensing [31], crowd sourcing, location-based services, alert notification and targeted announcements and ads. So far, mobility patterns (e.g., in profile-cast) have been used to infer interest. Website access patterns can significantly enhance the accuracy of interest inference and provide much needed granularity for these protocols and services. The interest models developed based on our analysis can aid both the informed design of such efficient protocols and the realistic evaluation thereof.

3- Network planning and web caching: Load distribution on the network is essential for network capacity planning and ongoing configuration and management, and is certainly related to web access patterns and their clustering. Also, the caching of web objects for mobile users can only be efficient if informed by the history of access patterns. These applications for mobile networks are becoming more compelling, especially with the significant increase in the usage of smartphones, iPhones, iPads, and the like.

Figure 9. Graph representation of dissimilarity matrix using the threshold of 0.06 for March 2008.  (See Table 3 for the mapping between colors and categories)

Figure 10. Histogram for the dissimilarity matrix. X-axis shows the dissimilarity between 0 and 1.

6. CONCLUSION

This study is motivated by the need for a paradigm shift that is data-driven to develop realistic models and efficient protocols for the future mobile Internet. We provided a systematic method to process and analyze the largest mobile trace to date, with billions of records of Internet usage from a campus network, including thousands of users and dozens of buildings. Novel analysis was conducted utilizing advanced data mining using efficient co-clustering, at the global and location-based levels. We have shown that mobile Internet usage can be modeled with a strikingly small number of clusters of distinct web access profiles. Similarly, building categories show very distinct Internet usage patterns and are often clustered together. These trends were found to be highly stable over time.

The details of our study enable the parameterization of new and realistic models for mobile Internet usage with applications in several areas of networking, including simulation and evaluation of protocols, mobile web caching, network planning and interest-aware services, to name a few. We hope for our method and analysis to provide an example for large-scale data-driven modeling of mobile networks in the future. With more measurements from mobile and sensor networks becoming available, we expect our method to facilitate analysis of many other large datasets in future studies.

7. REFERENCES

[1]   Hsu, W., Dutta, D. and Helmy, A. Profile-cast: Behavior-aware mobile networking. SIGMOBILE Mobile Computing and Communications Rev., 12, 1 (Jan 2008), 52-54.

[2]   Hsu, W., Dutta, D. and Helmy, A. CSI: A Paradigm for Behavior-oriented Profile-cast Services in Mobile Networks. IEEE/ACM Transactions on Networking, to appear.

[3]   Tang, D. and Baker, M. Analysis of a local-area wireless network. In Proceedings of the ACM MobiCom 2000 (Boston, Massachusetts, United States, Aug, 2000). ACM.

[4]   Kotz, D. and Essien, K. Analysis of a campus-wide wireless network. Wirel. Netw., 11, 1-2 (Jan 2005), 115-133.

[5]   Henderson, T., Kotz, D. and Abyzov, I. The changing usage of a mature campus-wide wireless network. Computer Networks, 52, 14 (Oct 2008), 2690-2712.

[6]   Hsu, W. and Helmy, A. On modeling user associations in wireless LAN traces on university campuses. In Proceedings of the IEEE WiNMee 2006 (Apr, 2006).

[7]   Balazinska, M. and Castro, P. Characterizing mobility and network usage in a corporate wireless local-area network. In Proceedings of the ACM MobiSys 2003 (San Francisco, CA, 2003). ACM.

[8]   McNett, M. and Voelker, G. M. Access and mobility of wireless PDA users. SIGMOBILE Mob. Comput. Commun. Rev., 9, 2 (Apr 2005), 40-55.

[9]   Meng, X., Wong, S. H. Y., Yuan, Y. and Lu, S. Characterizing flows in large wireless data networks. In Proceedings of the ACM MobiCom 2004 (Philadelphia, PA, USA, 2004). ACM.

[10] Papadopouli, M., Shen, H. and Spanakis, M. Characterizing the duration and association patterns of wireless access in a campus. In Proceedings of the 11th European Wireless Conference (Nicosia, Cyprus, Apr, 2005).

[11] Hsu, W. and Helmy, A. On Nodal Encounter Patterns in Wireless LAN Traces. In Proceedings of the IEEE WiNMee 2006 (Apr, 2006).

[12] Chaintreau, A., Hui, P., Crowcroft, J., Diot, C., Gass, R. and Scott, J. Impact of human mobility on opportunistic forwarding algorithms. IEEE Transactions on Mobile Computing (Jun 2007), 606-620.

[13] MobiLib: Community-wide Library of Mobility and Wireless Networks Measurements (Investigating User Behavior in Wireless Environments). http://nile.cise.ufl.edu/MobiLib/.

[14] Kotz, D. and Henderson, T. Crawdad: A community resource for archiving wireless data at Dartmouth. IEEE Pervasive Computing (Dec 2005), 12-14.

[15] Hsu, W.-J., Spyropoulos, T., Psounis, K. and Helmy, A. TVC: Modeling spatial and temporal dependencies of user mobility in wireless mobile networks. IEEE/ACM Trans. Netw., 17, 5 (Oct 2009), 1564-1577.

[16] Jain, R., Lelescu, D. and Balakrishnan, M. Model T: a model for user registration patterns based on campus WLAN data. Wirel. Netw., 13, 6 (Dec 2007), 711-735.

[17] Lelescu, D., Kozat, U. C., Jain, R. and Balakrishnan, M. Model T++: an empirical joint space-time registration model. In Proceedings of the 7th ACM MOBIHOC (Florence, Italy, May, 2006). ACM.

[18] Kim, M., Kotz, D. and Kim, S. Extracting a Mobility Model from Real User Traces. In Proceedings of the IEEE INFOCOM 2006 (Barcelona, Spain Apr, 2006).

[19] Bai, F. and Helmy, A. A Survey of Mobility Modeling and Analysis in Wireless Adhoc Networks, Wireless Ad Hoc and Sensor Networks, Springer, 2006.

[20] Bai, F., Sadagopan, N. and Helmy, A. The IMPORTANT framework for analyzing the Impact of Mobility on Performance Of RouTing protocols for Adhoc NeTworks. Ad Hoc Networks, 1, 4 (Nov 2003), 383-403.

[21] Hsu, W., Dutta, D. and Helmy, A. Mining behavioral groups in large wireless LANs. In Proceedings of the ACM MobiCom 2007 (Montréal, Québec, Canada, 2007). ACM.

[22] Kim, M. and Kotz, D. Periodic properties of user mobility and access-point popularity. Personal Ubiquitous Comput., 11, 6 (Aug 2007), 465-479.

[23] Eagle, N. and Pentland, A. Reality mining: sensing complex social systems. Personal and Ubiquitous Computing, 10, 4 (May 2006), 268.

[24] Ghosh, J., Beal, M. J., Ngo, H. Q. and Qiao, C. On profiling mobility and predicting locations of wireless users. In Proceedings of the 2nd international workshop on Multi-hop ad hoc networks: from theory to reality (Florence, Italy, 2006). ACM.

[25] Tang, D. and Baker, M. Analysis of a metropolitan-area wireless network. Wirel. Netw., 8, 2/3 (Nov 2002), 107-120.

[26] Borst, S. User-level performance of channel-aware scheduling algorithms in wireless data networks. IEEE/ACM Trans. Netw., 13, 3 (Jun 2005), 636-647.

[27] Jain, A. K. and Dubes, R. C. Algorithms for Clustering Data. Prentice-Hall, Englewood Cliffs, NJ, 1988.

[28] Dhillon, I. S. Co-clustering documents and words using bipartite spectral graph partitioning. In Proceedings of the ACM SIGKDD 2001 (San Francisco, CA, 2001). ACM.

[29] Dhillon, I. S. and Guan, Y. Information Theoretic Clustering of Sparse Co-Occurrence Data. In Proceedings of the Third IEEE International Conference on Data Mining (2003). IEEE Computer Society.

[30] Kumar, U., Thakur, G. and Helmy, A. PROTECT: proximity-based trust-advisor using encounters for mobile societies. In Proceedings of the ACM IWCMC 2010 (Caen, France, Jun, 2010). ACM.

[31] Reddy, S., Estrin, D. and Srivastava, M. Recruitment Framework for Participatory Sensing Data Collections. Pervasive Computing (May 2010), 138-155.

