Computer Science Assignment help on Profiling Internet Users

Profiling Internet Users
The source of data for this project is Cisco NetFlow version 5, which is one of the most popular
technologies for collecting IP traffic. Many parameters can be extracted from the source data, including
packets, octets, the beginning and end of each flow, source and destination port numbers, source
and destination IP addresses, and several other variables shown in the following figure.
To preserve privacy, all IP addresses have been removed from the data. The project includes 54 Excel
files, each corresponding to one user. Data was captured over a month-long period.
On average, each subject has more than 7,000 flows per week of data.
You may download the files from the link below:
https://drive.google.com/drive/folders/0Bw41Rn20xkcRenJmMEhtbzhVX0k
In this project, we want to determine whether the Internet usage of each subject is statistically
indistinguishable from the Internet usage of the same subject over time, while
simultaneously being statistically distinguishable from the Internet usage of other
subjects. Subsequently, we want to study how the time window chosen for profiling affects the
answer to this question. You can build a profile for each user based on many criteria;
however, it is highly suggested to use the variable named Internet usage, which can be calculated
as the ratio octets/duration.
You should write a program, in any language you are familiar with, that takes the network data as
input and performs the statistical analysis to determine the distinguishability or indistinguishability
between subjects. Each file should be opened and compared with the rest of the files. It is important
to mention that each file should be split into parts; for example, you can use windows as short as
10 seconds or as long as 24 hours. You should compare three time windows of 10 seconds, 227 seconds,
and 5 minutes to find out which time window best meets the goal above: each user's Internet usage
should be statistically indistinguishable from that user's own usage over time while being
statistically distinguishable from the Internet usage of other subjects. Each time
window contains several flows, so you may take the average of octets/duration within each
window, as illustrated in the sketch below. Note that some flows have a duration of 0: the measurement
granularity is too coarse, so the duration is recorded as 0 milliseconds. Since you need to divide
octets by duration and dividing by zero is undefined, you should exclude flows with a duration of 0.
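To make this step concrete, here is a minimal C# sketch of computing the per-window average of octets/duration while skipping zero-duration flows. The Flow type and its field names (Octets, DurationMs) are illustrative assumptions, not column names from the data files, and the duration unit should be checked against the files.

using System.Collections.Generic;
using System.Linq;

// Hypothetical in-memory representation of one NetFlow record; field names are
// placeholders, not the actual column headers in the Excel files.
class Flow
{
    public double Octets;
    public double DurationMs;   // flow duration, assumed to be in milliseconds
}

static class WindowStats
{
    // Mean octets/duration over the flows that fall inside one time window.
    // Zero-duration flows are skipped; returns 0 if no usable flow remains.
    public static double AverageUsage(IEnumerable<Flow> flowsInWindow)
    {
        var ratios = flowsInWindow
            .Where(f => f.DurationMs > 0)             // dividing by zero is undefined
            .Select(f => f.Octets / f.DurationMs)
            .ToList();
        return ratios.Count > 0 ? ratios.Average() : 0.0;
    }
}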
Data splitting should be based only on the column named “Real First Packet” in the Excel files;
you do not need to consider the “Real End Packet” column. The “Real First Packet” column shows
the initial date and time of each flow in epoch format. You may convert this value to a human-readable
date and time. The following link describes conversion methods for most programming languages,
and a brief C# sketch follows the link:
https://www.epochconverter.com/
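As a small illustration, the C# sketch below converts an epoch value to a DateTime using the standard DateTimeOffset helpers. Whether “Real First Packet” is recorded in seconds or milliseconds is an assumption you should verify against the files.

using System;

static class EpochHelper
{
    // Epoch value given in whole seconds since 1970-01-01 UTC.
    public static DateTime FromEpochSeconds(long epochSeconds) =>
        DateTimeOffset.FromUnixTimeSeconds(epochSeconds).UtcDateTime;

    // Same conversion if the values turn out to be in milliseconds.
    public static DateTime FromEpochMilliseconds(long epochMilliseconds) =>
        DateTimeOffset.FromUnixTimeMilliseconds(epochMilliseconds).UtcDateTime;
}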
You may do some initial calculations in the Excel files, such as computing the ratio octets/duration.
However, it is recommended to implement all the tasks in the programming language you are familiar
with; this makes it much easier to keep track of the data from input to output.
For the statistical analysis part, you may use the following steps.
1. After splitting the data into parts as explained before, you need to find the correlation between
them. Since you are comparing correlations across weeks, you should split the month's worth of
Internet usage data into four groups, one per week, for all subjects; a sketch of this grouping
follows this step. A brief snapshot of two weeks of data for two subjects across time is shown in
the following figure. In this sample, a window of 227 seconds is chosen. The column on the left
shows sample data for “User A” and the column on the right shows sample data for “User B”. Each row
represents a window. For instance, the first row represents data for Monday from 00:00:00am to
00:03:47am, a 227-second window with a value of 6.3972 for octets/duration. The same procedure
continues up to the last window of the week, from 11:56:13pm to 00:00:00am on Friday.
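As a minimal sketch of this grouping (under assumed names), the helper below places each window's average usage into a fixed-length weekly vector indexed by its offset from the start of the week; windows with no usable flows simply stay at 0. The method and parameter names are placeholders, not part of the assignment.

using System;
using System.Collections.Generic;

static class WeekBuilder
{
    // windowAverages: (window start time, average octets/duration) pairs,
    // e.g. produced with the AverageUsage sketch above.
    public static double[] BuildWeekVector(
        IEnumerable<(DateTime start, double usage)> windowAverages,
        DateTime weekStart,
        int windowSeconds)
    {
        int windowsPerWeek = (int)(TimeSpan.FromDays(7).TotalSeconds / windowSeconds);
        var vector = new double[windowsPerWeek];
        foreach (var (start, usage) in windowAverages)
        {
            int index = (int)((start - weekStart).TotalSeconds / windowSeconds);
            if (index >= 0 && index < windowsPerWeek)
                vector[index] = usage;    // empty windows remain 0
        }
        return vector;
    }
}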
2. At this step you need to calculate the correlation coefficient values. There are three main types
of correlation coefficients; in this project, it is recommended to use Spearman's correlation
coefficient. You may calculate it in Excel, or you may find the formula and implement it in your
code (an illustrative sketch follows this step). You need to find three correlation values: r1a2a,
r1a2b, and r2a2b, where the numbers denote weeks and the letters denote subjects. For example, r1a2a
denotes the Spearman's correlation coefficient between the Internet usage of “Subject a” for week 1
and the Internet usage of “Subject a” for week 2. Similarly, r1a2b denotes the Spearman's correlation
coefficient between the Internet usage of “Subject a” for week 1 and the Internet usage of “Subject b”
for week 2, and r2a2b denotes the Spearman's correlation coefficient between the Internet usage of
“Subject a” for week 2 and the Internet usage of “Subject b” for week 2. For the calculation, you may
use the following formula in your code:
r_s = 1 - \frac{6 \sum_i d_i^2}{n(n^2 - 1)}

d_i: the difference between the ranks of corresponding values
n: number of observations (number of windows in a week)
If you are not familiar with Spearman’s correlation you may learn from the link below:
https://www.wikihow.com/Calculate-Spearman%27s-Rank-Correlation-Coefficient
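Below is a minimal C# sketch of the formula above: rank both series, take the rank differences d_i, and apply the formula. It assigns average ranks to tied values; with many ties the textbook formula is only approximate, so treat this as an illustration rather than a definitive implementation.

using System;
using System.Linq;

static class Spearman
{
    public static double Correlation(double[] x, double[] y)
    {
        if (x.Length != y.Length || x.Length < 2)
            throw new ArgumentException("Series must have equal length of at least 2.");

        double[] rx = Ranks(x);
        double[] ry = Ranks(y);
        double n = x.Length;
        double sumD2 = rx.Zip(ry, (a, b) => (a - b) * (a - b)).Sum();
        return 1.0 - 6.0 * sumD2 / (n * (n * n - 1.0));
    }

    // Average (1-based) ranks; tied values share the mean of their positions.
    private static double[] Ranks(double[] values)
    {
        var order = Enumerable.Range(0, values.Length)
                              .OrderBy(i => values[i])
                              .ToArray();
        var ranks = new double[values.Length];
        int pos = 0;
        while (pos < values.Length)
        {
            int end = pos;
            while (end + 1 < values.Length && values[order[end + 1]] == values[order[pos]])
                end++;
            double avgRank = (pos + end) / 2.0 + 1.0;
            for (int k = pos; k <= end; k++)
                ranks[order[k]] = avgRank;
            pos = end + 1;
        }
        return ranks;
    }
}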
3. Based on the correlation values calculated in the previous step, the main part of this project
can be done. For the statistical framework of this project, Meng, Rosenthal, and Rubin's Z test
statistic (MRR-Z test) can be employed to find the value of Z. The required formulas are given below,
followed by an illustrative code sketch; the correlation coefficients calculated in the previous step
are plugged into the Z formula.
Z = \left(z_{1a2a} - z_{1a2b}\right) \sqrt{\frac{N - 3}{2\,(1 - r_{2a2b})\,h}}

z_{1a2a} = \frac{1}{2}\ln\frac{1 + r_{1a2a}}{1 - r_{1a2a}}

z_{1a2b} = \frac{1}{2}\ln\frac{1 + r_{1a2b}}{1 - r_{1a2b}}

h = \frac{1 - f\,\bar{r}^2}{1 - \bar{r}^2}

f = \frac{1 - r_{2a2b}}{2\,(1 - \bar{r}^2)}

\bar{r}^2 = \frac{r_{1a2a}^2 + r_{1a2b}^2}{2}

N: sample size of the data set (number of windows in a week)
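The following C# sketch evaluates the formulas above directly. Capping f at 1 follows the original Meng, Rosenthal, and Rubin formulation and is not stated in the formulas above, so treat that line as an assumption.

using System;

static class MrrZ
{
    // r1a2a and r1a2b are the two correlations being compared, r2a2b is the
    // correlation between the two week-2 series, and n is the number of
    // windows in a week.
    public static double ZStatistic(double r1a2a, double r1a2b, double r2a2b, int n)
    {
        double z1 = 0.5 * Math.Log((1.0 + r1a2a) / (1.0 - r1a2a));   // Fisher transform of r1a2a
        double z2 = 0.5 * Math.Log((1.0 + r1a2b) / (1.0 - r1a2b));   // Fisher transform of r1a2b
        double rBarSq = (r1a2a * r1a2a + r1a2b * r1a2b) / 2.0;
        double f = (1.0 - r2a2b) / (2.0 * (1.0 - rBarSq));
        if (f > 1.0) f = 1.0;   // cap at 1, as in Meng, Rosenthal, and Rubin (assumption)
        double h = (1.0 - f * rBarSq) / (1.0 - rBarSq);
        return (z1 - z2) * Math.Sqrt((n - 3.0) / (2.0 * (1.0 - r2a2b) * h));
    }
}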
4. Based on the Z value calculated from the previous part, the corresponding P-value can be
computed as follows:
P = 1 − Φ(Z)
where Φ(Z) is the cumulative distribution function of the standard normal distribution. You may use
the following function in your code to find the P-value. The function is written in C#, but you may
adapt it to any other language.
static double PFunction(double z)
{
    // Approximates Φ(z), the standard normal CDF, via the Abramowitz-Stegun
    // approximation of the error function; the P-value is then 1 - PFunction(Z).
    double p = 0.3275911;
    double a1 = 0.254829592;
    double a2 = -0.284496736;
    double a3 = 1.421413741;
    double a4 = -1.453152027;
    double a5 = 1.061405429;

    int sign = (z < 0.0) ? -1 : 1;

    double x = Math.Abs(z) / Math.Sqrt(2.0);
    double t = 1.0 / (1.0 + p * x);
    double erf = 1.0 - (((((a5 * t + a4) * t + a3) * t + a2) * t + a1) * t) * Math.Exp(-x * x);
    return 0.5 * (1.0 + sign * erf);
}
5. Finally, based on the P-value calculated in the previous step, you can decide whether two users
are distinguishable or indistinguishable from each other. P ≤ 0.05 means that the correlation
coefficient calculated for the Internet usage pattern of an unknown subject (say b) is significantly
smaller than that for a known subject (say a), and as such “subject b” is identified as a subject
distinct from “subject a”. On the contrary, P > 0.05 indicates that the correlation coefficient
calculated for the Internet usage pattern of an unknown subject (say b) is not significantly smaller
than that for a known subject (say a), and as such “subject b” is identified as indistinguishable
from “subject a”. A short sketch combining steps 4 and 5 follows.
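The sketch below ties steps 4 and 5 together under the same assumed names as the earlier sketches: it uses the PFunction from step 4 (which returns Φ(Z), so the P-value is 1 − PFunction(Z)) and the hypothetical MrrZ.ZStatistic helper from the step-3 sketch.

// Returns true when subject b is distinguishable from subject a for this
// week pair, i.e. when the P-value falls at or below 0.05.
static bool AreDistinguishable(double r1a2a, double r1a2b, double r2a2b, int n)
{
    double z = MrrZ.ZStatistic(r1a2a, r1a2b, r2a2b, n);   // step 3 (sketch)
    double pValue = 1.0 - PFunction(z);                   // step 4: P = 1 - Φ(Z)
    return pValue <= 0.05;                                // step 5 decision rule
}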
To finish, you need to write a report that briefly explains the procedure and includes a table
showing the three time windows used in the project (10 seconds, 227 seconds, and 5 minutes), the
average number of matches for each window (the average number of P-values greater than 0.05 across
all users), and which window is better in terms of authentication. By better authentication we mean
that you need to find out for which time window (a) each user's data in one week is statistically
indistinguishable from the same user's data in another week; and (b) the number of other users whose
data is statistically indistinguishable from a particular user is minimal (ideally 0).
For this project, you will report results by comparing data across Week 1 – Week 2; Week 2 –
Week 3; Week 3 – Week 4.
You will submit all code and a report as a Zip file on Canvas. The deadline is April 26th at midnight.
Earlier submissions are encouraged so that the grader can verify that all files are there.
