Using machine learning techniques to classify TOR traffic
The Tor network is a project that aims to give its users private traffic: among other things, it lets them block trackers, defend against surveillance, resist fingerprinting and wrap their traffic in multiple layers of encryption. Tor empowers users to circumvent censorship and explore the internet with privacy. In this essay, I employ machine learning techniques to detect and classify traffic that occurs under Tor, which exposes the user's digital fingerprint to a degree.
What is Tor?
Tor is the onion routing network. Its goal is to improve the user’s privacy by sending traffic through various proxies. Communication is encrypted in multiple layers and routed via multiple hops through the Tor network to the final receiver. Citing the Tor project about page, Tor solves three privacy problems:
“First, Tor prevents websites and other services from learning your location, which they can use to build databases about your habits and interests. With Tor, your Internet connections don’t give you away by default — now you can have the ability to choose, for each connection, how much information to reveal.
Second, Tor prevents people watching your traffic locally (such as your ISP or someone with access to your home wifi or router) from learning what information you’re fetching and where you’re fetching it from. It also stops them from deciding what you’re allowed to learn and publish — if you can get to any part of the Tor network, you can reach any site on the Internet.
Third, Tor routes your connection through more than one Tor relay so no single relay can learn what you’re up to. Because these relays are run by different individuals or organizations, distributing trust provides more security than the old one hop proxy approach.”
Goals
Employing machine learning techniques, I intend to classify whether given network traffic is Tor or not by analyzing the traffic flows, inspired by the paper by Lashkari et al. (2017) [1]. With ML models, I should be able to reduce user privacy to some extent by exposing a) whether the user is currently using Tor and b) which activities he or she is conducting (browsing, chatting, downloading, and so on).
On the dataset
The dataset used is published by the University of New Brunswick, under the Canadian Institute for Cybersecurity. It can be found under https://www.unb.ca/cic/datasets/tor.html.
Understanding the data
We are provided with two scenarios, A and B. We shall use the first to classify between TOR and non-TOR activity, while the second we shall use to characterize which activities the user is conducting, which may be audio streaming, browsing, chatting, file-transfer, mail, P2P (such as torrenting), video streaming and VoIP.
We are provided with the source IP and destination IP of the machines, source and destination ports, protocol used, features and labels. The features are, as defined by the authors of the dataset:
- FIAT: Forward Inter Arrival Time, the time between two packets sent in the forward direction (mean, min, max, std).
- BIAT: Backward Inter Arrival Time, the time between two packets sent backwards (mean, min, max, std).
- Flow IAT: Flow Inter Arrival Time, the time between two packets sent in either direction (mean, min, max, std).
- Active: The amount of time a flow was active before going idle (mean, min, max, std).
- Idle: The amount of time a flow was idle before becoming active (mean, min, max, std).
- Flow Bytes/s: Flow Bytes per second.
- Flow Packets/s: Flow Packets per second.
- Duration: the duration of the flow.
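To make the feature definitions above concrete, here is a minimal sketch of how inter-arrival statistics could be derived from raw packet timestamps. The packet list, the `iat_stats` helper and the use of the population standard deviation are my own assumptions for illustration; the dataset's authors may have computed these values differently.

```python
import numpy as np

def iat_stats(timestamps):
    """Return mean, min, max and std of the gaps between consecutive packets."""
    ts = np.sort(np.asarray(timestamps, dtype=float))
    gaps = np.diff(ts)  # time between each packet and the next one
    if gaps.size == 0:  # fewer than two packets: no inter-arrival time exists
        return {"mean": 0.0, "min": 0.0, "max": 0.0, "std": 0.0}
    return {
        "mean": float(gaps.mean()),
        "min": float(gaps.min()),
        "max": float(gaps.max()),
        "std": float(gaps.std()),  # population std (an assumption)
    }

# Hypothetical flow: (timestamp in seconds, direction) pairs
packets = [(0.00, "fwd"), (0.10, "bwd"), (0.25, "fwd"), (0.40, "bwd"), (0.75, "fwd")]

fwd_ts = [t for t, d in packets if d == "fwd"]
bwd_ts = [t for t, d in packets if d == "bwd"]
all_ts = [t for t, _ in packets]

fiat = iat_stats(fwd_ts)      # forward inter-arrival time
biat = iat_stats(bwd_ts)      # backward inter-arrival time
flow_iat = iat_stats(all_ts)  # inter-arrival time in either direction
duration = max(all_ts) - min(all_ts)

print(fiat, biat, flow_iat, duration)
```

Each statistic summarizes the rhythm of a flow rather than its content, which is why these features remain available even when the payload is encrypted.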
Scenario A: TOR vs non-TOR traffic
For the first scenario, I checked how many rows were labeled as Tor traffic and how many were not.
The dataset is imbalanced, as the large difference between the label counts shows. We may already expect that models trained on it will tend to show high accuracy but a low recall score, due to the uneven distribution of observations.
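The following sketch illustrates why accuracy alone can mislead on imbalanced data. The label counts are hypothetical and do not come from the dataset: a "model" that always predicts the majority class still reaches high accuracy while finding no Tor flows at all.

```python
import numpy as np
from sklearn.metrics import accuracy_score, recall_score

# Hypothetical labels: 950 non-Tor flows (0) and 50 Tor flows (1)
y_true = np.array([0] * 950 + [1] * 50)

# A degenerate "model" that always predicts the majority class
y_pred = np.zeros_like(y_true)

acc = accuracy_score(y_true, y_pred)  # fraction of correct predictions
rec = recall_score(y_true, y_pred)    # fraction of Tor flows that were found

print("Accuracy:", acc)
print("Recall:", rec)
```

Here 950 of 1000 predictions are correct, yet none of the 50 Tor flows is detected, so recall is the metric to watch for the minority class.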
We also see that the features are mostly right-skewed. Flow duration is the only exception, having a considerable density on the rightmost side of the plot.
Scenario B: characterizing the usage
Scenario B poses the problem of discovering how the user is using the Tor network. Accordingly, all recorded traffic in scenario B occurred using Tor. We are provided with the following usage types:
- Audio traffic was captured from any continuous stream of data from Spotify.
- Browsing is any HTTP and HTTPS traffic generated by users while on Chrome or Firefox.
- Chatting identifies instant-messaging apps, such as Facebook, Skype, ICQ, etc.
- File-transfer identifies traffic that occurred through SFTP, FTPS and Skype file transfers.
- Mail identifies traffic that, obviously, delivered or received mail through SMTP/S, POP3/SSL and IMAP/SSL.
- P2P identifies file-sharing protocols, such as torrenting.
- Video traffic was captured from any continuous stream of data from YouTube and Vimeo using Chrome and Firefox.
- VoIP groups all traffic generated by voice applications, such as Facebook, Hangouts and Skype.
Scenario B follows the same pattern as scenario A, with all features being right-skewed except flow duration. Flow duration, in this case, is distributed almost entirely on the rightmost side of the plot, indicating that Tor usage is correlated with a greater flow duration.
Scenario A: Classifying between TOR and non-TOR traffic
The libraries that I shall use are traditional data analysis and machine learning libraries, such as pandas, numpy, matplotlib, seaborn and scikit-learn.
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix, accuracy_score, precision_score, recall_score, classification_report, ConfusionMatrixDisplay
from sklearn.preprocessing import StandardScaler, OneHotEncoder, LabelEncoder
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn import set_config
First, we shall define pipelines for our variables. Protocols are qualitative nominal variables, so I shall use OneHotEncoder for those, and standard scaling for the feature variables.
processor_1 = ('OneHotEncoder', OneHotEncoder(), [' Protocol'])
processor_2 = ('StdScaler', StandardScaler(), [' Flow Duration', ' Flow Bytes/s', ' Flow Packets/s', ' Flow IAT Mean', 'Fwd IAT Mean', 'Bwd IAT Mean', 'Active Mean', 'Idle Mean'])
preprocessor = ColumnTransformer([processor_1, processor_2])
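To see what such a preprocessor produces, here is a small sketch on a made-up four-row sample. The `toy` frame, its simplified column names and `toy_preprocessor` are assumptions for illustration only; the real dataset has more columns and different values.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical four-flow sample; real column names and values may differ
toy = pd.DataFrame({
    "Protocol": [6, 6, 17, 17],  # e.g. TCP = 6, UDP = 17
    "Flow Duration": [100.0, 200.0, 300.0, 400.0],
    "Flow Bytes/s": [10.0, 20.0, 30.0, 40.0],
})

toy_preprocessor = ColumnTransformer([
    ("OneHotEncoder", OneHotEncoder(), ["Protocol"]),
    ("StdScaler", StandardScaler(), ["Flow Duration", "Flow Bytes/s"]),
])

transformed = toy_preprocessor.fit_transform(toy)
# ColumnTransformer may return a sparse matrix; convert it for easy inspection
if hasattr(transformed, "toarray"):
    transformed = transformed.toarray()

print(transformed.shape)
print(transformed)
```

The protocol column becomes one indicator column per distinct protocol, and each numeric column is rescaled to zero mean and unit variance, so no single feature dominates because of its units.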
After defining the preprocessor, I wrote a function that fits a model and predicts values for any given dataset.
def quickFit(modelName, model, X_train, X_test, y_train, y_test):
    """Fits a model to a given dataset and displays accuracy, precision score and recall score. The function supposes that a preprocessor has already been created."""
    global preprocessor
    # Chain the shared preprocessor and the classifier into a single estimator
    model = Pipeline(steps=[('Preprocessing', preprocessor), (modelName, model)])
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)
    print('\n----- ' + modelName + ' -----')
    print(confusion_matrix(y_test, y_pred))
    print('Accuracy score: ' + str(accuracy_score(y_test, y_pred)))
    print('Precision score: ' + str(precision_score(y_test, y_pred)))
    print('Recall score: ' + str(recall_score(y_test, y_pred)))
Finally, I balanced the dataset so that our model is trained on the same number of non-Tor and Tor instances. That helped to improve our metrics, giving us greater scores.
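The balancing method is not detailed here, so the following is only one possible approach: random undersampling of the majority class with pandas. The `undersample` helper, the `label` column name and the toy counts are all hypothetical.

```python
import pandas as pd

def undersample(df, label_col, random_state=0):
    """Randomly keep as many rows of every class as the rarest class has."""
    n_min = df[label_col].value_counts().min()
    return df.groupby(label_col).sample(n=n_min, random_state=random_state)

# Hypothetical imbalanced sample: 8 non-Tor rows and 2 Tor rows
flows = pd.DataFrame({
    "Flow Duration": range(10),
    "label": ["nonTOR"] * 8 + ["TOR"] * 2,
})

balanced = undersample(flows, "label")
print(balanced["label"].value_counts())
```

Undersampling discards majority-class rows, which is simple but throws data away; oversampling the minority class would be an alternative with the opposite trade-off.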
Using quickFit, we now may start to generate models so we can classify between TOR and non-TOR traffic.
Generating our models
I have established the random forest classifier as the benchmark model, against which we will compare the other models. It has performed pretty well, with an accuracy score of 98%, precision score of 93% and recall score of 92%.
Logistic regression has performed poorly compared to the benchmark model, and thus should be discarded. Accuracy was high, but precision was only 67% and recall only 73%. As a matter of fact, the model performed worse than before balancing the dataset.
K-Neighbors classifier was the last model I’ve decided to implement to test against the benchmark. It has performed poorly compared to the random forest classifier, but better than the logistic regression model.
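The comparison above can be sketched as a loop over the three classifiers. This is not the exact code behind the reported numbers: it runs on synthetic data from `make_classification` with default hyperparameters, both of which are assumptions, so its scores will not match the ones quoted for the real dataset.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, precision_score, recall_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the flow features (NOT the real dataset)
X, y = make_classification(n_samples=600, n_features=8, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0, stratify=y
)

models = {
    "Random Forest": RandomForestClassifier(random_state=0),
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "K-Neighbors": KNeighborsClassifier(),
}

scores = {}
for name, clf in models.items():
    # Scale the features, then fit the classifier, as one estimator
    pipe = Pipeline([("scale", StandardScaler()), (name, clf)])
    pipe.fit(X_train, y_train)
    y_pred = pipe.predict(X_test)
    scores[name] = {
        "accuracy": accuracy_score(y_test, y_pred),
        "precision": precision_score(y_test, y_pred),
        "recall": recall_score(y_test, y_pred),
    }
    print(name, scores[name])
```

Keeping the split and the preprocessing identical across models means any difference in the scores comes from the classifier itself.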
Results
Given the above stats, it is apparent that random forest classifier is the best model to distinguish between TOR and non-TOR traffic.
Scenario B: Characterizing usage
Now that we can distinguish between TOR and non-TOR traffic, if we want to reduce user privacy we need to discover what users are doing. As said before, they may browse, stream audio or video, send e-mail, and so on. We shall implement the same models to predict user behavior according to their TOR traffic.
For these models, I have generated another function to speed up the process of fitting and displaying model accuracy.
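def quickFit(modelName, model, X, y):
    """Splits the data, fits a model and displays the confusion matrix and the per-class report."""
    global preprocessor
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
    # Encode the activity labels as integers (the encoded arrays must be assigned to take effect)
    le = LabelEncoder()
    y_train = le.fit_transform(y_train)
    y_test = le.transform(y_test)
    model = Pipeline(steps=[('Preprocessor', preprocessor), (modelName, model)])
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)
    print("\n----- " + modelName + ' -----')
    print(confusion_matrix(y_test, y_pred))
    # target_names maps the integer codes back to the activity names in the report
    print(classification_report(y_test, y_pred, labels=np.arange(len(le.classes_)), target_names=le.classes_))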
I shall again use a random forest classifier as the benchmark model.
The model performed rather poorly, with only 80% of correct predictions on the weighted average. The weighted average is the mean of a per-class metric, such as precision, recall or f1-score, weighted by the support of each class (its number of true instances).
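As a worked example of the weighted average, the sketch below computes per-class recall by hand and compares it with scikit-learn's `average="weighted"` option. The ten labels are invented for illustration and are unrelated to the actual results.

```python
import numpy as np
from sklearn.metrics import recall_score

labels = ["chat", "mail", "voip"]

# Hypothetical ground truth and predictions for ten flows
y_true = ["voip"] * 6 + ["chat"] * 2 + ["mail"] * 2
y_pred = ["voip"] * 6 + ["chat", "mail"] + ["chat", "chat"]

# Recall of each class, in the order given by `labels`
per_class = recall_score(y_true, y_pred, labels=labels, average=None)

# Support: how many true instances each class has
support = np.array([y_true.count(c) for c in labels])

# Weighted average = sum(metric * support) / sum(support)
manual = float((per_class * support).sum() / support.sum())
builtin = recall_score(y_true, y_pred, labels=labels, average="weighted")

print(per_class, support, manual, builtin)
```

With recalls of 0.5, 0.0 and 1.0 and supports of 2, 2 and 6, the weighted average is (1 + 0 + 6) / 10 = 0.7: the large, well-predicted VoIP class pulls the figure up and masks the poor results on the small classes.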
VoIP, P2P and file-transfer are the easiest to predict, while chatting and mailing are the hardest. This may be due to the fact that the first group demands more bandwidth and flow data, so it may be easily highlighted compared to other types of usage.
K-Neighbors classifier has performed worse than the benchmark model. It follows the same tendency of easily recognizing VoIP, P2P and file transfers, but missing chatting and mailing.
As we are dealing with a multi-class classification problem, I have decided to use a decision tree model instead of a multinomial logistic regression. The model has performed worse than the random forest classifier, but better than K-Nearest neighbors, scoring 77% on the weighted average. It continued the tendency of better recognizing VoIP, P2P and file transfers.
With the above information, we may say that the random forest classifier is the best model among the tested ones for finding out user activities based on network traffic.
Findings
While the models perform really well when distinguishing between TOR and non-TOR traffic, difficulties arise when classifying user activities on Tor.
Because the traffic is encrypted, it is impossible to assert with 100% certainty which type of traffic is being conducted. Using information such as flow bytes/s, flow packets/s and other network data may help us to reduce some user privacy, but only for some kinds of traffic. It is easy to identify VoIP, torrenting and file-sharing, but detecting e-mail traffic or instant messaging is not viable with these models.
Video traffic has a tendency to be classified as browsing, and this makes sense, as streaming through YouTube and Vimeo is done through web browsers. The same goes for audio. Chatting and mailing are also mislabeled as browsing, something that makes perfect sense, as most of these activities are done through web-based protocols.
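A confusion matrix makes this kind of mislabeling easy to spot. The sketch below uses invented labels, not the actual model output, and reports for each true class the wrong class it is most often predicted as.

```python
import numpy as np
from sklearn.metrics import confusion_matrix

labels = ["audio", "browsing", "chat", "video"]

# Hypothetical ground truth and predictions, four flows per class
y_true = ["audio"] * 4 + ["browsing"] * 4 + ["chat"] * 4 + ["video"] * 4
y_pred = (
    ["audio", "audio", "audio", "browsing"]
    + ["browsing"] * 4
    + ["chat", "browsing", "browsing", "chat"]
    + ["video", "video", "browsing", "video"]
)

# Rows are true classes, columns are predicted classes
cm = confusion_matrix(y_true, y_pred, labels=labels)

# Zero the diagonal so only the mistakes remain
errors = cm.copy()
np.fill_diagonal(errors, 0)

# For each true class with at least one mistake, find the most common wrong label
confused_with = {
    labels[i]: labels[int(row.argmax())]
    for i, row in enumerate(errors)
    if row.sum() > 0
}

print(cm)
print(confused_with)
```

In this toy case every misclassified class lands in the browsing column, which is the same pattern described above for the real predictions.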
The model performance may improve if a way is found to purge browsing characteristics from these other categories, but I leave that to those who have a greater understanding of networking.
See more on my GitHub: https://github.com/luccagodooy/tor-traffic-classification/blob/main/scenarioA.py
For those interested in a deeper analysis of this dataset, including a walk through the .arff files, I recommend reading the original paper published by the Canadian Institute for Cybersecurity (CIC) and the University of New Brunswick:
[1] Arash Habibi Lashkari, Gerard Draper-Gil, Mohammad Saiful Islam Mamun and Ali A. Ghorbani, “Characterization of Tor Traffic Using Time Based Features”, In the proceeding of the 3rd International Conference on Information System Security and Privacy, SCITEPRESS, Porto, Portugal, 2017.
https://pdfs.semanticscholar.org/d76f/32eb3af1a163c0fde624e9fc229671ca75b6.pdf