
Anonymization of the KIS 2018 dataset


  • Authors: Tomáš Mokoš, Miroslav Kohútik

KIS 2018 is a network dataset created by the Department of Information Networks of the Faculty of Management Science and Informatics, University of Žilina.

Network datasets are used to train network security systems, namely intrusion detection systems (IDS) and intrusion prevention systems (IPS). These systems must be able to distinguish ordinary benign traffic from attack traffic, so a network dataset must reflect real traffic, which contains both types, as closely as possible.

Because we wanted to make the KIS 2018 dataset publicly accessible, we had to remove all sensitive data that could be harmful if used against us. For anonymization we used the TraceWrangler packet capture toolkit.

The maximum size of a PCAP file that TraceWrangler can process is 2 GB, so the dataset has to be split into individual files, each smaller than 2 GB. The KIS 2018 dataset consists of 19 PCAP files, 11 of which are larger than 2 GB.

We used editcap to split the individual files, with commands of the following format:
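Finding the files that exceed the limit can be automated. The following is a minimal sketch, assuming the 2 GB limit is read as 2 GiB and that the captures carry the .pcap extension; the function names are hypothetical and not part of any tool mentioned here.

```python
from pathlib import Path
from typing import List

# Assumption: the 2 GB limit is interpreted as 2 GiB.
SIZE_LIMIT_BYTES = 2 * 1024**3


def needs_splitting(size_bytes: int, limit: int = SIZE_LIMIT_BYTES) -> bool:
    """A file must be split unless it is strictly smaller than the limit."""
    return size_bytes >= limit


def oversized_pcaps(directory: str, limit: int = SIZE_LIMIT_BYTES) -> List[Path]:
    """Return the .pcap files in `directory` that have to be split, sorted by name."""
    return sorted(
        path
        for path in Path(directory).glob("*.pcap")
        if needs_splitting(path.stat().st_size, limit)
    )
```

Calling oversized_pcaps on the dataset directory would return the files that have to go through editcap before they can be loaded into TraceWrangler.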

editcap -c 1000000 C:\_Datasety\KIS_2018\Dataset\moloch-180418-01.pcap C:\_Datasety\KIS_2018\Dataset\Split\moloch-180418-01-split-.pcap

The -c 1000000 option sets the maximum number of packets in each file produced by splitting the original PCAP file (C:\_Datasety\KIS_2018\Dataset\moloch-180418-01.pcap). The path C:\_Datasety\KIS_2018\Dataset\Split\moloch-180418-01-split-.pcap defines the destination and the base name of the resulting files. This particular PCAP file was split into three parts.
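With several files to split, the same command can be generated for each of them. The sketch below builds the editcap argument list in the format shown above; the helper names are hypothetical, and it assumes editcap is available on the PATH.

```python
import subprocess
from pathlib import Path
from typing import List


def build_split_command(
    source: Path, split_dir: Path, packets_per_file: int = 1_000_000
) -> List[str]:
    """Build an editcap command that splits `source` into chunks of at most
    `packets_per_file` packets, following the naming pattern used above."""
    # e.g. moloch-180418-01.pcap -> moloch-180418-01-split-.pcap
    output_template = split_dir / f"{source.stem}-split-{source.suffix}"
    return ["editcap", "-c", str(packets_per_file), str(source), str(output_template)]


def run_split(source: Path, split_dir: Path) -> None:
    """Run the split; raises CalledProcessError if editcap reports a failure."""
    subprocess.run(build_split_command(source, split_dir), check=True)
```

Keeping the command construction separate from the call to subprocess.run makes it easy to print and review the commands before anything is executed.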

After starting TraceWrangler, you can import a PCAP file by clicking the Files -> add File button. Once imported, the file is scanned and its metadata is displayed.


After clicking on Anonymize Files, a menu with the options applicable to the given PCAP file is displayed. These options include removing layers unknown to TraceWrangler, either by cutting away or by replacing the corresponding bytes, and replacing ports, VLAN IDs, MAC addresses and other fields.

Anonymization options using TraceWrangler

For example, in the IPv4 menu you can replace IP addresses. If you want to anonymize a large number of IP addresses, replacing each one with a manually entered address is impractical. Instead, you can check “Replace IP addresses by subnet” and pick “keep host part” from the list of options. Check “Recalculate CRC” and pick “Keep bad checksums bad” if needed.

IPv4 anonymization using TraceWrangler
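To make the idea behind replacing addresses by subnet while keeping the host part concrete, here is a small illustrative sketch. It is not TraceWrangler's implementation; the function name and the example subnets are hypothetical, and it assumes both subnets use the same prefix length.

```python
import ipaddress


def replace_subnet_keep_host(
    address: str, original_subnet: str, replacement_subnet: str
) -> str:
    """Move `address` from `original_subnet` into `replacement_subnet`,
    leaving the host bits unchanged. Addresses outside the original
    subnet are returned as they are."""
    addr = ipaddress.IPv4Address(address)
    old_net = ipaddress.IPv4Network(original_subnet)
    new_net = ipaddress.IPv4Network(replacement_subnet)
    if old_net.prefixlen != new_net.prefixlen:
        raise ValueError("both subnets must have the same prefix length")
    if addr not in old_net:
        return address
    # Keep only the host bits, then place them under the new network prefix.
    host_bits = int(addr) & int(old_net.hostmask)
    return str(ipaddress.IPv4Address(int(new_net.network_address) | host_bits))
```

Because the host part survives, two packets from the same original host still map to the same anonymized address, so the conversations in the capture remain intact.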

Finally, in the Output settings you can pick the directory in which the files will be saved. If you set the filename to <filename>_anonymized, the resulting file’s name will be the original file’s name with the string _anonymized appended. Confirm the settings by clicking “Okay” and click “Run” to start the anonymization.

An Error 103 warning may appear after the anonymization starts. This warning indicates an input/output problem: it occurs when TraceWrangler attempts to write to a directory that the user is not permitted to read from or write to.
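One way to rule out this cause before starting a long run is to check that the output directory can actually be written to. The following is a minimal sketch with a hypothetical helper name; it tests writability by creating a temporary file, which is a choice made for this example and not something TraceWrangler does itself.

```python
import tempfile


def directory_is_writable(directory: str) -> bool:
    """Return True if a temporary file can be created in `directory`."""
    try:
        # The temporary file is deleted automatically when the block exits.
        with tempfile.TemporaryFile(dir=directory):
            return True
    except OSError:
        # Covers missing directories as well as permission problems.
        return False
```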

Another problem may occur during the anonymization of some PCAP files. The creator of TraceWrangler has not fixed this bug yet, because it occurs randomly. According to him, it appears when a file contains a lot of packets, usually with files larger than 50 MB. He advised us to split the files into slices smaller than 50 MB and to merge them back together after anonymization.

We used mergecap, which is part of Wireshark, to merge the PCAP files. Wireshark also supports merging files through its GUI, but only two files at a time. In our case this would have been very time consuming, so we used the command line interface instead.

The command we used to merge the files:

mergecap.exe -w C:\_Datasety\KIS_2018\Dataset\Anonymized\Moloch_merged.pcapng

The -w option specifies the output file, and all of the other paths specify the files to be merged. Files are merged chronologically according to their timestamps. The resulting PCAP file was 51.7 GB in size.
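When many anonymized slices have to be merged, the list of input files is easier to generate than to type. The sketch below builds a mergecap argument list from an output path and a collection of input paths; the helper name and the input file names in the usage note are hypothetical, and mergecap is assumed to be on the PATH.

```python
from pathlib import Path
from typing import Iterable, List


def build_merge_command(output_file: Path, input_files: Iterable[Path]) -> List[str]:
    """Build a mergecap command that writes all `input_files` to `output_file`."""
    inputs = sorted(str(path) for path in input_files)
    if not inputs:
        raise ValueError("at least one input file is required")
    # -w names the output file; every remaining argument is an input file.
    return ["mergecap", "-w", str(output_file), *inputs]
```

For example, build_merge_command(Path("Moloch_merged.pcapng"), Path("Anonymized").glob("*_anonymized.pcap")) would collect every file with the _anonymized suffix from a directory named Anonymized. The resulting list can then be passed to subprocess.run.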

The dataset contained some duplicate packets, which needed to be removed. To remove the duplicates we used editcap with the -w option. This option expects <dup time window> as its parameter, which sets the size of the time window within which packets are tested for duplicates. To identify the duplicates properly, the packets must be in chronological order. The window size has to be chosen reasonably, otherwise the testing of large PCAP files may be very time consuming.

The time window we picked was 0.1 seconds.

The command we used:

editcap.exe -w 0.1 C:\_Datasety\KIS_2018\Dataset\Anonymized\Moloch_merged.pcapng C:\_Datasety\KIS_2018\Dataset\Anonymized\Moloch_dedup.pcapng

The size of the PCAP file after removing the duplicates is 50.5 GB.
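The principle of duplicate removal within a time window can be illustrated with a short sketch. This is a simplified model rather than editcap's actual implementation: packets are represented as hypothetical (timestamp, bytes) pairs in chronological order, and SHA-256 digests are used for comparison as a choice made for this example.

```python
import hashlib
from collections import deque
from typing import Deque, Iterable, Iterator, Tuple

Packet = Tuple[float, bytes]  # (timestamp in seconds, raw packet bytes)


def drop_duplicates(
    packets: Iterable[Packet], window_seconds: float = 0.1
) -> Iterator[Packet]:
    """Yield packets, skipping any whose content matches a packet already
    kept within the preceding `window_seconds`. Input must be time-ordered."""
    recent: Deque[Tuple[float, bytes]] = deque()  # (timestamp, digest) of kept packets
    for timestamp, payload in packets:
        # Forget kept packets that have fallen out of the time window.
        while recent and timestamp - recent[0][0] > window_seconds:
            recent.popleft()
        digest = hashlib.sha256(payload).digest()
        if any(seen == digest for _, seen in recent):
            continue  # duplicate inside the window, drop it
        recent.append((timestamp, digest))
        yield timestamp, payload
```

The sketch also shows why the window size affects running time: the larger the window, the more earlier packets each new packet has to be compared against.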
