NEtwork MEssage Syntax analysYS (WOOT 2018) and NEtwork MEssage TYpe identification by aLignment (INFOCOM 2020)

vs-uulm, updated 🕥 2022-10-30 16:47:30

NEMESYS, NEMETYL, and followup work

This repository contains one integrated proof-of-concept implementation of multiple concepts, methods, and algorithms developed during the dissertation of Stephan Kleber, ORCID 0000-0001-9836-4897. The implementation contains the following concepts, methods, and algorithms:


NEMESYS: NEtwork MEssage Syntax analysYS and FMS: Format Match Score

Paper authors: Stephan Kleber, Henning Kopp, and Frank Kargl, Institute of Distributed Systems, Ulm University

https://www.usenix.org/conference/woot18/presentation/kleber

On usage please cite as:

S. Kleber, H. Kopp, and F. Kargl: „NEMESYS: Network Message Syntax Reverse Engineering by Analysis of the Intrinsic Structure of Individual Messages“, in 12th USENIX Workshop on Offensive Technologies, Baltimore, MD, USA, 2018.



NEMETYL: NEtwork MEssage TYpe identification by aLignment

Paper authors: Stephan Kleber, Rens W. van der Heijden, and Frank Kargl, Institute of Distributed Systems, Ulm University

https://arxiv.org/abs/2002.03391

On usage please cite as:

S. Kleber, R. W. van der Heijden, and F. Kargl: „Message Type Identification of Binary Network Protocols using Continuous Segment Similarity“, in IEEE International Conference on Computer Communications, 2020.


Network Message Field Type Classification

Paper authors: Stephan Kleber, Frank Kargl, Institute of Distributed Systems, Ulm University and Milan Stute, Matthias Hollick, Secure Mobile Networking Lab, Technical University of Darmstadt

On usage please cite as:

Stephan Kleber, Milan Stute, Matthias Hollick, and Frank Kargl: „Network Message Field Type Classification and Recognition for Unknown Binary Protocols“. In Proceedings of the DSN Workshop on Data-Centric Dependability and Security. DCDS. Baltimore, Maryland, USA: IEEE/IFIP, 2022.


PCA for Network Message Segmentation

Paper authors: Stephan Kleber and Frank Kargl, Institute of Distributed Systems, Ulm University

On usage please cite as:

Stephan Kleber and Frank Kargl: „Refining Network Message Segmentation with Principal Component Analysis“. In Proceedings of the tenth annual IEEE Conference on Communications and Network Security. CNS. Austin, TX, USA: IEEE, 2022.


Release: CNS2022

Code author: Stephan Kleber ([email protected]), Institute of Distributed Systems, Ulm University

NEMESYS is a novel method to infer structure from network messages of binary protocols. The method derives field boundaries from the distribution of value changes throughout individual messages. Our method exploits the intrinsic features of structure which are contained within each single message instead of comparing multiple messages with each other.
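The idea of reading field boundaries from value changes inside a single message can be illustrated with a rough sketch. It assumes that "value change" is measured as the share of equal bits between neighbouring bytes (bit congruence) and that a boundary candidate is a positive local peak in the change of that measure. The function names and the peak rule are hypothetical simplifications for illustration, not the code in this repository, and the sketch omits any smoothing.

```python
def bit_congruence(a, b):
    """Share of the 8 bit positions at which two byte values agree."""
    return 1.0 - bin((a ^ b) & 0xFF).count("1") / 8


def congruence_deltas(message):
    """Change of bit congruence from one byte pair to the next."""
    bc = [bit_congruence(x, y) for x, y in zip(message, message[1:])]
    return [later - earlier for earlier, later in zip(bc, bc[1:])]


def candidate_boundaries(message):
    """Byte offsets at which the delta has a positive local peak.

    Hypothetical simplification: no smoothing, and the peak at delta
    index k is mapped to byte offset k + 1, the byte shared by the two
    compared byte pairs.
    """
    deltas = congruence_deltas(message)
    cuts = []
    for k in range(1, len(deltas) - 1):
        if deltas[k] > 0 and deltas[k - 1] < deltas[k] >= deltas[k + 1]:
            cuts.append(k + 1)
    return cuts


# Made-up message: two changing bytes followed by a run of zero bytes.
print(candidate_boundaries(bytes.fromhex("00ff000000ff")))
```

Very short messages yield no candidates in this sketch, because at least three deltas are needed to detect a peak.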

Additionally, this repository contains the reference implementation for calculating the Format Match Score: It is the first quantitative measure of the quality of a message format inference.

NEMETYL is a novel method for discriminating protocol message types from each other. It uses structural features of binary protocols inferred by NEMESYS to accurately recognize structural patterns and cluster messages based on their common structure.

Network Message Field Type Classification is the first generic method to analyze message field data types in unknown binary protocols by clustering of segments with the same data type as a kind of semantic deduction.

PCA for Network Message Segmentation is a method to refine the approximation of the field inference. It uses principal component analysis (PCA) to discover linearly correlated variance between sets of message segments. We relocate the boundaries of the initial coarse segmentation to match the true fields more accurately.

NEMESYS, FMS, NEMETYL, the Field Type Classification, and PCA-refined Segmentation are intended to be used as a library that you integrate into your own scripts. However, you can also use them interactively with your favorite Python shell. Have a look at nemesys.py, nemesys_fms.py, nemesys_pca-refinement.py, nemezero_pca-refinement.py, nemetyl_align-segments.py, and nemeftr-prod_cluster-segments.py to get an impression of the basic functionality and how to call it. The other scripts in src/ show how other aspects of the contained methods can be used and explored. All of these scripts can be called with command line parameters to immediately start analyzing protocol traces with the default settings also used in the published papers.

Disclaimer

This is highly experimental software and by no means guaranteed to be fit for production use. In particular, it has not been tested with varying platforms, Python environments, and dependencies. If you run into any problems during deployment and usage, please feel highly encouraged to provide feedback or ask questions via email ([email protected]), via a GitHub issue, or via a pull request.

Requirements

  • Python 3
  • libpcap for pcapy: apt-get install libpcap-dev libpq-dev
  • Install packages listed in requirements.txt: pip install -r requirements.txt
    • This requires libpcap for pcapy: sudo apt-get install libpcap-dev
  • Manual install of Netzob from the "next" branch (the current Netzob version available in the official repository and via PyPI lacks some required fixes):
    • clone Netzob next branch to a local folder: git clone --single-branch -b next https://github.com/netzob/netzob.git
    • install it: python setup.py install
  • tshark version 2.x or 3.x (tested with: 2.2.6, 2.6.3, 2.6.5, 2.6.8, 3.2.3, 3.2.5). Other versions may work, depending on the compatibility of the JSON output format of dissected messages. Please report further working versions, e.g., via a GitHub issue.
    Note: NEMESYS can be used without tshark as long as FMS validation (in package validation) against a real dissector is NOT required.

Place your user in the "wireshark" group to enable tshark to run: sudo gpasswd -a $USER wireshark

Dockerfile

A Dockerfile is provided so that you can use Nemere without worrying about dependencies. The following commands build the image and start a container with the current directory bind-mounted as a volume, so that input files can be used easily and generated reports remain accessible even after the container has stopped.

    git clone https://github.com/vs-uulm/nemesys.git
    cd nemesys
    docker build . -t nemere:latest
    docker run -ti --mount type=bind,source=$(pwd),target=/nemere/ nemere:latest

For the use of FMS where tshark is going to be used, we need to add the --privileged flag to the run command above.

Sample scripts

Several scripts provide a starting point for working with NEMESYS and FMS. The short descriptions in this section give an overview of their functions.

Globally available options

All scripts provide these command line options:

  • Get help for the command line options by calling each script only with parameter -h
  • Select the network layer to analyze with -l [#] and optionally add -r to make the given layer number relative to the IP layer (if the trace contains one). It is highly recommended to always include these options, since layer selection is a common cause of usage errors. To select the application layer above any transport protocol (regardless of whether it is TCP, UDP, or any other protocol known to tshark), use -r -l2
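As a minimal sketch of how these options combine, the following helper assembles a command line for one of the sample scripts. The helper names and the trace path input/trace.pcap are hypothetical. Only the -l and -r options and the script location under src/ are taken from this README.

```python
import subprocess
import sys


def build_command(script, pcap, layer=2, relative_to_ip=True):
    """Assemble the argument list for one of the sample scripts.

    build_command is a hypothetical helper for illustration. It only
    uses the -l and -r options described above.
    """
    cmd = [sys.executable, script]
    if relative_to_ip:
        cmd.append("-r")          # count layers relative to the IP layer
    cmd += ["-l", str(layer), pcap]
    return cmd


def run_script(script, pcap, **options):
    """Run the script and raise if it exits with an error."""
    return subprocess.run(build_command(script, pcap, **options), check=True)


# Application layer above the transport protocol, as recommended above.
cmd = build_command("src/nemesys_fms.py", "input/trace.pcap")
print(" ".join(cmd[1:]))
```

Calling run_script with the same arguments would start the analysis. It is left out of the example so that the sketch runs without a trace file.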

prep_*

PCAP preparation scripts:

  • prep_deduplicate-trace.py pcapfilename

Detect identical payloads and de-duplicate traces, ignoring encapsulation metadata.

  • prep_filter-maxdiff-trace.py pcapfilename

Filter a PCAP for the subset of packets that have the maximum difference to all other messages.
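The de-duplication step can be pictured with a short sketch. It assumes the payloads have already been extracted from the trace as bytes objects, so that encapsulation metadata plays no role. The function name is hypothetical, and this is not the implementation used by prep_deduplicate-trace.py.

```python
def deduplicate_payloads(payloads):
    """Keep the first occurrence of each payload and preserve the order.

    Hypothetical helper: expects an iterable of bytes objects that hold
    only the payload of the analyzed layer.
    """
    seen = set()
    unique = []
    for payload in payloads:
        if payload not in seen:
            seen.add(payload)
            unique.append(payload)
    return unique


# Made-up payloads: the third one repeats the first.
trace = [b"\x01\x02", b"\x03\x04", b"\x01\x02", b"\x05"]
print(deduplicate_payloads(trace))
```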

check_*

Basic checks whether PCAPs are parseable:

These checks are relevant only for the validation of NEMESYS output by FMS, not for NEMESYS itself.

The tshark-dissected fields that are contained in the PCAPs need to be known to the message parser. Therefore, validation.messageParser.ParsingConstants needs to be made aware of any field occurring in the traces.

  • check_parse-pcap.py pcapfilename

Parse a PCAP file and print its dissection for testing. This helps to verify whether there are any unknown fields that need to be added to validation.messageParser.ParsingConstants. Before starting to validate or use FMS with a new PCAP, first run this check and resolve any errors.

  • check_pcap-info.py pcapfilename

Parse PCAP, print some statistics and infos about it, and open an IPython shell.

nemesys_*

Infer messages from PCAPs by the NEMESYS approach (BCDG-segmentation) and write FMS and other evaluation data to reports and plots.

  • nemesys.py pcapfilename

Run NEMESYS and open an interactive python shell (IPython) that provides access to the analysis results.

  • nemesys_fms.py pcapfilename

Run NEMESYS and validate it by calculating the FMS for each inferred message. Writes the results as text files to a timestamped subfolder of reports/

  • nemesys_field-deviation-plot.py pcapfilename

Run NEMESYS and validate it by visualizing the deviation of the true format from each inferred message. Writes the results as plots in PDF files to a timestamped subfolder of reports/

  • nemesys_pca-refinement.py pcapfilename

Reference implementation of the refinement of segments of messages according to their variance measured by PCA as described in our CNS 2022 paper. Writes the results as text files to a timestamped subfolder of reports/

nemetyl_*

Infer message types from PCAPs using Canberra dissimilarity as feature to determine segment similarity. Similar segments are then aligned in multiple messages and clustered based on their alignment scores. The clusters denote the inferred message types of similar structure.

  • nemetyl_align-segments.py pcapfilename

Run NEMETYL and write the analysis results to a subfolder of reports/.
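To give an impression of the dissimilarity measure named above, the following sketch computes a Canberra dissimilarity between two byte segments. It assumes segments of equal length and divides the classical Canberra distance by the segment length so that the result lies between 0 and 1. Both choices are simplifying assumptions for illustration, and the function name is hypothetical. Segments of different lengths are not covered here.

```python
def canberra_dissimilarity(seg_a, seg_b):
    """Length-normalized Canberra distance between two byte segments.

    Assumption for this sketch: both segments have the same, non-zero
    length. Byte pairs that are both zero contribute nothing.
    """
    if len(seg_a) != len(seg_b) or len(seg_a) == 0:
        raise ValueError("segments must be non-empty and of equal length")
    total = 0.0
    for x, y in zip(seg_a, seg_b):
        if x or y:
            total += abs(x - y) / (x + y)
    return total / len(seg_a)


print(canberra_dissimilarity(b"\x01\x03", b"\x03\x01"))
```

Identical segments yield 0.0, and segments that differ maximally in every byte yield 1.0.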

nemeftr_*

Classify Network Message Field Types using Canberra dissimilarity and DBSCAN clustering.

  • nemeftr-prod_cluster-segments.py

Reference implementation for calling NEMEFTR-full mode 1, the NEtwork MEssage Field Type Recognition (DSN 2022), classification of data types, with an unknown protocol.

  • nemeftr_cluster-segments.py

Plot and print dissimilarities between segments for an early state of NEMEFTR and for the evaluation of the epsilon autoconfiguration. Segments are generated from a heuristic segmentation. Output for evaluation are a dissimilarity topology plot and histogram, ECDF plot, clustered vector visualization plots, and segment cluster statistics.

  • nemeftr_cluster-true-fields.py

Plot and print dissimilarities between segments for an early state of NEMEFTR and for the evaluation of the epsilon autoconfiguration. Segments are generated from an optimal baseline segmentation. Output for evaluation are a dissimilarity topology plot and histogram, ECDF plot, clustered vector visualization plots, and segment cluster statistics.
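The clustering step can be sketched with a minimal DBSCAN that works on a precomputed dissimilarity matrix. This is a textbook-style illustration, not the implementation used by the nemeftr scripts. The function name is hypothetical, the matrix is made up, and eps is picked by hand here, whereas the scripts above evaluate an automatic configuration of epsilon.

```python
def dbscan_precomputed(dist, eps, min_samples):
    """Minimal DBSCAN on a symmetric dissimilarity matrix.

    Returns one cluster label per item; -1 marks noise. A point counts
    as its own neighbor, so min_samples includes the point itself.
    """
    n = len(dist)
    labels = [None] * n                      # None means "not visited yet"

    def neighbors(i):
        return [j for j in range(n) if dist[i][j] <= eps]

    cluster = -1
    for i in range(n):
        if labels[i] is not None:
            continue
        seeds = neighbors(i)
        if len(seeds) < min_samples:
            labels[i] = -1                   # noise, may become a border point
            continue
        cluster += 1
        labels[i] = cluster
        queue = [j for j in seeds if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == -1:
                labels[j] = cluster          # former noise becomes a border point
            if labels[j] is not None:
                continue
            labels[j] = cluster
            reachable = neighbors(j)
            if len(reachable) >= min_samples:
                queue.extend(reachable)      # j is a core point, keep expanding
    return labels


# Made-up one-dimensional "segments" with absolute difference as dissimilarity.
values = [0, 1, 2, 10, 11, 50]
matrix = [[abs(p - q) for q in values] for p in values]
print(dbscan_precomputed(matrix, eps=1.5, min_samples=2))
```

In this toy example the first three values form one cluster, the next two form a second cluster, and the isolated value is reported as noise.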

netzob_*

Infer PCAP with Netzob and compare the result to the tshark dissector of the protocol.

  • netzob_fms.py

Run Netzob and validate it by calculating the FMS for each inferred message. Writes the results as text files to a timestamped subfolder of reports/

  • netzob_messagetypes.py

Infer PCAP with Netzob and compare the result to the tshark dissector of the protocol.

Issues

When I run prep_filter-maxdiff-trace.py, I get this problem. Could you tell me how to solve it?

opened on 2022-11-06 09:34:41 by aoaowu

[two attached images]

The pcap file I used is https://github.com/netplier-tool/NetPlier/blob/master/data/smb2_100.pcap

Slow distance calculator execution time.

opened on 2022-05-11 09:00:05 by phillips96

Hello,

I noticed there's a TODO in inference/templates.py regarding the number of calls to embedSegment and scipy's cdist operation. I'm not sure if you're accepting PR's but I took a stab at it yesterday out of necessity.

Here's one such example of the improvements. Note the results are identical.

.... 34523895 segment pairs in 704.63 seconds. # Old
.... 34523895 segment pairs in 149.37 seconds. # New

That was on a single worker, and with parallelDistanceCalc enabled it went from 310 seconds to 125.

For the most part, I just implemented the changes that had been suggested, such that embedSegment now takes all of the outersegs and makes one cdist call instead of one call for each oseg in outersegs.

If you'd be happy for me to make the PR, I'll make one which implements the following new functions or overrides the old ones, based on your preference:

  • _outerloop
  • embedSegmentGroup
  • __prepareValuesMatrices

Thanks :)

For short input, refinement function throws an error

opened on 2022-03-06 22:36:55 by jaredchandler

When given small (short) input such as these 10 messages...

00000A41
00002934
00005984
000039CB
000021CA
0000486A
00002779
00005374
00004535
000051C1

The refinement function throws an error.

```
Traceback (most recent call last):
  File "nemesys.py", line 52, in <module>
    refinedPerMsg = refinements(segmentsPerMsg, unused=None)
  File "/content/nemesys/src/nemere/inference/segmentHandler.py", line 272, in refinements
    return nemetylRefinements(segmentsPerMsg)
  File "/content/nemesys/src/nemere/inference/segmentHandler.py", line 332, in nemetylRefinements
    splitfixed = refine.SplitFixed(charmerged).split(0, 1)
  File "/content/nemesys/src/nemere/inference/formatRefinement.py", line 610, in split
    selSeg = self.segments[segmentID]
IndexError: list index out of range
```

Calculate independent FMSs for each symbol message (fixes #15)

opened on 2021-08-10 16:48:24 by techge

symbolListFMS only calculated the FMS for one message instead of for all messages of a symbol. Now it creates a DissectorMatcher object for each message of the symbol and thus produces a list of the FMS values of all messages of a symbol.

As far as I can see, the TODO already mentions the problem I ran into: the code as it stands adds only one message, not all. I cleaned it up. I think it is working as intended now. Or am I missing anything here? Works for me so far :) and for netzob_fms.py and nemesys_fms.py as far as I can see...

Would fix #15

Add interactive question which layer to use (fixes #14)

opened on 2021-08-10 16:36:34 by techge

The layer submitted as a parameter is mostly used with regard to Netzob's importer and the OSI model, and it does not match the layer count of Wireshark's dissection. This is why we should ask the user interactively which layer to use while presenting the layers actually present in the first message.

I can imagine that you will hesitate including this PR, but you may consider it.

We are passing through the layernumber and relativeToIP parameter a lot, just to do calculations that do not work for some cases (actually for most of the cases I wanted to use the comparator). I would prefer this not-perfect-either solution to choose the wireshark dissected layer I want to test instead of fiddling in the code to get what I want...

It looks like this:

```
$ python src/nemesys_fms.py -l 2 input/ethernet/test2.pcapng
Load messages...

Wait for tshark output (max 20s)...

Which protocol shall we analyze? 0

Try to remember your decision, but we might need to ask again...

Segment messages...
[...]
```

Would fix #14

What to do with encapsulated layers?

opened on 2021-07-20 17:52:14 by techge

The code of nemere (more precisely the FMS part) raises a very interesting question:

what to do with layers after (embedded in) the target protocol

The response is to just include them in the dissections:

```
# what to do with layers after (embedded in) the target protocol
if absLayNum < len(self.protocols):
    for embedded in self.protocols[absLayNum+1 : ]:
        dissectsub = ParsedMessage._getElementByName(layersvalue, embedded)
        if isinstance(dissectsub, list):
            self._dissectfull += dissectsub
        # else:
        #     print("Bogus protocol layer ignored: {}".format(embedded))
```

The thing is, I actually want to test the dissections of a protocol that encapsulates other layers, and the way I see it, FMS should only test the dissection of the protocol itself and not of the whole message. This is why I deactivated this part in my slightly adapted implementation. But I know PRE tools would handle this differently. Do you have any idea whether there is a way to solve this issue elegantly? I am not sure whether commenting it out this drastically causes problems later on.
