Torpeda is a framework for building labelled web traffic datasets.
Its aim is to provide a collection of datasets for evaluating and comparing the effectiveness of Web Application Firewalls (WAFs).

All datasets included in Torpeda:

Torpeda is open to contributions of new datasets built by other researchers and collaborators.

Motivation

Evaluation is one of the main issues that researchers encounter when proposing a new system for web attack detection. Performance measurements depend heavily on the specific data used for the evaluation, which makes it difficult to compare systems that were evaluated on different datasets.

Some public datasets are available to the community. However, very few of them are designed specifically to test WAFs. Torpeda aims to fill this gap by offering a common structure for creating new web-based datasets with which to test and compare different detectors.

Structure

A Torpeda dataset is presented as an XML document.
A dataset contains an arbitrary number of samples (the more samples, the better), each identified by a unique id. The document looks like this:

[Figure: General structure of a Torpeda dataset]

Each sample represents a labelled HTTP request and contains two major parts: captured data and labelling data.

The captured data section represents the description of the request itself. It contains only data that can be observed through a web sniffer. The request is described by its different components:

The labelling data section describes how the sample is classified by an expert. It contains data that the detector is supposed to predict. The following labels describe this section:

[Figures: Torpeda GET sample; Torpeda POST sample]
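The two-part sample layout described above can be illustrated with a minimal Python sketch. The element and attribute names used here (`dataset`, `sample`, `request`, `label`, `type`, and the fields inside `request`) are assumptions made for illustration only and may not match the actual Torpeda schema. The detector is a toy placeholder, not a real WAF. The sketch reads each sample, keeps the captured data apart from the labelling data, and counts how often the detector agrees with the expert label.

```python
import xml.etree.ElementTree as ET

# Hypothetical dataset: tag names are illustrative, not the real Torpeda schema.
SAMPLE_XML = """
<dataset>
  <sample id="1">
    <request>
      <method>GET</method>
      <path>/index.php</path>
      <query>id=42</query>
    </request>
    <label>
      <type>normal</type>
    </label>
  </sample>
  <sample id="2">
    <request>
      <method>POST</method>
      <path>/login.php</path>
      <body>user=admin' OR '1'='1</body>
    </request>
    <label>
      <type>attack</type>
    </label>
  </sample>
</dataset>
"""


def load_samples(xml_text):
    """Return a list of (sample_id, captured_data, expert_label) tuples."""
    root = ET.fromstring(xml_text)
    samples = []
    for sample in root.iter("sample"):
        # Captured data: what a sniffer could observe about the request.
        captured = {child.tag: (child.text or "") for child in sample.find("request")}
        # Labelling data: what the detector is supposed to predict.
        label = sample.find("label/type").text
        samples.append((sample.get("id"), captured, label))
    return samples


def toy_detector(captured):
    """Placeholder detector: flags any request field containing a single quote."""
    return "attack" if any("'" in value for value in captured.values()) else "normal"


def evaluate(samples, detector):
    """Count how many samples the detector classifies as the expert did."""
    correct = sum(1 for _, captured, label in samples if detector(captured) == label)
    return correct, len(samples)


samples = load_samples(SAMPLE_XML)
correct, total = evaluate(samples, toy_detector)
print(f"{correct}/{total} samples classified as labelled")
```

Because the detector only ever sees the captured data, the same loop could be reused to compare several detectors on one dataset, which is the kind of comparison Torpeda is meant to support.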

Collaboration

Collaboration is central to Torpeda. Our goal is to provide not only a specification and tools that facilitate the construction of new datasets, but also a way to share them and make them public.

To this end, we expect to collect as many datasets as possible and make them available to the community. This process is dynamic: since web attacks (and detection systems) are continuously changing, it becomes necessary to have more specific traffic datasets available.

If you want to contribute your data, contact us.