Reformatting Argus Flow Level Data for Datapository
The following guide describes how to convert Argus flow level text data for insertion into the Datapository FLOWS table.
Argus Flow Level Format
The original Argus flow level format is as follows:
Fields:
- StartT: the start time in seconds of the flow
- FinT: the finish time in seconds of the flow
- Left_IP_Port: the left IP address and the port
- Flow_Dir: a character representation of the flow direction
- Right_IP_Port: the right IP address and the port
- Src_P: the source packet count
- Dst_P: the destination packet count
- Src_B: the source byte count
- State: the final state of the connection when it was recorded
Our Argus dataset has these flows aggregated into five-minute intervals, stored in compressed files such as core-full.2005.02.01.02.55.gz
The format of the filenames is core-full.year.month.day.hour.minute.gz, in which the timestamp embedded in the filename represents the start of the five-minute interval that all of the flows it contains belong to. These files are stored in directories that mirror the same interval hierarchy, Data/archive/year/month/day/..., for example: Data/archive/2005/02/01/02/core-full.2005.02.01.02.55.gz
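The naming scheme above can be decoded mechanically. The following Python sketch is illustrative only and is not part of dp_reformat; the function names interval_start and archive_path are hypothetical, and the timestamp is treated as a naive datetime because the guide does not state a time zone.

```python
from datetime import datetime
from pathlib import PurePosixPath


def interval_start(path: str) -> datetime:
    """Return the interval start time encoded in a core-full filename."""
    name = PurePosixPath(path).name  # e.g. core-full.2005.02.01.02.55.gz
    parts = name.split(".")
    if len(parts) != 7 or parts[0] != "core-full" or parts[-1] != "gz":
        raise ValueError(f"unexpected filename: {name}")
    year, month, day, hour, minute = (int(p) for p in parts[1:6])
    return datetime(year, month, day, hour, minute)


def archive_path(start: datetime, root: str = "Data/archive") -> str:
    """Rebuild the archive path for an interval start time."""
    directory = f"{root}/{start:%Y/%m/%d/%H}"
    return f"{directory}/core-full.{start:%Y.%m.%d.%H.%M}.gz"
```

Round-tripping a path through interval_start and archive_path should give back the original path, which is a quick sanity check on an archive tree.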
Parsing the Argus Flow Level Data
The chosen output of the Argus data has made parsing it to a more universal format slightly painful.
Separating an IP address and port with a '.' is a bad idea, especially since not all flow records have an associated port, such as ICMP data. If you treat everything past the last '.' as the port, you will parse ICMP records incorrectly. You must count the '.' characters to determine whether a port exists, and then parse the field accordingly. This could have been simplified by separating the IP address and port with a space, as every other field is.
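The dot-counting rule can be sketched as follows. This is an illustration in Python, not the code used by dp_reformat; the name split_ip_port is hypothetical, and it assumes IPv4 addresses and numeric ports.

```python
def split_ip_port(field: str):
    """Split 'a.b.c.d' or 'a.b.c.d.port' into (ip, port).

    port is None when the field has no port (for example ICMP records).
    """
    parts = field.split(".")
    if len(parts) == 4:
        # Four components: a bare IPv4 address with no port.
        return field, None
    if len(parts) == 5:
        # Five components: the last one is the port.
        return ".".join(parts[:4]), int(parts[4])
    raise ValueError(f"cannot parse IP/port field: {field!r}")
```

For example, split_ip_port("10.0.0.1.80") yields the address "10.0.0.1" with port 80, while split_ip_port("10.0.0.1") yields the same address with no port.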
Recording the protocol as a text name, which is not necessarily universal, is a poor fit for a universal storage repository. The tool converts it to the protocol number, which is universal. The State field has the same problem, but I am unaware of any universal encoding for it, so it is kept in its original format.
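A minimal sketch of the name-to-number conversion, in Python. The table below is a hypothetical subset holding only three well-known IANA protocol numbers; the actual mapping used by dp_reformat is not shown in this guide and would need to cover every name that appears in the data.

```python
# Hypothetical subset of the IANA protocol numbers; extend as needed.
PROTOCOL_NUMBERS = {
    "icmp": 1,
    "tcp": 6,
    "udp": 17,
}


def protocol_number(name: str) -> int:
    """Map a protocol name to its number; pass numeric strings through."""
    key = name.strip().lower()
    if key.isdigit():
        return int(key)
    try:
        return PROTOCOL_NUMBERS[key]
    except KeyError:
        raise ValueError(f"unknown protocol name: {name!r}") from None
```

Raising on an unknown name, rather than guessing, keeps unmapped protocols from silently entering the database under the wrong number.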
Having a flow direction that must be parsed for each flow would also make database queries painful. The Argus auditing tool cannot always determine directionality. When the direction is unknown, a number of heuristics are applied to determine it, and if it is still unknown, the flow is marked as unknown in the database. Otherwise, the flow is converted into a format with a source IP address and a destination IP address, rather than a left and a right address with a flow direction.
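The left/right to source/destination conversion might look like the Python sketch below. The direction strings "->" and "<-" are assumptions, since this guide does not list the exact Flow_Dir values, and the heuristics applied to unknown directions are not described here, so they are deliberately left out. The name normalize_flow is hypothetical.

```python
def normalize_flow(left: str, flow_dir: str, right: str):
    """Return (source, destination, direction_known).

    Assumes '->' means left is the source and '<-' means right is the
    source. Any other value is reported as unknown, with left kept as
    the source, so a later heuristic pass can revisit it.
    """
    direction = flow_dir.strip()
    if direction == "->":
        return left, right, True
    if direction == "<-":
        return right, left, True
    return left, right, False
```

Returning an explicit direction_known flag lets the unknown case be stored in the database instead of being guessed at parse time.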
Reformatting Tool
In our code repository, there is a conversion tool (dp_reformat) which takes a data path full of compressed Argus flow level data files and converts them to Datapository format for insertion.
The tool parses the Argus flow level data and copies the result to Datapository with scp, where it can be inserted into the database.
To use the tool, you specify a path that contains the Argus flow level data. The tool traverses the path recursively and writes the converted data to a file such as core-full.2005.02.01.02.55.gz-dp.
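The traversal and output naming described above can be approximated in Python as follows. This is a sketch under assumptions, not the dp_reformat source: the function names are hypothetical, the progress line only imitates the "index / total | path" status format, and the conversion step itself is left as a comment.

```python
from pathlib import Path


def find_inputs(root):
    """Return every core-full.*.gz file under root, sorted by path."""
    return sorted(p for p in Path(root).rglob("core-full.*.gz") if p.is_file())


def output_name(path):
    """Name of the converted file: the input name with '-dp' appended."""
    return Path(str(path) + "-dp")


def progress_line(index, total, path):
    """Format one status line as 'index / total | path'."""
    return f"{index} / {total} | {path}"


def run(root):
    """Walk root and report each file that would be converted."""
    files = find_inputs(root)
    for index, path in enumerate(files):
        print(progress_line(index, len(files), path))
        # The conversion of `path` into output_name(path) would go here.
```

Because the glob pattern must match the whole filename, previously written -dp outputs are not picked up again as inputs on a second run.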
An example usage for converting all of the data from February of 2005, which also displays its current status, is:
$ ./dp_reformat /mnt/campus-2005-1TB/Data/archive/2005/02
0 / 7726 | /mnt/campus-2005-1TB/Data/archive/2005/02/01/00/core-full.2005.02.01.00.00.gz
1 / 7726 | /mnt/campus-2005-1TB/Data/archive/2005/02/01/00/core-full.2005.02.01.00.05.gz
2 / 7726 | /mnt/campus-2005-1TB/Data/archive/2005/02/01/00/core-full.2005.02.01.00.10.gz
3 / 7726 | /mnt/campus-2005-1TB/Data/archive/2005/02/01/00/core-full.2005.02.01.00.15.gz
Currently, the command-line parameter is not yet implemented (the tool always parses February), and the scp destination on Datapository is hard-coded to my directory. Both should be changed for general use.

