Partitioning the Flows Table
The flows table is the largest within the database. It currently contains more than 2.5 billion flows and can easily grow in size as more datasets are introduced. This creates a performance problem when running queries on the table, which for entropy-based anomaly detection typically run on a single interval at a time. For example, when calculating entropy you calculate it on some aggregate amount of flows which are five minutes worth of flows within our study. The optimization would then be to make sure that when running a query like such, it only needs to parse the couple hundred thousand flows in the interval rather than a couple billion flows.
To achieve this, the FLOWS table is partitioned in to five minute intervals. This is optimal for 99.999% of all queries run on the database. Entropy, top hosts in terms of a metric, and synthetic analysis are all done in terms of five minute intervals. All attributes are inherited from the master FLOWS table and the flow_id attribute is unique only to each partition. The flow_id attribute is important for the manual labeling of flows in to observed and confirmed attacks.
Generating the Partitions
Generating the partitions is done in conjunction with inserting flows in to the table. This keeps the database optimal, ensuring that partitions are only created when flows actually exist for the interval it represents. However we will describe the partitions here.
The partitions are broken down in to five minute intervals such that the name of the partition is flows_epoch where epoch is the time in seconds that have passed since the epoch at the start time of the interval. For example, the original Argus flow data for February 01, 2005 00:00:00 is stored in core-full.2005.01.00.00.gz. The epoch representation for this interval is 1107234000, therefore all of the flows for this interval are stored in the partition flows_1107234000. It must also be ensured that only flows that belong to this interval are within this partition, this is done with a CHECK constraint.
Creating this partition in Datapository would be done as follows:
CREATE TABLE flows_1107234000
(CHECK (interval='2005-02-01 00:00:00'))
INHERITS (flows);
Inserting flows in to this partition can be done in two ways, ensuring the flows go directly in to flows_1107234000 or creating an insertion rule on the FLOWS table such that any flow insertions with an interval='2005-02-01 00:00:00' go to this partition. For performance when inserting bulk data we insert directly in to the partition, however we also create the insertion rule for subsequent insertions.
Creating the insertion rule on the FLOWS table would be done as follows:
CREATE RULE flows_insert_1107234000 AS
ON INSERT TO flows
WHERE (interval='2005-02-01 00:00:00')
DO INSTEAD
INSERT INTO flows_1107234000
VALUES (NEW.interval,
NEW.start_time,
NEW.finish_time,
NEW.protocol,
NEW.src_ip,
NEW.dst_ip,
NEW.src_port,
NEW.dst_port,
NEW.src_packets,
NEW.dst_packets,
NEW.src_bytes,
NEW.dst_bytes,
NEW.state,
NEW.dir_unknown,
NEW.flow_id);
Flows can then be inserted directly in to the FLOWS table without worry of determining which partition to put them in.
Inserting Flows
A dp_insert_flows.rb script is provided to assist in the insertion of flows contained in Datapository flow formatted files.
The script provides several functions:
- Creating a partition in the FLOWS table
- Creating an insertion rule for flows of the given interval into the partition
- Inserting the actual flows in the partition
Sample usage of the dp_insert_flows.rb script is as follows:
-bash-2.05b$ ./dp_insert_flows.rb core-full.2005.02.01.01.00.gz-db
