| | 1 | = Partitioning the Flows Table = |
|---|
| | 2 | |
|---|
| | 3 | The flows table is the largest within the database. It currently contains more than 2.5 billion flows and can easily grow in size as more datasets are introduced. This creates a performance problem when running queries on the table, which for entropy-based anomaly detection typically run on a single interval at a time. For example, when calculating entropy you calculate it on some aggregate amount of flows which are five minutes worth of flows within our study. The optimization would then be to make sure that when running a query like such, it only needs to parse the couple hundred thousand flows in the interval rather than a couple billion flows. |
|---|
| | 4 | |
|---|
| | 5 | To achieve this, the ''[wiki:EntityDictionary#FLOWS FLOWS]'' table is partitioned in to five minute intervals. This is optimal for 99.999% of all queries run on the database. Entropy, top hosts in terms of a metric, and synthetic analysis are all done in terms of five minute intervals. All attributes are inherited from the master ''[wiki:EntityDictionary#FLOWS FLOWS]'' table and the ''[wiki:EntityDictionary#FLOWS flow_id]'' attribute is unique only to each partition. The ''[wiki:EntityDictionary#FLOWS flow_id]'' attribute is important for the manual labeling of flows in to observed and confirmed attacks. |
|---|
| | 6 | |
|---|
| | 7 | == Generating the Partitions == |
|---|
| | 8 | |
|---|
| | 9 | Generating the partitions is done in conjunction with inserting flows in to the table. This keeps the database optimal, ensuring that partitions are only created when flows actually exist for the interval it represents. However we will describe the partitions here. |
|---|
| | 10 | |
|---|
| | 11 | The partitions are broken down in to five minute intervals such that the name of the partition is ''flows_epoch'' where ''epoch'' is the time in seconds that have passed since the ''[http://en.wikipedia.org/wiki/Epoch_%28reference_date%29 epoch]'' at the start time of the interval. For example, the original [http://www.qosient.com/argus/ Argus] flow data for February 01, 2005 00:00:00 is stored in core-full.2005.01.00.00.gz. The ''[http://en.wikipedia.org/wiki/Epoch_%28reference_date%29 epoch]'' representation for this interval is 1107234000, therefore all of the flows for this interval are stored in the partition ''flows_1107234000''. It must also be ensured that only flows that belong to this interval are within this partition, this is done with a [http://www.postgresql.org/docs/8.1/static/ddl-constraints.html CHECK constraint]. |
|---|
| | 12 | |
|---|
| | 13 | Creating this partition in [http://datapository.net Datapository] would be done as follows: |
|---|
| | 14 | {{{ |
|---|
| | 15 | CREATE TABLE flows_1107234000 |
|---|
| | 16 | (CHECK (interval='2005-02-01 00:00:00')) |
|---|
| | 17 | INHERITS (flows); |
|---|
| | 18 | }}} |
|---|
| | 19 | |
|---|
| | 20 | Inserting flows in to this partition can be done in two ways, ensuring the flows go directly in to ''flows_1107234000'' or creating an insertion rule on the ''[wiki:EntityDictionary#FLOWS FLOWS]'' table such that any flow insertions with an interval='2005-02-01 00:00:00' go to this partition. For performance when inserting bulk data we insert directly in to the partition, however we also create the insertion rule for subsequent insertions. |
|---|
| | 21 | |
|---|
| | 22 | Creating the insertion rule on the ''[wiki:EntityDictionary#FLOWS FLOWS]'' table would be done as follows: |
|---|
| | 23 | {{{ |
|---|
| | 24 | CREATE RULE flows_insert_1107234000 AS |
|---|
| | 25 | ON INSERT TO flows |
|---|
| | 26 | WHERE (interval='2005-02-01 00:00:00') |
|---|
| | 27 | DO INSTEAD |
|---|
| | 28 | INSERT INTO flows_1107234000 |
|---|
| | 29 | VALUES (NEW.interval, |
|---|
| | 30 | NEW.start_time, |
|---|
| | 31 | NEW.finish_time, |
|---|
| | 32 | NEW.protocol, |
|---|
| | 33 | NEW.src_ip, |
|---|
| | 34 | NEW.dst_ip, |
|---|
| | 35 | NEW.src_port, |
|---|
| | 36 | NEW.dst_port, |
|---|
| | 37 | NEW.src_packets, |
|---|
| | 38 | NEW.dst_packets, |
|---|
| | 39 | NEW.src_bytes, |
|---|
| | 40 | NEW.dst_bytes, |
|---|
| | 41 | NEW.state, |
|---|
| | 42 | NEW.dir_unknown, |
|---|
| | 43 | NEW.flow_id); |
|---|
| | 44 | }}} |
|---|
| | 45 | |
|---|
| | 46 | Flows can then be inserted directly in to the ''[wiki:EntityDictionary#FLOWS FLOWS]'' table without worry of determining which partition to put them in. |
|---|
| | 47 | |
|---|
| | 48 | == Inserting Flows == |
|---|
| | 49 | |
|---|
| | 50 | A [source:scripts/ruby/dp_insert_flows.rb dp_insert_flows.rb] script is provided to assist in the insertion of flows contained in [http://datapository.net Datapository] [wiki:ReformatForDP flow formatted files]. |
|---|
| | 51 | |
|---|
| | 52 | The script provides several functions: |
|---|
| | 53 | 1. Creating a partition in the ''[wiki:EntityDictionary#FLOWS FLOWS]'' table |
|---|
| | 54 | 1. Creating an insertion rule for flows of the given interval into the partition |
|---|
| | 55 | 1. Inserting the actual flows in the partition |
|---|
| | 56 | |
|---|
| | 57 | Sample usage of the [source:scripts/ruby/dp_insert_flows.rb dp_insert_flows.rb] script is as follows: |
|---|
| | 58 | {{{ |
|---|
| | 59 | -bash-2.05b$ ./dp_insert_flows.rb core-full.2005.02.01.01.00.gz-db |
|---|
| | 60 | }}} |