Changes from Version 1 of PartitioningFlowsTable

Show
Ignore:
Author:
trac (IP: 127.0.0.1)
Timestamp:
06/14/07 15:50:47 (3 years ago)
Comment:

--

Legend:

Unmodified
Added
Removed
Modified
  • PartitioningFlowsTable

    v0 v1  
     1= Partitioning the Flows Table = 
     2 
     3The flows table is the largest within the database.  It currently contains more than 2.5 billion flows and can easily grow in size as more datasets are introduced.  This creates a performance problem when running queries on the table, which for entropy-based anomaly detection typically run on a single interval at a time.  For example, when calculating entropy you calculate it on some aggregate amount of flows which are five minutes worth of flows within our study.  The optimization would then be to make sure that when running a query like such, it only needs to parse the couple hundred thousand flows in the interval rather than a couple billion flows. 
     4 
     5To achieve this, the ''[wiki:EntityDictionary#FLOWS FLOWS]'' table is partitioned in to five minute intervals.  This is optimal for 99.999% of all queries run on the database.  Entropy, top hosts in terms of a metric, and synthetic analysis are all done in terms of five minute intervals.  All attributes are inherited from the master ''[wiki:EntityDictionary#FLOWS FLOWS]'' table and the ''[wiki:EntityDictionary#FLOWS flow_id]'' attribute is unique only to each partition.  The ''[wiki:EntityDictionary#FLOWS flow_id]'' attribute is important for the manual labeling of flows in to observed and confirmed attacks. 
     6 
     7== Generating the Partitions == 
     8 
     9Generating the partitions is done in conjunction with inserting flows in to the table.  This keeps the database optimal, ensuring that partitions are only created when flows actually exist for the interval it represents.  However we will describe the partitions here. 
     10 
     11The partitions are broken down in to five minute intervals such that the name of the partition is ''flows_epoch'' where ''epoch'' is the time in seconds that have passed since the ''[http://en.wikipedia.org/wiki/Epoch_%28reference_date%29 epoch]'' at the start time of the interval.  For example, the original [http://www.qosient.com/argus/ Argus] flow data for February 01, 2005 00:00:00 is stored in core-full.2005.01.00.00.gz.  The ''[http://en.wikipedia.org/wiki/Epoch_%28reference_date%29 epoch]'' representation for this interval is 1107234000, therefore all of the flows for this interval are stored in the partition ''flows_1107234000''.  It must also be ensured that only flows that belong to this interval are within this partition, this is done with a [http://www.postgresql.org/docs/8.1/static/ddl-constraints.html CHECK constraint]. 
     12 
     13Creating this partition in [http://datapository.net Datapository] would be done as follows: 
     14{{{ 
     15   CREATE TABLE flows_1107234000 
     16    (CHECK (interval='2005-02-01 00:00:00'))  
     17   INHERITS (flows); 
     18}}} 
     19 
     20Inserting flows in to this partition can be done in two ways, ensuring the flows go directly in to ''flows_1107234000'' or creating an insertion rule on the ''[wiki:EntityDictionary#FLOWS FLOWS]'' table such that any flow insertions with an interval='2005-02-01 00:00:00' go to this partition.  For performance when inserting bulk data we insert directly in to the partition, however we also create the insertion rule for subsequent insertions. 
     21 
     22Creating the insertion rule on the ''[wiki:EntityDictionary#FLOWS FLOWS]'' table would be done as follows: 
     23{{{ 
     24   CREATE RULE flows_insert_1107234000 AS 
     25   ON INSERT TO flows  
     26      WHERE (interval='2005-02-01 00:00:00') 
     27   DO INSTEAD 
     28      INSERT INTO flows_1107234000 
     29      VALUES (NEW.interval, 
     30              NEW.start_time, 
     31              NEW.finish_time, 
     32              NEW.protocol, 
     33              NEW.src_ip, 
     34              NEW.dst_ip, 
     35              NEW.src_port, 
     36              NEW.dst_port, 
     37              NEW.src_packets, 
     38              NEW.dst_packets, 
     39              NEW.src_bytes, 
     40              NEW.dst_bytes, 
     41              NEW.state, 
     42              NEW.dir_unknown, 
     43              NEW.flow_id); 
     44}}} 
     45 
     46Flows can then be inserted directly in to the ''[wiki:EntityDictionary#FLOWS FLOWS]'' table without worry of determining which partition to put them in. 
     47 
     48== Inserting Flows == 
     49 
     50A [source:scripts/ruby/dp_insert_flows.rb dp_insert_flows.rb] script is provided to assist in the insertion of flows contained in [http://datapository.net Datapository] [wiki:ReformatForDP flow formatted files]. 
     51 
     52The script provides several functions: 
     53  1. Creating a partition in the ''[wiki:EntityDictionary#FLOWS FLOWS]'' table 
     54  1. Creating an insertion rule for flows of the given interval into the partition 
     55  1. Inserting the actual flows in the partition 
     56 
     57Sample usage of the [source:scripts/ruby/dp_insert_flows.rb dp_insert_flows.rb] script is as follows: 
     58{{{ 
     59   -bash-2.05b$ ./dp_insert_flows.rb core-full.2005.02.01.01.00.gz-db 
     60}}}