Correlation
When comparing traffic metrics or anomaly detection methods, computing correlation coefficients between pairs of them helps in understanding their similarities and differences. Correlating entropy data shows how similar two metrics are in the raw inferences they can make about the network. Correlating alarms across different traffic metrics under a single anomaly detection method gives a high-level view of which metrics generate similar sets of alarms. Conversely, correlating alarms produced by two different anomaly detection methods on a single metric shows how similar or different those methods are.
Generating Correlation Scores
Correlation scores are easy to generate with the correlation function in MATLAB. However, we provide several scripts to aid in computing correlation scores across numerous metrics, methods, or any other data generated by Datapository.
- Usage: correlate <data_file_1> <data_file_2> <label_1> <label_2> <out_file>
- Description: correlate the second column in two data files and append their correlation to the specified outfile
- Example usage:
    >> correlate entropy-degree_in entropy-degree_out degree_in degree_out correlation_scores
    ans = 0.0552
    >> exit
    $ cat correlation_scores
    degree_in degree_out 0.055236
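As a rough illustration of what a call to correlate computes, the following Ruby sketch reads the second column of two data files, computes their Pearson correlation coefficient, and appends a labelled score to an output file. It is not the provided MATLAB script: the helper names (read_second_column, pearson, correlate_files) are hypothetical, the six-decimal output format is inferred from the example above, and the two files are assumed to contain the same timestamps in the same order.

```ruby
# Sketch only: these helper names are hypothetical and are not part of the
# provided scripts.

# Read the second whitespace-separated column of a data file as floats.
# Assumes one "<timestamp> <value>" pair per line; blank lines are skipped.
def read_second_column(path)
  File.readlines(path)
      .map(&:split)
      .reject(&:empty?)
      .map { |fields| Float(fields[1]) }
end

# Pearson correlation coefficient of two equal-length numeric arrays.
# Returns NaN if either series is constant (zero variance).
def pearson(xs, ys)
  raise ArgumentError, "series must have the same length" unless xs.length == ys.length
  raise ArgumentError, "need at least two samples" if xs.length < 2

  n = xs.length.to_f
  mean_x = xs.sum / n
  mean_y = ys.sum / n
  cov   = xs.zip(ys).sum { |x, y| (x - mean_x) * (y - mean_y) }
  var_x = xs.sum { |x| (x - mean_x)**2 }
  var_y = ys.sum { |y| (y - mean_y)**2 }
  cov / Math.sqrt(var_x * var_y)
end

# Correlate two files and append "<label_1> <label_2> <score>" to out_path.
# The "%.6f" format is an assumption based on the example output above.
def correlate_files(file_1, file_2, label_1, label_2, out_path)
  score = pearson(read_second_column(file_1), read_second_column(file_2))
  File.open(out_path, "a") do |out|
    out.puts format("%s %s %.6f", label_1, label_2, score)
  end
  score
end
```

Because the sketch pairs values by row position rather than by timestamp, series with missing or misaligned samples would need to be aligned before they are correlated.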
Correlating many pieces of data one pair at a time is tedious. To simplify the process, the Ruby script gen_correlations.rb is provided as a wrapper for running correlations in MATLAB; its output is formatted as a matrix so that it is easy to read. Since the data typically represents a time series, the first column of each data file is the timestamp and the second is the data value. The data need not be entropy data, but for now the values being correlated must be in the second column of the file for the script to work.
- Usage: ./gen_correlations.rb <file_with_data_list>
- Description: reads the list of data files named by the command line argument, one entry per line in the format <label> <path_to_data>, then correlates every pair of data files and displays the results as a LaTeX table
- Example usage:
    $ cat metrics_entropy
    indegree ../../traffic_data/entropy/entropy-degree_in
    outdegree ../../traffic_data/entropy/entropy-degree_out
    addr_src ../../traffic_data/entropy/entropy-addr_src
    addr_dst ../../traffic_data/entropy/entropy-addr_dst
    ports_src ../../traffic_data/entropy/entropy-ports_src
    ports_dst ../../traffic_data/entropy/entropy-ports_dst
    fsd ../../traffic_data/entropy/entropy-fsd
    $ ./gen_correlations.rb metrics_entropy
    \hline
    &outdegree &addr_src &addr_dst &ports_src &ports_dst & fsd \\
    \hline
    indegree& -1.0000& -1.0000& -1.0000& -1.0000& -1.0000& -1.0000\\
    \hline
    outdegree& -& -1.0000& -1.0000& -1.0000& -1.0000& -1.0000\\
    \hline
    addr_src& -& -& -1.0000& -1.0000& -1.0000& -1.0000\\
    \hline
    addr_dst& -& -& -& -1.0000& -1.0000& -1.0000\\
    \hline
    ports_src& -& -& -& -& -1.0000& -1.0000\\
    \hline
    ports_dst& -& -& -& -& -& -1.0000\\
    \hline
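The upper-triangular layout of the table can be sketched as follows. This is a minimal illustration and not the code of gen_correlations.rb: the function names are hypothetical, the correlation is computed directly in Ruby rather than through MATLAB, and the series are assumed to be already loaded into memory as [label, values] pairs of equal length.

```ruby
# Sketch only: hypothetical helpers, not the code of gen_correlations.rb.

# Pearson correlation coefficient of two equal-length numeric arrays.
def pearson(xs, ys)
  raise ArgumentError, "series must have the same length" unless xs.length == ys.length

  n = xs.length.to_f
  mean_x = xs.sum / n
  mean_y = ys.sum / n
  cov   = xs.zip(ys).sum { |x, y| (x - mean_x) * (y - mean_y) }
  var_x = xs.sum { |x| (x - mean_x)**2 }
  var_y = ys.sum { |y| (y - mean_y)**2 }
  cov / Math.sqrt(var_x * var_y)
end

# Build the rows of a LaTeX table holding the pairwise correlations.
# series is an array of [label, values] pairs. Only pairs above the
# diagonal are computed; the remaining cells are filled with "-".
def correlation_table(series)
  labels = series.map(&:first)
  # The header row lists every label except the first.
  lines = ["\\hline", "& #{labels.drop(1).join(' & ')} \\\\", "\\hline"]

  # One row per label except the last.
  series[0...-1].each_with_index do |(label, xs), i|
    cells = series.drop(1).each_with_index.map do |(_, ys), j|
      # Column j holds the series at index j + 1.
      j + 1 > i ? format("%.4f", pearson(xs, ys)) : "-"
    end
    lines << "#{label} & #{cells.join(' & ')} \\\\"
    lines << "\\hline"
  end
  lines.join("\n")
end

# Usage with three small made-up series:
puts correlation_table([
  ["a", [1, 2, 3]],
  ["b", [2, 4, 6]],
  ["c", [3, 2, 1]]
])
```

Because correlation is symmetric, each pair is computed once, which is why the table has one fewer row and column than there are data files.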
