sql-data-profiler
sql-data-profiler is a utility module that generates SQL code to profile data sets (e.g. tables) in Redshift.
Installation
npm install sql-data-profiler --save
Quick start
var sql_data_profiler = require('sql-data-profiler');
Table profiler
Generates a SQL statement that provides basic stats on the data in the "contacts" table.
var data_profiler = sql_data_profiler.data_profiler;
var options = {
  target_table: 'contacts',
  target_columns: ['email', 'a_industry']
};
var sql_code = ;
The table profiler accepts the following options:
- target_table
- target_columns
- results_table
- calculate_frequency
- use_perm_table
- truncate_table
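As a rough illustration of the kind of statement a table profiler could generate from `target_table` and `target_columns`, here is a minimal sketch. `buildColumnStatsSql` is a hypothetical helper written for this example, not part of the sql-data-profiler API, and it assumes `target_columns` is an array of trusted column names (identifiers are not escaped). It covers only a few of the stats and ignores the remaining options.

```javascript
// Hypothetical helper for illustration only; NOT the module's actual implementation.
// Assumes options.target_table is a string and options.target_columns is an array
// of trusted identifiers (no quoting or escaping is performed).
function buildColumnStatsSql(options) {
  var table = options.target_table;
  var selects = options.target_columns.map(function (column) {
    return [
      "SELECT",
      "  '" + column + "' AS column_name,",
      "  COUNT(*) AS count_total,",
      // COUNT(column) skips NULLs, so it gives the non-null record count
      "  COUNT(" + column + ") AS count_not_null,",
      // NULLIF guards against division by zero on an empty table
      "  COUNT(" + column + ")::float / NULLIF(COUNT(*), 0) AS fill_rate,",
      "  COUNT(DISTINCT " + column + ") AS count_distinct",
      "FROM " + table
    ].join("\n");
  });
  // One SELECT per column, stacked into a single result set
  return selects.join("\nUNION ALL\n") + ";";
}

var example_sql = buildColumnStatsSql({
  target_table: 'contacts',
  target_columns: ['email', 'a_industry']
});
console.log(example_sql);
```

Emitting one `SELECT` per column and joining them with `UNION ALL` keeps the result to one row per column, which is easy to write into a results table.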
Distribution analysis
**What does this do?**
var distribution_analyzer = sql_data_profiler.distribution_analyzer;
var sql_code = ;
Table stats
The following stats are calculated for each column.
Stats | Description |
---|---|
count_total | number of records in the table |
count_not_null | number of records where the value for the specified column is not null |
fill_rate | number of non-null values divided by number of records |
count_distinct | number of distinct values |
dupe_rate | number of distinct values divided by number of records where the value for the specified column is not null |
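maximum_value | largest value in the column |
minimum_value | smallest value in the column |
most_frequent_value_1 | most frequent value in the column |
most_frequent_value_1_frequency | frequency of the most frequent value |
most_frequent_value_2 | second most frequent value in the column |
most_frequent_value_2_frequency | frequency of the second most frequent value |
most_frequent_value_3 | third most frequent value in the column |
most_frequent_value_3_frequency | frequency of the third most frequent value |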
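
To make the definitions above concrete, the following sketch computes several of the same stats in plain JavaScript for an in-memory array of column values. `profileColumn` is a hypothetical function for illustration; the module itself generates SQL rather than computing anything in Node. It treats `null` and `undefined` as null values and reports the frequency as a raw count, which is an assumption.

```javascript
// Hypothetical in-memory illustration of the stat definitions; not part of the module.
function profileColumn(values) {
  var countTotal = values.length;
  var nonNull = values.filter(function (v) {
    return v !== null && v !== undefined;
  });

  // Tally how many times each distinct non-null value occurs
  var counts = new Map();
  nonNull.forEach(function (v) {
    counts.set(v, (counts.get(v) || 0) + 1);
  });

  // Sort distinct values by descending occurrence count
  var ranked = Array.from(counts.entries()).sort(function (a, b) {
    return b[1] - a[1];
  });

  return {
    count_total: countTotal,
    count_not_null: nonNull.length,
    fill_rate: countTotal ? nonNull.length / countTotal : null,
    count_distinct: counts.size,
    // As defined in the table: distinct values divided by non-null records
    dupe_rate: nonNull.length ? counts.size / nonNull.length : null,
    most_frequent_value_1: ranked.length ? ranked[0][0] : null,
    // Assumption: frequency is reported as a raw count
    most_frequent_value_1_frequency: ranked.length ? ranked[0][1] : null
  };
}

console.log(profileColumn(['a', 'b', 'a', null]));
```

For the four sample values, three are non-null and two are distinct, so `fill_rate` is 3/4 and `dupe_rate` is 2/3.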
TO DO
- handle different data types (e.g. boolean)
- performance improvement