pax_global_header00006660000000000000000000000064144621116510014513gustar00rootroot00000000000000
52 comment=8ba5ef841e5115cb99a47f81f5b46c47af9ab8ef
aggregate-0.2.4/000077500000000000000000000000001446211165100134445ustar00rootroot00000000000000
aggregate-0.2.4/.gitignore000066400000000000000000000000221446211165100154260ustar00rootroot00000000000000
Gemfile.lock
pkg/
aggregate-0.2.4/.travis.yml000066400000000000000000000003271446211165100155570ustar00rootroot00000000000000
arch:
  - amd64
  - ppc64le
language: ruby
rvm:
  - 1.9.3
  - 2.3
  - 2.5
  - 2.6
  - 2.7
matrix:
  exclude:
    - rvm: 1.9.3
      arch: ppc64le
    - rvm: 2.3
      arch: ppc64le
    - rvm: 2.5
      arch: ppc64le
aggregate-0.2.4/Gemfile000066400000000000000000000000701446211165100147340ustar00rootroot00000000000000
source "https://rubygems.org"

gem "rake"
gem "minitest"
aggregate-0.2.4/LICENSE000066400000000000000000000020441446211165100144510ustar00rootroot00000000000000
Copyright (c) 2009 Joseph Ruscio

Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the "Software"), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
aggregate-0.2.4/README.textile000066400000000000000000000206111446211165100160010ustar00rootroot00000000000000
h1. Aggregate

By Joseph Ruscio

Aggregate is an intuitive Ruby implementation of a statistics aggregator that includes both default and configurable histogram support. It does this without recording or storing any of the actual sample values, so its memory footprint does not grow with the number of samples. This makes it suitable for tracking statistics across millions or billions of samples. Originally inspired by the Aggregate support in "SystemTap.":http://sourceware.org/systemtap

h2. Getting Started

Aggregates are easy to instantiate, populate with sample data, and then inspect for common aggregate statistics:

# Instantiate an aggregate, then use the << operator to add each sample to it:
stats = Aggregate.new

loop do
  # Take some action that generates a sample measurement
  stats << sample
end

# The number of samples
stats.count

# The average
stats.mean

# Max sample value
stats.max

# Min sample value
stats.min

# The standard deviation
stats.std_dev
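
How these statistics can be kept without storing samples can be sketched as follows. This is a minimal, self-contained illustration and not the gem's code: the class name @RunningStats@ is hypothetical, and it only mirrors the running count, sum, and sum-of-squares approach that Aggregate uses.

```ruby
# Hypothetical illustration class; not part of the aggregate gem.
# Keeps only running totals, never the samples themselves.
class RunningStats
  attr_reader :count, :min, :max

  def initialize
    @count = 0
    @sum = 0.0
    @sum2 = 0.0 # running sum of squared samples
    @min = nil
    @max = nil
  end

  # Fold one sample into the running totals
  def <<(sample)
    @min = sample if @min.nil? || sample < @min
    @max = sample if @max.nil? || sample > @max
    @count += 1
    @sum += sample
    @sum2 += sample * sample
    self
  end

  def mean
    @sum / @count
  end

  # Sample standard deviation (n - 1 in the denominator)
  def std_dev
    Math.sqrt((@sum2 - (@sum * @sum) / @count) / (@count - 1))
  end
end

running = RunningStats.new
[2, 4, 4, 4, 5, 5, 7, 9].each { |x| running << x }

puts running.count
puts running.mean
puts running.std_dev
```

Whatever the number of samples, the object holds five numbers, which is why the memory cost stays flat.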
h2. Histograms

Perhaps more important than the basic aggregate statistics detailed above, Aggregate also maintains a histogram of samples. For anything other than normally distributed data, summary statistics such as the mean are insufficient at best and often downright misleading. 37Signals posted a terse but effective "explanation":http://37signals.com/svn/posts/1836-the-problem-with-averages of the importance of histograms. Aggregate maintains its histogram internally as a set of "buckets". Each bucket represents a range of possible sample values. The set of all buckets represents the range of "normal" sample values.

h3. Binary Histograms

Without any configuration, an Aggregate instance maintains a binary histogram, where each bucket represents a range twice as large as the preceding bucket, i.e. [1,1], [2,3], [4,5,6,7], [8,9,10,11,12,13,14,15]. The default binary histogram provides for 128 buckets, theoretically covering the range [1, (2^127) - 1]. (See NOTES below for a discussion of the effects in practice of insufficient precision.)

Binary histograms are useful when we have little idea of what the sample distribution may look like, as almost any positive value will fall into some bucket. After using binary histograms to determine the coarse-grained characteristics of your sample space, you can configure a linear histogram to examine it in closer detail.

h3. Linear Histograms

Linear histograms are specified with the three values low, high, and width. Low and high specify a range [low, high) of values included in the histogram (all others are outliers). Width specifies the number of values represented by each bucket and therefore the number of buckets, i.e. the granularity of the histogram. The histogram range (high - low) must be a multiple of width:

#Want to track aggregate stats on response times in ms
response_stats = Aggregate.new(0, 2000, 50)
The example above creates a linear histogram that tracks the response times from 0 ms to 2000 ms in buckets of width 50 ms. Hopefully most of your samples fall in the first couple of buckets!

h3. Histogram Outliers

An Aggregate records any samples that fall outside the histogram range as outliers:

# Number of samples that fall below the normal range
stats.outliers_low

# Number of samples that fall above the normal range
stats.outliers_high
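
The way a linear histogram sorts a sample into a bucket or an outlier count can be sketched as below. The helper name @linear_bucket_index@ is hypothetical and not part of the gem's API. It also computes the index arithmetically, whereas the gem searches its buckets one by one, so treat it as an illustration of the [low, high) rule rather than the library's implementation.

```ruby
# Hypothetical helper, for illustration only; not the gem's API.
# Returns :low or :high for outliers, otherwise the zero-based bucket index.
def linear_bucket_index(sample, low, high, width)
  raise ArgumentError, "high must be > low" if high <= low
  raise ArgumentError, "range must be a multiple of width" unless ((high - low) % width).zero?

  return :low if sample < low    # below the histogram range
  return :high if sample >= high # the range is half-open: high itself is an outlier

  ((sample - low) / width).floor
end

low, high, width = 0, 2000, 50

[-1, 0, 49, 50, 1999, 2000].each do |ms|
  puts "#{ms} ms -> #{linear_bucket_index(ms, low, high, width).inspect}"
end
```

With these settings there are (2000 - 0) / 50 = 40 buckets, indexed 0 to 39.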
h3. Histogram Iterators

Once a histogram is populated, Aggregate provides iterator support for examining the contents of buckets. The iterators yield each bucket's lower bound along with the number of samples in that bucket:

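# Examine every bucket; bucket is the lower bound of the bucket's range
stats.each do |bucket, count|
  puts "#{bucket}: #{count}"
end

# Examine only buckets containing samples
stats.each_nonzero do |bucket, count|
  puts "#{bucket}: #{count}"
end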
h3. Histogram Bar Chart

Finally, Aggregate contains pretty-printing support to generate ASCII bar charts. For any given number of columns >= 80 (defaults to 80) and any sample distribution, the to_s method sets a marker weight based on the samples per bucket and aligns all output. Empty buckets are skipped to conserve screen space; a skipped run is marked with a ~.

# Generate and display an 80 column histogram
puts stats.to_s

# Generate and display a 120 column histogram
puts stats.to_s(120)
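
The marker weight mentioned above can be sketched as follows. The helper name @bar_for@ is hypothetical. The arithmetic follows the gem's to_s method (one '@' is worth max_count / max_bar_width samples, never less than one sample), and the 66-character bar width is an assumption that corresponds to an 80 column chart with five-character value and count columns.

```ruby
# Hypothetical helper, for illustration only; not the gem's API.
# Scales a bucket count to a bar of '@' markers no wider than max_bar_width.
def bar_for(count, max_count, max_bar_width)
  # Samples represented by a single '@'; never less than one sample
  weight = [max_count.to_f / max_bar_width, 1.0].max
  "@" * (count / weight).to_i
end

max_bar_width = 66 # assumed: 80 columns minus value and count columns and separators
max_count = 32961  # largest bucket in the sample data

[267, 523, 4075, 16405].each do |count|
  puts "#{count.to_s.rjust(5)} |#{bar_for(count, max_count, max_bar_width)}"
end
```

Buckets holding fewer samples than one marker's weight get an empty bar, although their count is still printed.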
This code example populates both a binary and linear histogram with the same set of 65536 values generated by rand to produce the two histograms that follow it:

require 'rubygems'
require 'aggregate'

# Create one Aggregate with the default binary histogram and one with
# a linear histogram of 16 buckets, each 4096 wide
binary_aggregate = Aggregate.new
linear_aggregate = Aggregate.new(0, 65536, 4096)

65536.times do
  x = rand(65536)
  binary_aggregate << x
  linear_aggregate << x
end

puts binary_aggregate.to_s
puts linear_aggregate.to_s
h4. Binary Histogram

value |------------------------------------------------------------------| count
    1 |                                                                  |     3
    2 |                                                                  |     1
    4 |                                                                  |     5
    8 |                                                                  |     9
   16 |                                                                  |    15
   32 |                                                                  |    29
   64 |                                                                  |    62
  128 |                                                                  |   115
  256 |                                                                  |   267
  512 |@                                                                 |   523
 1024 |@                                                                 |   970
 2048 |@@@                                                               |  1987
 4096 |@@@@@@@@                                                          |  4075
 8192 |@@@@@@@@@@@@@@@@                                                  |  8108
16384 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@                                  | 16405
32768 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@| 32961
      ~
Total |------------------------------------------------------------------| 65535
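
Each row of the binary histogram is labelled with the power of two that starts its bucket. A sketch of that mapping is given below, with assumptions stated up front: the helper names are hypothetical, samples are assumed to be positive Integers, and @Integer#bit_length@ (Ruby 2.1 and later) is used in place of the gem's log(x)/log(2), so that the precision caveat described in NOTES does not apply to this sketch.

```ruby
# Hypothetical helpers, for illustration only; the gem itself uses log(x)/log(2).
# Assumes a positive Integer sample.
def binary_bucket_index(sample)
  unless sample.is_a?(Integer) && sample > 0
    raise ArgumentError, "sample must be a positive Integer"
  end
  # For a positive integer, bit_length - 1 is the integer part of log2
  sample.bit_length - 1
end

# Lower bound of the bucket that holds the sample
def binary_bucket_floor(sample)
  2**binary_bucket_index(sample)
end

[1, 5, 1028, 16383, 16384, 2**64 - 1].each do |x|
  puts "#{x} -> bucket #{binary_bucket_floor(x)}"
end
```

Because the arithmetic is done on integers only, 2**64 - 1 stays in the bucket that starts at 2**63 rather than drifting into the next one.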
h4. Linear (0, 65536, 4096) Histogram

value |------------------------------------------------------------------| count
    0 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@  |  4094
 4096 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|  4202
 8192 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@  |  4118
12288 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@   |  4059
16384 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@    |  3999
20480 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@  |  4083
24576 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@  |  4134
28672 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ |  4143
32768 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ |  4152
36864 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@   |  4033
40960 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@   |  4064
45056 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@   |  4012
49152 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@   |  4070
53248 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@  |  4090
57344 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@  |  4135
61440 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ |  4144
Total |------------------------------------------------------------------| 65532
We can see from these histograms that Ruby's rand function does a relatively good job of distributing returned values in the requested range.

h2. Examples

Here's an example of a "handy timing benchmark":http://gist.github.com/187669 implemented with aggregate.

h2. NOTES

Aggregate approximates log2 with log(x)/log(2), because Ruby versions before 1.9 have no log2 function built into Math. Theoretically the integer part of log(2^n - 1)/log(2) is n-1. Unfortunately, due to floating-point precision limitations, once n reaches a certain size (somewhere > 32) this starts to return n. The larger the value of n, the more numbers, i.e. (2^n - 2), (2^n - 3), etc., fall prey to this error. Something like BigDecimal could probably address this, but for the current purposes of the binary histogram, i.e. a simple coarse-grained view, the current implementation is sufficient.
aggregate-0.2.4/Rakefile000066400000000000000000000003141446211165100151070ustar00rootroot00000000000000
require 'rake'
require 'bundler/gem_tasks'
require 'rake/testtask'

Rake::TestTask.new do |t|
  t.libs << "test"
  t.test_files = FileList['test/ts_*.rb']
  t.verbose = true
end

task :default => :test
aggregate-0.2.4/aggregate.gemspec000066400000000000000000000022741446211165100167440ustar00rootroot00000000000000
# coding: utf-8
lib = File.expand_path('../lib', __FILE__)
$LOAD_PATH.unshift(lib) unless $LOAD_PATH.include?(lib)
require 'aggregate/version'

Gem::Specification.new do |s|
  s.name = %q{aggregate}
  s.version = Aggregate::VERSION

  s.required_rubygems_version = Gem::Requirement.new(">= 0") if s.respond_to? :required_rubygems_version=
  s.authors = ["Joseph Ruscio"]
  s.description = %q{Aggregate is a Ruby class for accumulating aggregate statistics and includes histogram support.
For a detailed README see: http://github.com/josephruscio/aggregate}
  s.email = %q{joe@ruscio.org}
  s.extra_rdoc_files = [ "LICENSE", "README.textile" ]
  s.files = Dir["{lib}/**/*.*", "LICENSE", "README.textile"]
  s.homepage = %q{http://github.com/josephruscio/aggregate}
  s.rdoc_options = ["--charset=UTF-8"]
  s.require_paths = ["lib"]
  s.license = "MIT"
  s.summary = %q{Aggregate is a Ruby class for accumulating aggregate statistics and includes histogram support}
  s.test_files = [ "test/ts_aggregate.rb" ]

  if s.respond_to? :specification_version then
    s.specification_version = 3

    if Gem::Version.new(Gem::VERSION) >= Gem::Version.new('1.2.0') then
    else
    end
  else
  end
end
aggregate-0.2.4/lib/000077500000000000000000000000001446211165100142125ustar00rootroot00000000000000
aggregate-0.2.4/lib/aggregate.rb000066400000000000000000000161731446211165100164750ustar00rootroot00000000000000
# Implements aggregate statistics and maintains
# configurable histogram for a set of given samples. Convenient for tracking
# high throughput data.
class Aggregate

  #The current number of samples
  attr_reader :count

  #The maximum sample value
  attr_reader :max

  #The minimum samples value
  attr_reader :min

  #The sum of all samples
  attr_reader :sum

  #The number of samples falling below the lowest valued histogram bucket
  attr_reader :outliers_low

  #The number of samples falling above the highest valued histogram bucket
  attr_reader :outliers_high

  # The number of buckets in the binary logarithmic histogram (low => 2**0, high => 2**@@LOG_BUCKETS)
  @@LOG_BUCKETS = 128

  # Create a new Aggregate that maintains a binary logarithmic histogram
  # by default. Specifying values for low, high, and width configures
  # the aggregate to maintain a linear histogram with (high - low)/width buckets
  def initialize(low=nil, high=nil, width=nil)
    @count = 0
    @sum = 0.0
    @sum2 = 0.0
    @outliers_low = 0
    @outliers_high = 0

    # If the user asks we maintain a linear histogram where
    # values in the range [low, high) are bucketed in multiples
    # of width
    if (nil != low && nil != high && nil != width)

      #Validate linear specification
      if high <= low
        raise ArgumentError, "High bucket must be > Low bucket"
      end

      if high - low < width
        raise ArgumentError, "Histogram width must be <= histogram range"
      end

      if 0 != (high - low).modulo(width)
        raise ArgumentError, "Histogram range (high - low) must be a multiple of width"
      end

      @low = low
      @high = high
      @width = width
    else
      @low = 1
      @width = nil
      @high = to_bucket(@@LOG_BUCKETS - 1)
    end

    #Initialize all buckets to 0
    @buckets = Array.new(bucket_count, 0)
  end

  # Include a sample in the aggregate
  def << data

    # Update min/max
    if 0 == @count
      @min = data
      @max = data
    else
      @max = data if data > @max
      @min = data if data < @min
    end

    # Update the running info
    @count += 1
    @sum += data
    @sum2 += (data * data)

    # Update the bucket
    @buckets[to_index(data)] += 1 unless outlier?(data)
  end

  #The current average of all samples
  def mean
    @sum / @count
  end

  #Calculate the standard deviation
  def std_dev
    Math.sqrt((@sum2.to_f - ((@sum.to_f * @sum.to_f)/@count.to_f)) / (@count.to_f - 1))
  end

  # Combine two aggregates
  #def +(b)
  #  a = self
  #  c = Aggregate.new

  #  c.count = a.count + b.count
  #end

  #Generate a pretty-printed ASCII representation of the histogram
  def to_s(columns=nil)

    #default to an 80 column terminal, don't support < 80 for now
    if nil == columns
      columns = 80
    else
      raise ArgumentError if columns < 80
    end

    #Find the largest bucket and create an array of the rows we intend to print
    disp_buckets = Array.new
    max_count = 0
    total = 0
    @buckets.each_with_index do |count, idx|
      next if 0 == count
      max_count = [max_count, count].max
      disp_buckets << [idx, to_bucket(idx), count]
      total += count
    end

    #XXX: Better to print just header --> footer
    return "Empty histogram" if 0 == disp_buckets.length

    #Figure out how wide the value and count columns need to be based on their
    #largest respective numbers
    value_str = "value"
    count_str = "count"
    total_str = "Total"
    value_width = [disp_buckets.last[1].to_s.length, value_str.length].max
    value_width = [value_width, total_str.length].max
    count_width = [total.to_s.length, count_str.length].max
    max_bar_width = columns - (value_width + " |".length + "| ".length + count_width)

    #Determine the value of a '@'
    weight = [max_count.to_f/max_bar_width.to_f, 1.0].max

    #format the header
    histogram = sprintf("%#{value_width}s |", value_str)
    max_bar_width.times { histogram << "-"}
    histogram << sprintf("| %#{count_width}s\n", count_str)

    # We denote empty buckets with a '~'
    def skip_row(value_width)
      sprintf("%#{value_width}s ~\n", " ")
    end

    #Loop through each bucket to be displayed and output the correct number
    prev_index = disp_buckets[0][0] - 1

    disp_buckets.each do |x|
      #Denote skipped empty buckets with a ~
      histogram << skip_row(value_width) unless prev_index == x[0] - 1
      prev_index = x[0]

      #Add the value
      row = sprintf("%#{value_width}d |", x[1])

      #Add the bar
      bar_size = (x[2]/weight).to_i
      bar_size.times { row += "@"}
      (max_bar_width - bar_size).times { row += " " }

      #Add the count
      row << sprintf("| %#{count_width}d\n", x[2])

      #Append the finished row onto the histogram
      histogram << row
    end

    #End the table
    histogram << skip_row(value_width) if disp_buckets.last[0] != bucket_count-1
    histogram << sprintf("%#{value_width}s", "Total")
    histogram << " |"
    max_bar_width.times {histogram << "-"}
    histogram << "| "
    histogram << sprintf("%#{count_width}d\n", total)
  end

  #Iterate through each bucket in the histogram regardless of
  #its contents
  def each
    @buckets.each_with_index do |count, index|
      yield(to_bucket(index), count)
    end
  end

  #Iterate through only the buckets in the histogram that contain
  #samples
  def each_nonzero
    @buckets.each_with_index do |count, index|
      yield(to_bucket(index), count) if count != 0
    end
  end

  private

  def linear?
    nil != @width
  end

  def outlier?(data)

    if data < @low
      @outliers_low += 1
    elsif data >= @high
      @outliers_high += 1
    else
      return false
    end
  end

  def bucket_count
    if linear?
      return (@high-@low)/@width
    else
      return @@LOG_BUCKETS
    end
  end

  def to_bucket(index)
    if linear?
      return @low + (index * @width)
    else
      return 2**(index)
    end
  end

  def right_bucket?(index, data)

    # check invariant
    raise unless linear?

    bucket = to_bucket(index)

    #It's the right bucket if data falls between bucket and next bucket
    bucket <= data && data < bucket + @width
  end

=begin
  def find_bucket(lower, upper, target)
    #Classic binary search
    return upper if right_bucket?(upper, target)

    # Cut the search range in half
    middle = (upper/2).to_i

    # Determine which half contains our value and recurse
    if (to_bucket(middle) >= target)
      return find_bucket(lower, middle, target)
    else
      return find_bucket(middle, upper, target)
    end
  end
=end

  # A data point is added to the bucket[n] where the data point
  # is less than the value represented by bucket[n], but greater
  # than the value represented by bucket[n+1]
  def to_index(data)

    # basic case is simple
    return log2(data).to_i if !linear?

    # Search for the right bucket in the linear case
    @buckets.each_with_index do |count, idx|
      return idx if right_bucket?(idx, data)
    end
    #find_bucket(0, bucket_count-1, data)
    #Should not get here
    raise "#{data}"
  end

  # log2(x) returns j, | i = j-1 and 2**i <= data < 2**j
  @@LOG2_DIVEDEND = Math.log(2)
  def log2( x )
    Math.log(x) / @@LOG2_DIVEDEND
  end

end

require_relative 'aggregate/version'
aggregate-0.2.4/lib/aggregate/000077500000000000000000000000001446211165100161405ustar00rootroot00000000000000
aggregate-0.2.4/lib/aggregate/version.rb000066400000000000000000000000471446211165100201530ustar00rootroot00000000000000
class Aggregate
  VERSION = "0.2.4"
end
aggregate-0.2.4/test/000077500000000000000000000000001446211165100144235ustar00rootroot00000000000000
aggregate-0.2.4/test/ts_aggregate.rb000066400000000000000000000072271446211165100174140ustar00rootroot00000000000000
require 'minitest/autorun'
require 'aggregate'

class SimpleStatsTest < MiniTest::Test

  def setup
    @stats = Aggregate.new

    @@DATA.each do |x|
      @stats << x
    end
  end

  def test_stats_count
    assert_equal @@DATA.length, @stats.count
  end

  def test_stats_min_max
    sorted_data = @@DATA.sort

    assert_equal sorted_data[0], @stats.min
    assert_equal sorted_data.last, @stats.max
  end

  def test_stats_mean
    sum = 0
    @@DATA.each do |x|
      sum += x
    end

    assert_equal sum.to_f/@@DATA.length.to_f, @stats.mean
  end

  def test_bucket_counts

    #Test each iterator
    total_bucket_sum = 0
    i = 0
    @stats.each do |bucket, count|
      assert_equal 2**i, bucket

      total_bucket_sum += count
      i += 1
    end

    assert_equal total_bucket_sum, @@DATA.length

    #Test each_nonzero iterator
    prev_bucket = 0
    total_bucket_sum = 0
    @stats.each_nonzero do |bucket, count|
      assert bucket > prev_bucket
      refute_equal count, 0

      total_bucket_sum += count
    end

    assert_equal total_bucket_sum, @@DATA.length
  end

=begin
  def test_addition
    stats1 = Aggregate.new
    stats2 = Aggregate.new

    stats1 << 1
    stats2 << 3

    stats_sum = stats1 + stats2

    assert_equal stats_sum.count, stats1.count + stats2.count
  end
=end

  #XXX: Update test_bucket_contents() if you muck with @@DATA
  @@DATA = [ 1, 5, 4, 6, 1028, 1972, 16384, 16385, 16383]
  def test_bucket_contents
    #XXX: This is the only test so far that cares about the actual contents
    # of @@DATA, so if you update that array ... update this method too
    expected_buckets = [1, 4, 1024, 8192, 16384]
    expected_counts = [1, 3, 2, 1, 2]

    i = 0
    @stats.each_nonzero do |bucket, count|
      assert_equal expected_buckets[i], bucket
      assert_equal expected_counts[i], count
      # Increment for the next test
      i += 1
    end
  end

  def test_histogram
    puts @stats.to_s
  end

  def test_outlier
    assert_equal 0, @stats.outliers_low
    assert_equal 0, @stats.outliers_high

    @stats << -1
    @stats << -2
    @stats << 0

    @stats << 2**128

    # This should be the last value in the last bucket, but Ruby's native
    # floats are not precise enough. Somewhere past 2^32 the log(x)/log(2)
    # breaks down. So it shows up as 128 (outlier) instead of 127
    #@stats << (2**128) - 1

    assert_equal 3, @stats.outliers_low
    assert_equal 1, @stats.outliers_high
  end

  def test_std_dev
    @stats.std_dev
  end
end

class LinearHistogramTest < MiniTest::Test
  def setup
    @stats = Aggregate.new(0, 32768, 1024)

    @@DATA.each do |x|
      @stats << x
    end
  end

  def test_validation

    # Range cannot be 0
    assert_raises(ArgumentError) { Aggregate.new(32,32,4) }

    # Range cannot be negative
    assert_raises(ArgumentError) { Aggregate.new(32,16,4) }

    # Range cannot be < single bucket
    assert_raises(ArgumentError) { Aggregate.new(16,32,17) }

    # Range % width must equal 0 (for now)
    assert_raises(ArgumentError) { Aggregate.new(1,16384,1024) }
  end

  #XXX: Update test_bucket_contents() if you muck with @@DATA
  # 32768 is an outlier
  @@DATA = [ 0, 1, 5, 4, 6, 1028, 1972, 16384, 16385, 16383, 32768]
  def test_bucket_contents
    #XXX: This is the only test so far that cares about the actual contents
    # of @@DATA, so if you update that array ... update this method too
    expected_buckets = [0, 1024, 15360, 16384]
    expected_counts = [5, 2, 1, 2]

    i = 0
    @stats.each_nonzero do |bucket, count|
      assert_equal expected_buckets[i], bucket
      assert_equal expected_counts[i], count
      # Increment for the next test
      i += 1
    end
  end

end