Finding Mean, Median, and Mode

Problem

You want to find the average of an array of numbers: its mean, median, or mode.

Solution

Usually when people speak of the "average" of a set of numbers they're referring to its mean, or arithmetic mean. The mean is the sum of the elements divided by the number of elements.

	def mean(array)
	 array.inject(0) { |sum, x| sum + x } / array.size.to_f
	end

	mean([1,2,3,4]) # => 2.5
	mean([100,100,100,100.1]) # => 100.025
	mean([-100, 100]) # => 0.0
	mean([3,3,3,3]) # => 3.0

The median is the item x such that half the items in the array are greater than x and the other half are less than x. Consider a sorted array: if it contains an odd number of elements, the median is the one in the middle. If the array contains an even number of elements, the median is defined as the mean of the two middle elements.

	def median(array, already_sorted=false)
	 return nil if array.empty?
	 array = array.sort unless already_sorted
	 m_pos = array.size / 2
	 return array.size % 2 == 1 ? array[m_pos] : mean(array[m_pos-1..m_pos])
	end

	median([1,2,3,4,5]) # => 3
	median([5,3,2,1,4]) # => 3
	median([1,2,3,4]) # => 2.5
	median([1,1,2,3,4]) # => 2
	median([2,3,-100,100]) # => 2.5
	median([1, 1, 10, 100, 1000]) # => 10

The mode is the single most popular item in the array. If a list contains no repeated items, it is not considered to have a mode. If an array contains multiple items at the maximum frequency, it is "multimodal." Depending on your application, you might handle each mode separately, or you might just pick one arbitrarily.

	def modes(array, find_all=true)
	 histogram = array.inject(Hash.new(0)) { |h, n| h[n] += 1; h }
	 modes = nil
	 histogram.each_pair do |item, times|
	  modes << item if modes && times == modes[0] and find_all
	  modes = [times, item] if (!modes && times>1) or (modes && times>modes[0])
	 end
	 return modes ? modes[1...modes.size] : modes
	end

	modes([1,2,3,4]) # => nil
	modes([1,1,2,3,4]) # => [1]
	modes([1,1,2,2,3,4]) # => [1, 2]
	modes([1,1,2,2,3,4,4]) # => [1, 2, 4]
	modes([1,1,2,2,3,4,4], false) # => [1]
	modes([1,1,2,2,3,4,4,4,4,4]) # => [4]
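
On Ruby 2.7 or later, the same histogram-and-maximum idea can be written more compactly with Enumerable#tally. The following is only a sketch of an alternative, not the recipe's own method; the name modes_with_tally is hypothetical, and it keeps the recipe's conventions (nil when nothing repeats, a one-element array when find_all is false).

```ruby
# Hypothetical alternative to the recipe's modes method.
# Assumes Ruby 2.7+ for Enumerable#tally.
def modes_with_tally(array, find_all = true)
  counts = array.tally                  # e.g. {1=>2, 2=>2, 3=>1, 4=>2}
  max = counts.values.max
  # As in the recipe, an array with no repeated items has no mode.
  return nil if max.nil? || max < 2
  # Hashes keep insertion order, so winners appear in first-seen order.
  winners = counts.select { |_item, times| times == max }.keys
  find_all ? winners : winners.first(1)
end

modes_with_tally([1,2,3,4])               # => nil
modes_with_tally([1,1,2,2,3,4,4])         # => [1, 2, 4]
modes_with_tally([1,1,2,2,3,4,4], false)  # => [1]
modes_with_tally(%w{a b b c})             # => ["b"]
```

Because it relies only on hash keys, this version works on arrays of arbitrary objects, not just numbers.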

 

Discussion

The mean is the most popular type of average. It's simple to calculate and to understand. The implementation of mean given above always returns a floating-point number object. It's a good general-purpose implementation because it lets you pass in an array of Fixnums and get a fractional average, instead of one rounded down to an integer (which is what integer division would give you). If you want to find the mean of an array of BigDecimal or Rational objects, you should use an implementation of mean that omits the final to_f call:

	def mean_without_float_conversion(array)
	 array.inject(0) { |sum, x| sum + x } / array.size
	end
	require 'rational' # needed only on Ruby 1.8; Rational is built in on later versions
	numbers = [Rational(2,3), Rational(3,4), Rational(6,7)]
	mean(numbers)
	# => 0.757936507936508
	mean_without_float_conversion(numbers) 
	# => Rational(191, 252) 

The median is mainly useful when a small proportion of outliers in the dataset would make the mean misleading. For instance, government statistics usually show "median household income" instead of "mean household income." Otherwise, a few super-wealthy households would make everyone else look much richer than they are. The example below demonstrates how the mean can be skewed by a few very high or very low outliers.

	mean([1, 100, 100000]) # => 33367.0
	median([1, 100, 100000]) # => 100

	mean([1, 100, -1000000]) # => -333299.666666667
	median([1, 100, -1000000]) # => 1

The mode is the only definition of "average" that can be applied to arrays of arbitrary objects. Since the mean is calculated using arithmetic, an array can only be said to have a mean if all of its members are numeric. The median involves only comparisons, except when the array contains an even number of elements: then, calculating the median requires that you calculate the mean.

If you defined some other way to take the median of an array with an even number of elements, you could take the median of any array of strings. As written, median works on strings only when the array contains an odd number of elements:

	median(["a", "z", "b", "l", "m", "j", "b"])
	# => "j"
	median(["a", "b", "c", "d"])
	# TypeError: String can't be coerced into Fixnum
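
One possible "other way" is to return the lower of the two middle elements, which needs only comparisons and never arithmetic. This is a minimal sketch under that assumption; the name lower_median is hypothetical, and choosing the lower element (rather than the upper) is an arbitrary convention.

```ruby
# Hypothetical helper: a median that relies only on sorting (<=>).
# For an even-sized array it returns the lower of the two middle elements.
def lower_median(array, already_sorted = false)
  return nil if array.empty?
  array = array.sort unless already_sorted
  array[(array.size - 1) / 2]
end

lower_median(["a", "z", "b", "l", "m", "j", "b"])  # => "j"
lower_median(["a", "b", "c", "d"])                  # => "b"
lower_median([1, 2, 3, 4])                          # => 2
```

Note that for numeric arrays with an even number of elements this gives a different answer than the median method in the Solution, which averages the two middle elements.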

 

The standard deviation

A concept related to the mean is the standard deviation, a quantity that measures how close the dataset as a whole is to the mean. When a mean is distorted by high or low outliers, the corresponding standard deviation is high. When the numbers in a dataset cluster closely around the mean, the standard deviation is low. You won't be fooled by a misleading mean if you also look at the standard deviation.

	def mean_and_standard_deviation(array)
	 m = mean(array)
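	 # Sum of squared differences from the mean; dividing by (size - 1)
	 # below gives the sample standard deviation.
	 sum_of_squares = array.inject(0) { |sum, x| sum + (x - m) ** 2 }
	 return m, Math.sqrt(sum_of_squares/(array.size-1))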
	end

	#All the items in the list are close to the mean, so the standard
	#deviation is low. 
	mean_and_standard_deviation([1,2,3,1,1,2,1])
	# => [1.57142857142857, 0.786795792469443]
	#The outlier increases the mean, but also increases the standard deviation.
	mean_and_standard_deviation([1,2,3,1,1,2,1000])
	# => [144.285714285714, 377.33526837801]

A good rule of thumb, which holds when the data is roughly normally distributed, is that about two-thirds (68 percent) of the items in a dataset are within one standard deviation of the mean, and almost all (about 95 percent) of the items are within two standard deviations of the mean.
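
To see how closely a particular dataset follows that rule of thumb, you can count the items that fall within k standard deviations of the mean. The sketch below is self-contained; the name fraction_within is hypothetical, it uses the same sample standard deviation (dividing by size - 1) as above, and it assumes an array of at least two numbers.

```ruby
# Hypothetical helper: the fraction of items lying within k sample
# standard deviations of the mean. Assumes at least two numeric items.
def fraction_within(array, k = 1)
  m = array.inject(0) { |sum, x| sum + x } / array.size.to_f
  sum_of_squares = array.inject(0) { |sum, x| sum + (x - m) ** 2 }
  sd = Math.sqrt(sum_of_squares / (array.size - 1))
  array.count { |x| (x - m).abs <= k * sd } / array.size.to_f
end

data = [2, 4, 4, 4, 5, 5, 7, 9]
fraction_within(data, 1)  # => 0.75
fraction_within(data, 2)  # => 1.0
```

A small dataset like this one will rarely land exactly on 68 and 95 percent; the rule describes large, bell-shaped datasets.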

See Also

  • "Programmers Need to Learn Statistics or I Will Kill Them All," by Zed Shaw (http://www.zedshaw.com/blog/programming/programmer_stats.html)
  • More Ruby implementations of simple statistical measures (http://dada.perl.it/shootout/moments.ruby.html)
  • To do more complex statistical analysis in Ruby, try the Ruby bindings to the GNU Scientific Library (http://ruby-gsl.sourceforge.net/)
  • The Stats class in the Mongrel web server (http://mongrel.rubyforge.org) implements other algorithms for calculating mean and standard deviation, which are faster if you need to repeatedly calculate the mean of a growing series
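
For a growing series, recomputing the mean and standard deviation from scratch after every new item is wasteful. One well-known incremental approach is Welford's online algorithm, sketched below. This is not Mongrel's Stats class or its API; the RunningStats class and its method names are hypothetical, chosen only to illustrate the idea.

```ruby
# Hypothetical class (not Mongrel's Stats): maintains a running mean and
# sample standard deviation using Welford's online algorithm, so each new
# item costs constant time.
class RunningStats
  attr_reader :count, :mean

  def initialize
    @count = 0
    @mean = 0.0
    @m2 = 0.0 # running sum of squared differences from the current mean
  end

  # Add one number to the series and update the running totals.
  def <<(x)
    @count += 1
    delta = x - @mean
    @mean += delta / @count
    @m2 += delta * (x - @mean)
    self
  end

  # Sample standard deviation (divides by count - 1); nil until there
  # are at least two items.
  def standard_deviation
    return nil if @count < 2
    Math.sqrt(@m2 / (@count - 1))
  end
end

stats = RunningStats.new
[1,2,3,1,1,2,1].each { |x| stats << x }
stats.mean                # approximately 1.5714
stats.standard_deviation  # approximately 0.7868
```

Up to floating-point rounding, the results match what mean_and_standard_deviation returns for the same array.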





Ruby Cookbook
Ruby Cookbook (Cookbooks (OReilly))
ISBN: 0596523696
Pages: 399
