racket/collects/math/scribblings/math-statistics.scrbl

#lang scribble/manual

@(require scribble/eval
          racket/sandbox
          (for-label racket/base racket/promise racket/list
                     math plot
                     (only-in typed/racket/base
                              ann inst
                              Flonum Real Boolean Any Listof Integer case-> -> U
                              Sequenceof Positive-Flonum Nonnegative-Flonum Symbol
                              HashTable Positive-Integer Nonnegative-Real Values
                              String))
          "utils.rkt")

@(define typed-eval (make-math-eval))
@interaction-eval[#:eval typed-eval (require)]

@title[#:tag "stats"]{Statistics Functions}
@(author-neil)

@defmodule[math/statistics]

This module exports functions that compute @deftech{statistics}, meaning summary values for
collections of samples, and functions for managing sequences of weighted or unweighted samples.

Most of the functions that compute statistics accept a sequence of nonnegative reals that
correspond one-to-one with sample values.
These are used as weights, or equivalently as counts, pseudocounts or unnormalized probabilities.
While this makes it easy to work with weighted samples, it introduces some subtleties
in bias correction.
In particular, central moments must be computed without bias correction by default.
See @secref{stats:expected-values} for a discussion.

@local-table-of-contents[]

@section[#:tag "stats:expected-values"]{Expected Values}

Functions documented in this section that compute higher central moments, such as @racket[variance],
@racket[stddev] and @racket[skewness], can optionally apply bias correction to their estimates.
For example, when @racket[variance] is given the argument @racket[#:bias #t], it
multiplies the result by @racket[(/ n (- n 1))], where @racket[n] is the number of samples.
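As a small sketch of that factor (the five samples are arbitrary), scaling the uncorrected estimate
by @racket[(/ 5 4)] by hand should give the same value as passing @racket[#:bias #t]:
@interaction[#:eval typed-eval
                    (variance '(1 2 3 4 4))
                    (* (variance '(1 2 3 4 4)) (/ 5 4))
                    (variance '(1 2 3 4 4) #:bias #t)]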

The meaning of ``bias correction'' becomes less clear with weighted samples, however. Often, the
weights represent counts, so when moment-estimating functions receive @racket[#:bias #t], they
interpret it as ``use the sum of @racket[ws] for @racket[n].''
In the following example, the sample @racket[4] is first counted twice and then given weight
@racket[2]; therefore @racket[n = 5] in both cases:
@interaction[#:eval typed-eval
                    (variance '(1 2 3 4 4) #:bias #t)
                    (variance '(1 2 3 4) '(1 1 1 2) #:bias #t)]

However, sample weights often do not represent counts. For these cases, the @racket[#:bias]
keyword can be followed by a real-valued pseudocount, which is used for @racket[n]:
@interaction[#:eval typed-eval
                    (variance '(1 2 3 4) '(1/2 1/2 1/2 1) #:bias 5)]

Because the magnitude of the bias correction for weighted samples cannot be known without user
guidance, the bias argument defaults to @racket[#f] in all cases.

@defproc[(mean [xs (Sequenceof Real)] [ws (U #f (Sequenceof Real)) #f]) Real]{
When @racket[ws] is @racket[#f] (the default), returns the sample mean of the values in @racket[xs].
Otherwise, returns the weighted sample mean of the values in @racket[xs] with corresponding weights
@racket[ws].
@examples[#:eval typed-eval
                 (mean '(1 2 3 4 5))
                 (mean '(1 2 3 4 5) '(1 1 1 1 10.0))
                 (define d (normal-dist))
                 (mean (sample d 10000))
                 (define arr (array-strict (build-array #(5 1000) (λ (_) (sample d)))))
                 (array-map mean (array->list-array arr 1))]
}

@deftogether[(@defproc[(variance [xs (Sequenceof Real)]
                                 [ws (U #f (Sequenceof Real)) #f]
                                 [#:bias bias (U #t #f Real) #f])
                       Nonnegative-Real]
              @defproc[(stddev [xs (Sequenceof Real)]
                               [ws (U #f (Sequenceof Real)) #f]
                               [#:bias bias (U #t #f Real) #f])
                       Nonnegative-Real]
              @defproc[(skewness [xs (Sequenceof Real)]
                                 [ws (U #f (Sequenceof Real)) #f]
                                 [#:bias bias (U #t #f Real) #f])
                       Real]
              @defproc[(kurtosis [xs (Sequenceof Real)]
                                 [ws (U #f (Sequenceof Real)) #f]
                                 [#:bias bias (U #t #f Real) #f])
                       Nonnegative-Real])]{
If @racket[ws] is @racket[#f], these compute the sample variance, standard deviation, skewness
and excess kurtosis of the samples in @racket[xs].
If @racket[ws] is not @racket[#f], they compute weighted variations of the same.
@examples[#:eval typed-eval
                 (stddev '(1 2 3 4 5))
                 (stddev '(1 2 3 4 5) '(1 1 1 1 10))]
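As a further sketch, assuming nothing beyond the signatures above, the higher moments can be
queried the same way; for data that is symmetric about its mean, the skewness should be zero:
@examples[#:eval typed-eval
                 (skewness '(1 2 3 4 5))
                 (kurtosis '(1 2 3 4 5))]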
See @secref{stats:expected-values} for the meaning of the @racket[bias] keyword argument.
}

@deftogether[(@defproc[(variance/mean [mean Real]
                                      [xs (Sequenceof Real)]
                                      [ws (U #f (Sequenceof Real)) #f]
                                      [#:bias bias (U #t #f Real) #f])
                       Nonnegative-Real]
              @defproc[(stddev/mean [mean Real]
                                    [xs (Sequenceof Real)]
                                    [ws (U #f (Sequenceof Real)) #f]
                                    [#:bias bias (U #t #f Real) #f])
                       Nonnegative-Real]
              @defproc[(skewness/mean [mean Real]
                                      [xs (Sequenceof Real)]
                                      [ws (U #f (Sequenceof Real)) #f]
                                      [#:bias bias (U #t #f Real) #f])
                       Real]
              @defproc[(kurtosis/mean [mean Real]
                                      [xs (Sequenceof Real)]
                                      [ws (U #f (Sequenceof Real)) #f]
                                      [#:bias bias (U #t #f Real) #f])
                       Nonnegative-Real])]{
Like @racket[variance], @racket[stddev], @racket[skewness] and @racket[kurtosis], but computed
using known mean @racket[mean].
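
As a sketch of typical use (the names @racket[zs] and @racket[zs-mean] are only for this
illustration), a mean computed once can be reused for several moments, and the results should match
those of the functions that compute the mean themselves:
@examples[#:eval typed-eval
                 (define zs '(1 2 3 4 5))
                 (define zs-mean (mean zs))
                 (variance/mean zs-mean zs)
                 (variance zs)
                 (stddev/mean zs-mean zs #:bias #t)]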
}

@section[#:tag "stats:running"]{Running Expected Values}

The @racket[statistics] object allows computing the sample minimum, maximum, count, mean, variance,
skewness, and excess kurtosis of a sequence of samples in O(1) space.

To use it, start with @racket[empty-statistics], then use @racket[update-statistics] to obtain a
new statistics object with updated values. Use @racket[statistics-min], @racket[statistics-mean],
and similar functions to get the current estimates.
@examples[#:eval typed-eval
                 (let* ([s  empty-statistics]
                        [s  (update-statistics s 1)]
                        [s  (update-statistics s 2)]
                        [s  (update-statistics s 3)]
                        [s  (update-statistics s 4 2)])
                   (values (statistics-mean s)
                           (statistics-stddev s #:bias #t)))]

@defstruct*[statistics ([min Flonum]
                        [max Flonum]
                        [count Nonnegative-Flonum])]{
Represents running statistics.

The @racket[min] and @racket[max] fields are the minimum and maximum
value observed so far, and the @racket[count] field is the total weight of the samples (which is the
number of samples if all samples are unweighted).
The remaining, hidden fields are used to compute moments, and their number and meaning may change in
future releases.
}

@defthing[empty-statistics statistics]{
The empty statistics object.
@examples[#:eval typed-eval
                 (statistics-min empty-statistics)
                 (statistics-max empty-statistics)
                 (statistics-range empty-statistics)
                 (statistics-count empty-statistics)
                 (statistics-mean empty-statistics)
                 (statistics-variance empty-statistics)
                 (statistics-skewness empty-statistics)
                 (statistics-kurtosis empty-statistics)]
}

@defproc[(update-statistics [s statistics] [x Real] [w Real 1.0]) statistics]{
Returns a new statistics object that includes @racket[x] in the computed statistics. If @racket[w]
is given, @racket[x] is weighted by @racket[w] in the moment computations.
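
As an illustrative sketch (@racket[s1] and @racket[s2] are names chosen for this example), a
statistics object built from earlier samples can be extended one sample at a time, and its mean
should agree with @racket[mean] on the same weighted samples, up to floating-point rounding:
@examples[#:eval typed-eval
                 (define s1 (update-statistics* empty-statistics '(1 2 3)))
                 (define s2 (update-statistics s1 4 2))
                 (statistics-mean s2)
                 (mean '(1 2 3 4) '(1 1 1 2))]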
}

@defproc[(update-statistics* [s statistics]
                             [xs (Sequenceof Real)]
                             [ws (U #f (Sequenceof Real)) #f])
         statistics]{
Like @racket[update-statistics], but includes all of @racket[xs], possibly weighted by corresponding
elements in @racket[ws], in the returned statistics object.
@examples[#:eval typed-eval
                 (define s (update-statistics* empty-statistics '(1 2 3 4) '(1 1 1 2)))
                 (statistics-mean s)
                 (statistics-stddev s #:bias #t)]
This function uses O(1) space regardless of the length of @racket[xs].
}

@deftogether[(@defproc[(statistics-range [s statistics]) Nonnegative-Flonum]
              @defproc[(statistics-mean [s statistics]) Flonum]
              @defproc[(statistics-variance [s statistics] [#:bias bias (U #t #f Real) #f])
                       Nonnegative-Flonum]
              @defproc[(statistics-stddev [s statistics] [#:bias bias (U #t #f Real) #f])
                       Nonnegative-Flonum]
              @defproc[(statistics-skewness [s statistics] [#:bias bias (U #t #f Real) #f])
                       Flonum]
              @defproc[(statistics-kurtosis [s statistics] [#:bias bias (U #t #f Real) #f])
                       Nonnegative-Flonum])]{
Compute the range, mean, variance, standard deviation, skewness, and excess kurtosis of the
observations summarized in @racket[s].

See @secref{stats:expected-values} for the meaning of the @racket[bias] keyword argument.
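
A brief sketch, using an arbitrary list of eight samples and the hypothetical name @racket[st]:
@examples[#:eval typed-eval
                 (define st (update-statistics* empty-statistics '(2 4 4 4 5 5 7 9)))
                 (statistics-range st)
                 (statistics-mean st)
                 (statistics-stddev st)
                 (statistics-stddev st #:bias #t)]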
}

@section{Correlation}

@deftogether[(@defproc[(covariance [xs (Sequenceof Real)]
                                   [ys (Sequenceof Real)]
                                   [ws (U #f (Sequenceof Real)) #f]
                                   [#:bias bias (U #t #f Real) #f])
                       Real]
              @defproc[(correlation [xs (Sequenceof Real)]
                                    [ys (Sequenceof Real)]
                                    [ws (U #f (Sequenceof Real)) #f]
                                    [#:bias bias (U #t #f Real) #f])
                       Real])]{
Compute the sample covariance and correlation of @racket[xs] and @racket[ys], optionally
weighted by @racket[ws].

@examples[#:eval typed-eval
                 (define xs (sample (normal-dist) 10000))
                 (define ys (map (λ: ([x : Real]) (sample (normal-dist x))) xs))
                 (correlation xs ys)]
Removing the correlation using importance weights:
@interaction[#:eval typed-eval
                    (define ws (map (λ: ([x : Real] [y : Real])
                                      (/ (pdf (normal-dist) y)
                                         (pdf (normal-dist x) y)))
                                    xs ys))
                    (correlation xs ys (ann ws (Sequenceof Real)))]

See @secref{stats:expected-values} for the meaning of the @racket[bias] keyword argument.
}

@deftogether[(@defproc[(covariance/means [μx Real]
                                         [μy Real]
                                         [xs (Sequenceof Real)]
                                         [ys (Sequenceof Real)]
                                         [ws (U #f (Sequenceof Real)) #f]
                                         [#:bias bias (U #t #f Real) #f])
                       Real]
              @defproc[(correlation/means [μx Real]
                                          [μy Real]
                                          [xs (Sequenceof Real)]
                                          [ys (Sequenceof Real)]
                                          [ws (U #f (Sequenceof Real)) #f]
                                          [#:bias bias (U #t #f Real) #f])
                       Real])]{
Like @racket[covariance] and @racket[correlation], but computed using known means
@racket[μx] and @racket[μy].
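
A short sketch with two arbitrary lists (@racket[us] and @racket[vs] are names used only here);
supplying the sample means explicitly should reproduce the result of @racket[covariance]:
@examples[#:eval typed-eval
                 (define us '(1 2 3 4))
                 (define vs '(2 4 6 9))
                 (covariance/means (mean us) (mean vs) us vs)
                 (covariance us vs)]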
}

@section{Counting and Binning}

@defproc*[([(samples->hash [xs (Sequenceof A)]) (HashTable A Positive-Integer)]
           [(samples->hash [xs (Sequenceof A)] [ws (U #f (Sequenceof Real))])
            (HashTable A Nonnegative-Real)])]{
Returns a hash table mapping each unique element in @racket[xs] (under @racket[equal?]) to its
count, or, if @racket[ws] is not @racket[#f], to its total weight.
@examples[#:eval typed-eval
                 (samples->hash '(1 2 3 4 4))
                 (samples->hash '(1 1 2 3 4) '(1/2 1/2 1 1 2))]
}

@defproc*[([(count-samples [xs (Sequenceof A)]) (Values (Listof A) (Listof Positive-Integer))]
           [(count-samples [xs (Sequenceof A)] [ws (U #f (Sequenceof Real))])
            (Values (Listof A) (Listof Nonnegative-Real))])]{
Like @racket[samples->hash], but returns two lists.
The elements in the returned @racket[(Listof A)] are in order of first appearance in @racket[xs].
@examples[#:eval typed-eval
                 (count-samples '(1 2 3 4 4))
                 (count-samples '(1 1 2 3 4) '(1/2 1/2 1 1 2))]
}

@defstruct*[sample-bin ([min B]
                        [max B]
                        [values (Listof A)]
                        [weights (U #f (Listof Nonnegative-Real))])]{
Represents a @italic{bin}, or a group of samples within an interval in a total order.
The values and bounds may have different types, which allows @racket[bin-samples/key]
to group elements based on a function of their values.
}

@defproc[(bin-samples [bounds (Sequenceof A)]
                      [lte? (A A -> Any)]
                      [xs (Sequenceof A)]
                      [ws (U #f (Sequenceof Real))])
         (Listof (sample-bin A A))]{
Similar to @racket[(sort xs lte?)], but additionally groups samples into bins.
The bins' @racket[bounds] are sorted before binning @racket[xs].

If @racket[n = (length bounds)], then @racket[bin-samples] returns @italic{at least} @racket[(- n 1)]
bins, one for each pair of adjacent (sorted) bounds.
If some values in @racket[xs] are less than the smallest bound, they are grouped into a single bin in
front.
If some are greater than the largest bound, they are grouped into a single bin at the end.

@examples[#:eval typed-eval
                 (bin-samples '() <= '(0 1 2 3 4 5 6))
                 (bin-samples '(3) <= '(0 1 2 3 4 5 6))
                 (bin-samples '(2 4) <= '(0 1 2 3 4 5 6))
                 (bin-samples '(2 4) <=
                              '(0 1 2 3 4 5 6)
                              '(10 20 30 40 50 60 70))]

If @racket[lte?] is a less-than-or-equal relation, the bins represent half-open intervals
(@racket[min], @racket[max]] (except possibly the first, which may be closed).
If @racket[lte?] is a less-than relation, the bins represent half-open intervals
[@racket[min], @racket[max]) (except possibly the last, which may be closed).
In either case, the sorts applied to @racket[bounds] and @racket[xs] are stable.
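To see where a sample equal to a bound ends up, the two kinds of relation can be compared on the
same data. This is only a sketch with arbitrary bounds and samples:
@examples[#:eval typed-eval
                 (bin-samples '(2 4) <= '(1 2 3 4 5))
                 (bin-samples '(2 4) < '(1 2 3 4 5))]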

Because intervals used in probability measurements are normally open on the left, prefer to use
less-than-or-equal relations for @racket[lte?].

If @racket[ws] is @racket[#f], @racket[bin-samples] returns bins with @racket[#f] weights.
}

@defproc[(bin-samples/key [bounds (Sequenceof B)]
                          [lte? (B B -> Any)]
                          [key (A -> B)]
                          [xs (Sequenceof A)]
                          [ws (U #f (Sequenceof Real))])
         (Listof (sample-bin A B))]{
Similar to @racket[(sort xs lte? #:key key #:cache-keys? #t)], but additionally groups samples into
bins.
@examples[#:eval typed-eval
                 (bin-samples/key '(2 4) <= (inst car Real String)
                                  '((1 . "1") (2 . "2") (3 . "3") (4 . "4") (5 . "5")))]
See @racket[bin-samples] for the simpler, one-type variant.
}

@defproc[(sample-bin-compact [bin (sample-bin A B)]) (sample-bin A B)]{
Compacts @racket[bin] by applying @racket[count-samples] to its values and weights.
@examples[#:eval typed-eval
                 (sample-bin-compact (sample-bin 1 4 '(1 2 3 4 4) #f))]
}

@defproc[(sample-bin-total [bin (sample-bin A B)]) Nonnegative-Real]{
If @racket[(sample-bin-weights bin)] is @racket[#f], returns the number of samples in @racket[bin];
otherwise, returns the sum of their weights.
@examples[#:eval typed-eval
                 (sample-bin-total (sample-bin 1 4 '(1 2 3 4 4) #f))
                 (sample-bin-total (sample-bin-compact (sample-bin 1 4 '(1 2 3 4 4) #f)))]
}

@section{Order Statistics}

@defproc*[([(sort-samples [lt? (A A -> Any)] [xs (Sequenceof A)]) (Listof A)]
           [(sort-samples [lt? (A A -> Any)]
                          [xs (Sequenceof A)]
                          [ws (U #f (Sequenceof Real))])
            (Values (Listof A) (Listof Nonnegative-Real))])]{
Sorts possibly weighted samples according to @racket[lt?], which is assumed to define a total
order over the elements in @racket[xs].
@examples[#:eval typed-eval
                 (sort-samples < '(5 2 3 1))
                 (sort-samples < '(5 2 3 1) '(50 20 30 10))
                 (sort-samples < '(5 2 3 1) #f)]
Because @racket[sort-samples] is defined in terms of @racket[sort], the sort is only guaranteed
to be stable if @racket[lt?] is strictly a less-than relation.
}

@defproc[(median [lt? (A A -> Any)] [xs (Sequenceof A)] [ws (U #f (Sequenceof Real)) #f]) A]{
Equivalent to @racket[(quantile 1/2 lt? xs ws)].
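
A short sketch with arbitrary samples; the weighted call reuses the weights from the
@racket[quantile] examples:
@examples[#:eval typed-eval
                 (median < '(5 2 3 1 4))
                 (median < '(1 2 3 4) '(0.25 0.20 0.20 0.35))]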
}

@defproc[(quantile [p Real]
                   [lt? (A A -> Any)]
                   [xs (Sequenceof A)]
                   [ws (U #f (Sequenceof Real)) #f])
         A]{
Computes the inverse of the empirical @tech{cdf} represented by the samples @racket[xs],
which are optionally weighted by @racket[ws].

@examples[#:eval typed-eval
                 (quantile 0 < '(1 3 5))
                 (quantile 0.5 < '(1 2 3 4))
                 (quantile 0.5 < '(1 2 3 4) '(0.25 0.20 0.20 0.35))]

If @racket[p = 0], @racket[quantile] returns the smallest element of @racket[xs] under the
ordering relation @racket[lt?]. If @racket[p = 1], it returns the largest element.

For weighted samples, @racket[quantile] sorts @racket[xs] and @racket[ws] together
(using @racket[sort-samples]), then finds the least @racket[x] for which the proportion of its
cumulative weight is greater than or equal to @racket[p].

For unweighted samples, @racket[quantile] uses the quickselect algorithm to find the element that
would be at index @racket[(ceiling (- (* p n) 1))] if @racket[xs] were sorted, where @racket[n]
is the length of @racket[xs].
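
As a worked sketch of that index computation, take @racket[n = 4] and @racket[p = 3/4]: the index
is @racket[2], so the result should be the third-smallest sample. The sample values are arbitrary:
@examples[#:eval typed-eval
                 (ceiling (- (* 3/4 4) 1))
                 (quantile 3/4 < '(40 10 30 20))
                 (quantile 1/4 < '(40 10 30 20))]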
}

@defproc[(absdev [xs (Sequenceof Real)] [ws (U #f (Sequenceof Real)) #f]) Nonnegative-Real]{
Computes the average absolute difference between the elements in @racket[xs] and
@racket[(median < xs ws)]. If @racket[ws] is not @racket[#f], it is a weighted average.
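
A minimal sketch with one outlier among otherwise small, arbitrary samples, shown next to
@racket[stddev] for comparison:
@examples[#:eval typed-eval
                 (absdev '(1 2 3 4 100))
                 (stddev '(1 2 3 4 100))]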
}

@defproc[(absdev/median [median Real] [xs (Sequenceof Real)] [ws (U #f (Sequenceof Real)) #f])
         Nonnegative-Real]{
Like @racket[(absdev xs ws)], but computed using known median @racket[median].
}

@(close-eval typed-eval)