Draft:Random Group Formation Distribution

From Wikipedia, the free encyclopedia


In probability theory and statistics, the Random Group Formation distribution (RGF distribution) is a heavy-tailed, fat-tailed probability distribution. It describes the number of individuals in each group when N individuals are placed into M groups.[1]

Definition

Many real-world samples appear to follow a power-law distribution and arise from N individuals being placed into M groups. One example is the number of people in each county. Another is the number of people working at each company. A third, from a different field, is the words of a document grouped by word (e.g., the word "the" occurs 10 times, "of" occurs 5 times, and so on).
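The word example can be made concrete with a short sketch. The function name `group_size_counts` and the sample sentence are hypothetical and chosen only for illustration; the code treats each word occurrence as an individual and each distinct word as a group, then tallies how many groups have each size.

```python
from collections import Counter

def group_size_counts(text):
    """Return (N, M, sizes) for a text, where N is the number of word
    occurrences, M is the number of distinct words, and sizes maps a
    group size to the number of distinct words that occur that often."""
    words = text.lower().split()
    occurrences = Counter(words)                  # word -> how often it occurs
    size_counts = Counter(occurrences.values())   # group size -> number of groups
    return len(words), len(occurrences), dict(sorted(size_counts.items()))

n_individuals, n_groups, sizes = group_size_counts(
    "the cat and the dog and the bird"
)
print(n_individuals, n_groups, sizes)  # 8 5 {1: 3, 2: 1, 3: 1}
```

Here "the" forms a group of size 3, "and" a group of size 2, and the remaining three words form groups of size 1.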

Many other fat-tailed distributions have been used to fit such samples. The RGF distribution instead defines the information contained in the grouping and selects the distribution with the minimum information cost.

Baek, Bernhardsson, and Minnhagen define the information in the grouping as:[1]

$$I = \sum_{k} P(k)\,\ln\bigl(k\,N(k)\bigr)$$

where $k$ ranges over the number of members in a group, $P(k)$ is the probability of an individual being in a group of size $k$, $\ln$ is the natural logarithm, and $N(k)$ is the number of groups with $k$ members.
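As a sketch of how a minimum-information distribution can be obtained from such a definition, the block below minimizes an information cost of the form $\sum_k P(k)\ln(k\,N(k))$ using Lagrange multipliers. It assumes $P(k) = N(k)/M$ (the share of groups that have $k$ members) and two constraints, normalization and a mean group size of $N/M$. The multipliers $a$ and $b$ and both constraints are assumptions made for illustration, not details stated in the text above.

```latex
\begin{aligned}
\mathcal{L} &= \sum_k P(k)\,\ln\bigl(k\,M\,P(k)\bigr)
  + a\Bigl(\sum_k P(k) - 1\Bigr)
  + b\Bigl(\sum_k k\,P(k) - \frac{N}{M}\Bigr), \\
\frac{\partial \mathcal{L}}{\partial P(k)} &= \ln\bigl(k\,M\,P(k)\bigr) + 1 + a + b\,k = 0 .
\end{aligned}
```

Under these assumptions the stationarity condition is linear in $\ln P(k)$, so exponentiating it isolates $P(k)$, and $a$ and $b$ are then fixed by the two constraints.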

The resulting distribution is:

$$P(k) = A\,\frac{e^{-bk}}{k}$$

where $A$ and $b$ are constants obtained by solving a Lagrangian equation for the given $N$ and $M$.
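A minimal numerical sketch of finding the two constants follows. It assumes the distribution has the form $P(k) = A e^{-bk}/k$ on $k = 1, \dots, N$, and that the two conditions to satisfy are normalization and a mean group size of $N/M$; both the constraints and the function name `solve_rgf_constants` are assumptions for illustration, not the procedure given by Baek, Bernhardsson, and Minnhagen.

```python
import math

def solve_rgf_constants(n_individuals, n_groups, tol=1e-12):
    """Return (A, b) for the assumed form P(k) = A * exp(-b*k) / k, k = 1..N.

    Assumed constraints (illustrative, not taken from the source):
      sum_k P(k) = 1   and   sum_k k * P(k) = N / M.
    """
    sizes = range(1, n_individuals + 1)
    target_mean = n_individuals / n_groups

    def mean_size(b):
        # Mean group size for a given b; the constant A cancels in the ratio.
        weights = [math.exp(-b * k) / k for k in sizes]
        return sum(k * w for k, w in zip(sizes, weights)) / sum(weights)

    lo, hi = 0.0, 50.0
    if not (mean_size(hi) < target_mean < mean_size(lo)):
        raise ValueError("N / M is outside the range this sketch can solve")

    # The mean group size decreases as b grows, so bisection on b converges.
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if mean_size(mid) > target_mean:
            lo = mid
        else:
            hi = mid
    b = 0.5 * (lo + hi)

    # A follows from normalization once b is known.
    a_const = 1.0 / sum(math.exp(-b * k) / k for k in sizes)
    return a_const, b

A, b = solve_rgf_constants(n_individuals=1000, n_groups=200)
print(A, b)
```

Bisection is used here only because it needs no external libraries; any one-dimensional root finder would serve the same purpose.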

This distribution does not fit all real-world samples. Baek, Bernhardsson, and Minnhagen generalize the definition by allowing some ordering within the grouping. This is done with a function that computes the discounted entropy as a function of the distribution. In practice, this function does not need to be calculated: the size of the largest group is sufficient to fit the discounted entropy. The resulting distribution is:

$$P(k) = A\,\frac{e^{-bk}}{k^{\gamma}}$$

where the additional constant $\gamma$ is determined by the size of the largest group.

The RGF distribution is a maximum entropy distribution. Other maximum entropy distributions include the normal distribution (when the mean and variance are known), the exponential distribution, and the Laplace distribution.

Matt Visser proposed a similar distribution.[2] It is a maximum entropy distribution that yields a power-law distribution under a simpler constraint: the average of the logarithm of the observable, $\langle \ln x \rangle$, is held fixed.

Applications

Unlike a power-law distribution, the RGF distribution is not a straight line on a log-log plot. Data presented by Baek, Bernhardsson, and Minnhagen show that the curved RGF distribution matches certain real-world samples better than a straight-line power-law distribution.[1]
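The difference in shape can be checked numerically. The sketch below assumes the RGF form $P(k) \propto e^{-bk}/k^{\gamma}$ and compares its slope on a log-log scale with that of a pure power law $k^{-\gamma}$. The parameter values $\gamma = 2$ and $b = 0.01$ and the function names are hypothetical, chosen only to make the curvature visible; normalization constants are omitted because they do not affect the slope.

```python
import math

def log_log_slope(density, k1, k2):
    """Slope of ln(density) against ln(k) between k1 and k2."""
    rise = math.log(density(k2)) - math.log(density(k1))
    run = math.log(k2) - math.log(k1)
    return rise / run

def power_law(k, gamma=2.0):
    # Pure power law, unnormalized.
    return k ** -gamma

def rgf_shape(k, gamma=2.0, b=0.01):
    # Assumed RGF shape, unnormalized: power law with an exponential factor.
    return math.exp(-b * k) / k ** gamma

intervals = [(1, 10), (10, 100), (100, 1000)]
power_slopes = [log_log_slope(power_law, k1, k2) for k1, k2 in intervals]
rgf_slopes = [log_log_slope(rgf_shape, k1, k2) for k1, k2 in intervals]

for (k1, k2), p, r in zip(intervals, power_slopes, rgf_slopes):
    print(f"k in [{k1}, {k2}]: power law {p:.3f}, RGF {r:.3f}")
```

The power-law slope stays at $-\gamma$ on every interval, while the RGF slope becomes steeper as $k$ grows, which is the curvature seen on a log-log plot.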


References

  1. Baek, Seung Ki; Bernhardsson, Sebastian; Minnhagen, Petter (7 April 2011). "Zipf's Law Unzipped" (PDF). New Journal of Physics. 13.
  2. Visser, Matt (2013). "Zipf's law, power laws, and maximum entropy". New Journal of Physics. 15.