Hilberg's Hypothesis

Hilberg's Hypothesis (also known as Hilberg's Law) is a conjecture in quantitative linguistics or a statistical law in information theory. It states that measures of the information contained in natural language texts, or in samples of particular stochastic processes, grow as a sublinear power of the sample length, possibly in addition to a term that is linear in the sample length. It is closely related to Zipf's law and to the neural scaling law for large language models.

Overview


Hilberg's hypothesis was proposed by the German telecommunication engineer Wolfgang Hilberg in 1990,[1] based on data originally published by Claude Shannon in 1951 on the predictability of English text.[2] Hilberg observed that the amount of new information per character appears to decrease with context length in a manner consistent with a power law. His analysis implied that the Shannon entropy of text blocks $X_1^n = (X_1, \ldots, X_n)$ of length $n$ grows approximately as

$$H(X_1^n) \approx B n^\beta + h n, \qquad 0 < \beta < 1,$$

where parameter $h \ge 0$ is the entropy rate of the process and parameter $\beta$ is called the Hilberg exponent.[3] The term proportional to $n^\beta$ represents a memory effect, suggesting that human language carries a large amount of information in a repetitive way. Hilberg originally assumed that $h = 0$ and $\beta \approx 1/2$, basing his hypothesis on visually meagre evidence.[1] Hilberg's article was published in a local telecommunication journal and was initially noticed mainly by physicists working in complex systems.[4][5][6][7]
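
The shape of this ansatz can be illustrated with a short numerical sketch. The Python snippet below fits $H(n) \approx B n^\beta + h n$ to block-entropy estimates by non-linear least squares; the data array is synthetic and the function and variable names are illustrative, not taken from the cited works.

    # Sketch: fitting the Hilberg ansatz H(n) ~ B * n**beta + h * n
    # to block-entropy estimates (here synthetic numbers standing in for
    # estimates obtained from real text, e.g. via a compressor).
    import numpy as np
    from scipy.optimize import curve_fit

    def hilberg_ansatz(n, B, beta, h):
        """Block entropy: sublinear power-law term plus linear term."""
        return B * n**beta + h * n

    n = np.array([1, 2, 4, 8, 16, 32, 64, 128, 256, 512], dtype=float)
    H_est = 4.0 * n**0.6 + 1.2 * n + np.random.default_rng(0).normal(0.0, 0.5, n.size)

    # Fit B, beta, h; the bounds keep the exponent in (0, 1) and the entropy rate non-negative.
    (B, beta, h), _ = curve_fit(hilberg_ansatz, n, H_est, p0=(1.0, 0.5, 1.0),
                                bounds=([0.0, 0.0, 0.0], [np.inf, 1.0, np.inf]))
    print(f"B = {B:.2f}, Hilberg exponent beta = {beta:.2f}, entropy rate h = {h:.2f}")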

Hilberg's hypothesis implies that written or spoken language exhibits potentially infinite memory or long-range statistical dependencies that cannot be captured by finite-state hidden Markov models. This refutation of finite-state models based on Hilberg's hypothesis is close in spirit to Zipf's law for an unbounded arbitrary lexicon, which needs to be learned and memorized by heart.[8] This mechanism differs from the refutation of finite-state models based on the unbounded stack of the push-down automaton for processing the syntax of natural language, an argument developed by Noam Chomsky in Syntactic Structures in 1957.

By contrast to "Hilberg's hypothesis", the term "Hilberg's law" or "Hilberg's condition" often applies to any stationary process that satisfies a power-law scaling of .[8] Quantity may represent various information measures, such as Shannon entropy, (resource-bounded) Kolmogorov complexity,[9] or the length of a universal code.[10] In the context of deep learning, the term neural scaling law is used for analogous power-law relations describing how performance of a large language model, measured by cross entropy, improves with data size, model parameters, or computation.[11] Yet another expression involves the mutual information between two adjacent text blocks of length ,

Using this concept, Hilberg's law is equivalent to

This version does not depend on the precise value of the entropy rate and is used in theoretical studies.[12]
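
In practice, the block mutual information is often approximated by replacing Shannon entropy with the length of a compressed representation. The Python sketch below is a toy version of this idea using zlib, an off-the-shelf Lempel-Ziv compressor, rather than the prediction by partial matching code used in the cited studies; the file name corpus.txt is a placeholder.

    # Sketch: estimating I(X_1^n; X_{n+1}^{2n}) by
    # C(first block) + C(second block) - C(both blocks),
    # with entropy replaced by zlib-compressed length in bits.
    import zlib

    def compressed_bits(data: bytes) -> int:
        """Length in bits of the zlib-compressed representation of data."""
        return 8 * len(zlib.compress(data, 9))

    def block_mutual_information(text: bytes, n: int) -> int:
        """Crude estimate of the mutual information between two adjacent blocks of length n."""
        first, second, whole = text[:n], text[n:2 * n], text[:2 * n]
        return compressed_bits(first) + compressed_bits(second) - compressed_bits(whole)

    # Under Hilberg's law the estimates should grow roughly like n**beta.
    with open("corpus.txt", "rb") as f:  # any sufficiently long plain-text file
        text = f.read()
    for n in (1_000, 10_000, 100_000):
        print(n, block_mutual_information(text, n))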

The value of the Hilberg exponent depends crucially on the applied information measure, or on the compression algorithm in the case of the universal code. At the same time, it exhibits a certain degree of universality across particular languages and writing systems, taking approximately the same value for the prediction by partial matching code run on English, French, Russian, Korean, Chinese, and Japanese news corpora.[10][13]

Examples of processes


The intuitive meaning of Hilberg's hypothesis can be motivated by some toy models of stationary processes, which link Hilberg's law with Zipf's law for an unbounded arbitrary lexicon that needs to be learned by heart. Equivalently, the following examples may be considered as idealized information sources that convey an unbounded amount of algorithmically random knowledge in a repetitive fashion.[12] This knowledge can be either immutable, giving rise to non-ergodic or perigraphic processes,[14][15] or slowly evolving, yielding mixing processes.[16] A priori, this knowledge may also encode facts other than the knowledge of a particular lexicon.[14]

A simple model of a stochastic process that satisfies Hilberg's law is given by the Santa Fe process. The Santa Fe process is a stationary non-ergodic process that consists of pairs $X_i = (K_i, Z_{K_i})$, where $(K_i)$ is a sequence of independent random natural numbers distributed according to Zipf's law and $(Z_k)$ is the uniform Bernoulli process, i.e., a sequence of fair coin flips.[14][17] Process $(K_i)$ is called the narration and process $(Z_k)$ is called the knowledge.[8] Individual variables $Z_k$ are called (independent elementary) facts.[14]
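
The Santa Fe process is straightforward to simulate. The Python sketch below (with an arbitrary Zipf exponent and a helper name chosen for illustration) draws the narration from a Zipf distribution and reads the corresponding bits off a single fixed sequence of coin flips, so that repeated indices always reveal the same fact.

    # Sketch: sampling the Santa Fe process X_i = (K_i, Z_{K_i}).
    # The narration (K_i) is IID and Zipf-distributed; the knowledge (Z_k) is one
    # fixed sequence of fair coin flips shared by all time steps, which is what
    # makes the process non-ergodic. Only the finitely many bits actually
    # referenced are generated here.
    import numpy as np

    def santa_fe_sample(length: int, zipf_exponent: float = 2.0, seed: int = 0):
        rng = np.random.default_rng(seed)
        narration = rng.zipf(zipf_exponent, size=length)               # K_1, ..., K_length
        knowledge = rng.integers(0, 2, size=int(narration.max()) + 1)  # fair bits Z_k
        return [(int(k), int(knowledge[k])) for k in narration]        # pairs (K_i, Z_{K_i})

    print(santa_fe_sample(10))  # repeated values of K_i come with identical bits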

In general, Hilberg's law may arise for non-ergodic or perigraphic processes, depending on whether one considers Shannon entropy or Kolmogorov complexity as the chosen information measure. A perigraphic process is roughly an algorithmically random ergodic component of a non-ergodic process with a non-atomic invariant sigma-algebra. An example of a perigraphic process is a modified Santa Fe process that consists of pairs $X_i = (K_i, z_{K_i})$, where $(z_k)$ is a fixed algorithmically random binary sequence, such as the binary expansion of the halting probability.[15]

Hilberg's law may also arise for some mixing processes. An example of such a process is a modified Santa Fe process that consists of pairs $X_i = (K_i, Z_{i,K_i})$, where $(Z_{i,k})_i$ is a collection of independent binary Markov processes indexed by $k$, whose flipping probabilities are smaller than the mentioning probabilities $P(K_i = k)$.[16]
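
A small change to the sampler above illustrates this mixing variant: the knowledge bits are no longer frozen but may flip with a small probability at every time step. The flipping probability below is a single arbitrary value; in the cited construction it may depend on k and is kept below the probability of mentioning fact k.

    # Sketch: Santa Fe process with slowly evolving knowledge Z_{i,k}.
    # Every knowledge bit follows its own binary Markov chain with a small
    # flipping probability, which makes the process mixing rather than non-ergodic.
    import numpy as np

    def santa_fe_mixing_sample(length: int, zipf_exponent: float = 2.0,
                               flip_probability: float = 1e-4, seed: int = 0):
        rng = np.random.default_rng(seed)
        narration = rng.zipf(zipf_exponent, size=length)
        knowledge = rng.integers(0, 2, size=int(narration.max()) + 1)
        sample = []
        for k in narration:
            flips = rng.random(knowledge.size) < flip_probability  # rare bit flips
            knowledge = np.where(flips, 1 - knowledge, knowledge)
            sample.append((int(k), int(knowledge[k])))
        return sample

    print(santa_fe_mixing_sample(10))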

Theoretical connections

Hilberg's hypothesis connects quantitative linguistics, information theory, and modern machine-learning research on scaling behavior. By means of mathematical theorems, Hilberg's law can be related to some similar conditions for stationary processes over a finite, a countably infinite, or an uncountable set of possible outcomes.

Hilberg's law can be deductively linked to Zipf's law and Heaps' law for word-like subwords or chunks by means of grammar-based codes and prediction by partial matching, non-ergodic processes, and the data-processing inequality for mutual information.[14][15][12] There is a general result concerning stationary processes, called the theorem about facts and words, which states that the mutual information between two adjacent text blocks is sandwiched between the number of independent elementary facts and the number of distinct word-like strings expressed in the concatenated text block.[14][15][12]

Deductive links between Hilberg's law and the neural scaling law are a subject of ongoing research.[17][18][9]

Hilberg's hypothesis has been discussed as evidence that language production involves infinite memory, contrasting with finite-state Markov models and hidden Markov models.[8] In general, Hilberg's law is incompatible with finite-state hidden Markov models. The mutual information between the infinite past and the infinite future, $E = \lim_{n \to \infty} I(X_{-n+1}^0; X_1^n)$, is called excess entropy[7] or predictive information.[6][19] Condition $E < \infty$ holds for finite-state hidden Markov models[7] and non-degenerate autoregressive moving-average Gaussian processes.[20] By contrast, condition $E = \infty$ holds for non-ergodic processes with a non-atomic invariant sigma-algebra.[12] There also exist countably infinite-state hidden Markov models that enjoy condition $E = \infty$[21] and Hilberg's law.[22][8]
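
As a worked special case of this incompatibility, consider a stationary first-order Markov chain, the simplest finite-state model; the following sketch (not taken from the cited sources) uses the chain rule $H(X_1^n) = H(X_1) + (n-1) H(X_2 \mid X_1)$, which gives

$$I(X_1^n; X_{n+1}^{2n}) = 2 H(X_1^n) - H(X_1^{2n}) = H(X_1) - H(X_2 \mid X_1) = I(X_1; X_2)$$

for every $n$. The adjacent-block mutual information of a Markov chain is thus a constant, so $E = I(X_1; X_2) < \infty$, whereas Hilberg's law with $0 < \beta < 1$ forces $I(X_1^n; X_{n+1}^{2n}) \to \infty$.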

Less direct is the link between Hilberg's law and the power-law logarithmic scaling of the maximal repetition length, empirically observed for natural language,[23] since this scaling involves a power-law growth of Rényi entropies rather than of the Shannon entropy.[24]

In the case of stationary processes over a countable alphabet, it is unknown whether Hilberg's law (under a mild condition) implies a slow decay of the two-point mutual information $I(X_1; X_{n+1})$, which seems to hold for natural language.[25][26] For the non-ergodic Santa Fe processes, one also observes that $I(X_1; X_{n+1})$ does not decay at all, staying equal to a positive constant for all $n$.

In the case of real-valued non-degenerate Gaussian processes, condition $E = \infty$ is equivalent to $\sum_{k \ge 1} k \alpha_k^2 = \infty$, where $\alpha_k$ is the partial autocorrelation function.[27] Additionally, if the Gaussian process has a positive and continuous spectral density, then condition $E = \infty$ is equivalent to $\sum_{k \ge 1} k \rho_k^2 = \infty$, where $\rho_k$ is the autocorrelation function.[28] By contrast, the long-range dependence condition is $\sum_{k \ge 1} |\rho_k| = \infty$,[29] which is equivalent to the previous condition if the autocorrelation function decays exactly according to a power law. Thus Hilberg's law implies long-range dependence if the process is Gaussian, the spectral density is positive and continuous, and the autocorrelation function follows a power law.
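
The last equivalence can be checked directly under the power-law assumption (a sketch, assuming $\rho_k \sim C k^{-d}$ exactly for some $d > 0$): by comparison with $p$-series,

$$\sum_{k \ge 1} |\rho_k| = \infty \iff d \le 1, \qquad \sum_{k \ge 1} k \rho_k^2 \asymp \sum_{k \ge 1} k^{1 - 2d} = \infty \iff d \le 1,$$

so the long-range dependence condition and the divergence condition $\sum_{k \ge 1} k \rho_k^2 = \infty$ coincide in this case.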


References

  1. ^ a b Hilberg, Wolfgang (1990). "Der bekannte Grenzwert der redundanzfreien Information in Texten - eine Fehlinterpretation der Shannonschen Experimente?". Frequenz. 44 (9–10): 243. Bibcode:1990Freq...44..243H. doi:10.1515/FREQ.1990.44.9-10.243.
  2. ^ Shannon, Claude Elwood (1951). "Prediction and Entropy of Printed English". Bell System Technical Journal. 30: 50–64. Bibcode:1951BSTJ...30...50S. doi:10.1002/j.1538-7305.1951.tb01366.x.
  3. ^ Dębowski, Łukasz (2015). "Hilberg Exponents: New Measures of Long Memory in the Process". IEEE Transactions on Information Theory. 61 (10): 5716–5726. arXiv:1403.1757. Bibcode:2015ITIT...61.5716D. doi:10.1109/TIT.2015.2470675.
  4. ^ Ebeling, W.; Nicolis, G. (1991). "Entropy of Symbolic Sequences: The Role of Correlations". Europhysics Letters (Epl). 14 (3): 191–196. Bibcode:1991EL.....14..191E. doi:10.1209/0295-5075/14/3/001.
  5. ^ Ebeling, W.; Pöschel, T. (1994). "Entropy and Long-Range Correlations in Literary English". Europhysics Letters (Epl). 26 (4): 241–246. arXiv:cond-mat/0204108. Bibcode:1994EL.....26..241E. doi:10.1209/0295-5075/26/4/001.
  6. ^ a b Bialek, William; Nemenman, Ilya; Tishby, Naftali (2001). "Predictability, Complexity, and Learning". Neural Computation. 13 (11): 2409–2463. Bibcode:2001NeCom..13.2409B. doi:10.1162/089976601753195969. PMID 11674845.
  7. ^ a b c Crutchfield, James P.; Feldman, David P. (2003). "Regularities unseen, randomness observed: Levels of entropy convergence". Chaos: An Interdisciplinary Journal of Nonlinear Science. 13 (1): 25–54. Bibcode:2003Chaos..13...25C. doi:10.1063/1.1530990. PMID 12675408.
  8. ^ a b c d e Dębowski, Łukasz (2021). "A Refutation of Finite-State Language Models through Zipf's Law for Factual Knowledge". Entropy. 23 (9): 1148. Bibcode:2021Entrp..23.1148D. doi:10.3390/e23091148. PMC 8465033. PMID 34573773.
  9. ^ a b Achille, Alessandro; Soatto, Stefano (2025). "AI Agents as Universal Task Solvers". arXiv:2510.12066v1 [cs.AI].
  10. ^ a b Takahira, Ryosuke; Tanaka-Ishii, Kumiko; Dębowski, Łukasz (2016). "Entropy Rate Estimates for Natural Language—A New Extrapolation of Compressed Large-Scale Corpora". Entropy. 18 (10): 364. doi:10.3390/e18100364.
  11. ^ Kaplan, Jared; McCandlish, Sam; Henighan, Tom; Brown, Tom B.; Chess, Benjamin; Child, Rewon; Gray, Scott; Radford, Alec; Wu, Jeffrey; Amodei, Dario (2020). "Scaling Laws for Neural Language Models". arXiv:2001.08361 [cs.LG].
  12. ^ a b c d e Dębowski, Łukasz (2020). Information Theory Meets Power Laws. Wiley. doi:10.1002/9781119625384. ISBN 978-1-119-62527-8.
  13. ^ Tanaka-Ishii, Kumiko (2021). Statistical Universals of Language. Mathematics in Mind. Springer. doi:10.1007/978-3-030-59377-3. ISBN 978-3-030-59376-6.
  14. ^ a b c d e f Dębowski, Łukasz (2011). "On the Vocabulary of Grammar-Based Codes and the Logical Consistency of Texts". IEEE Transactions on Information Theory. 57 (7): 4589–4599. arXiv:0810.3125. Bibcode:2011ITIT...57.4589D. doi:10.1109/TIT.2011.2145170.
  15. ^ a b c d Dębowski, Łukasz (2018). "Is Natural Language a Perigraphic Process? The Theorem about Facts and Words Revisited". Entropy. 20 (2): 85. arXiv:1706.04432. Bibcode:2018Entrp..20...85D. doi:10.3390/e20020085. PMC 7512648. PMID 33265176.
  16. ^ a b Dębowski, Łukasz (2012). "Mixing, Ergodic, and Nonergodic Processes with Rapidly Growing Information Between Blocks". IEEE Transactions on Information Theory. 58 (6): 3392–3401. arXiv:1103.3952. Bibcode:2012ITIT...58.3392D. doi:10.1109/TIT.2012.2190708.
  17. ^ a b Hutter, Marcus (2021). "Learning Curve Theory". arXiv:2102.04074 [cs.LG].
  18. ^ Dębowski, Łukasz (2023). "A Simplistic Model of Neural Scaling Laws: Multiperiodic Santa Fe Processes". arXiv:2302.09049v1 [cs.IT].
  19. ^ Futrell, Richard; Hahn, Michael (2025). "Linguistic structure from a bottleneck on sequential information processing". Nature Human Behaviour. doi:10.1038/s41562-025-02336-w. PMID 41286011.
  20. ^ Finch, P. D. (1960). "On the Covariance Determinants of Moving-Average and Autoregressive Models". Biometrika. 47 (1/2): 194–196. doi:10.2307/2332974. JSTOR 2332974.
  21. ^ Travers, Nicholas; Crutchfield, James (2014). "Infinite Excess Entropy Processes with Countable-State Generators". Entropy. 16 (3): 1396–1413. Bibcode:2014Entrp..16.1396T. doi:10.3390/e16031396.
  22. ^ Dębowski, Łukasz (2014). "On Hidden Markov Processes with Infinite Excess Entropy". Journal of Theoretical Probability. 27 (2): 539–551. doi:10.1007/s10959-012-0468-6.
  23. ^ Dębowski, Łukasz (2015). "Maximal Repetitions in Written Texts: Finite Energy Hypothesis vs. Strong Hilberg Conjecture". Entropy. 17 (8): 5903–5919. doi:10.3390/e17085903.
  24. ^ Dębowski, Łukasz (2025). "Repetition and recurrence times: Dual statements and summable mixing rates". Electronic Communications in Probability. 30. doi:10.1214/25-ECP718.
  25. ^ Lin, Henry; Tegmark, Max (2017). "Critical Behavior in Physics and Probabilistic Formal Languages". Entropy. 19 (7): 299. arXiv:1606.06737. Bibcode:2017Entrp..19..299L. doi:10.3390/e19070299.
  26. ^ Wieczyński, Paweł; Dębowski, Łukasz (2025). "Long-Range Dependence in Word Time Series: The Cosine Correlation of Embeddings". Entropy. 27 (6): 613. Bibcode:2025Entrp..27..613W. doi:10.3390/e27060613. PMC 12191972. PMID 40566200.
  27. ^ Dębowski, Łukasz (2007). "On processes with summable partial autocorrelations". Statistics & Probability Letters. 77 (7): 752–759. doi:10.1016/j.spl.2006.11.012.
  28. ^ Li, Lei M. (2006). "Some Notes on Mutual Information Between Past and Future". Journal of Time Series Analysis. 27 (2): 309–322. doi:10.1111/j.1467-9892.2005.00469.x.
  29. ^ Beran, Jan (2017). Statistics for Long-Memory Processes. doi:10.1201/9780203738481. ISBN 978-0-203-73848-1.