Artificial Intelligence

2602 Submissions

[5] ai.viXra.org:2602.0128 [pdf] submitted on 2026-02-28 03:00:54

Context Length as Implicit Inductive Bias in Large Language Models: A Structured Review and Formal Synthesis

Authors: Sif Almaghrabi
Comments: 16 Pages.

We present a structured literature review synthesizing 72 publications across eight research streams to develop and evaluate the thesis that context length functions as an implicit inductive bias in large language models (LLMs). We formalize this claim through four operational diagnostics (output entropy, distributional shift under context perturbation, anchoring tendency, and search-space contraction), each defined as a measurable quantity derivable from the predictive distribution pθ(y | x, C). Five testable hypotheses are stated with explicit falsification conditions and graded against a three-point study-quality rubric. Four convergent patterns emerge: (i) robust non-monotonic accuracy as a function of context length across tasks, models, and experimental controls; (ii) predictable interactions between context length and reasoning depth, with a difficulty-dependent optimum; (iii) measurable search-space contraction quantifiable via semantic entropy; and (iv) formal parallels to classical inductive bias in overparameterized models. This paper does not introduce novel algorithms or experimental results; its contributions are a formal diagnostic framework, a quality-graded evidence matrix, a causal analysis of confounding factors limiting current claims, and a prioritized research agenda of six open problems with proposed experimental protocols.
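The output-entropy diagnostic described above can be sketched directly from a predictive distribution. The distributions below are hypothetical illustrations (not values from the review), chosen only to show how a longer, more constraining context would register as lower entropy:

```python
import math

def output_entropy(probs):
    """Shannon entropy (in nats) of a predictive distribution p(y | x, C)."""
    return -sum(p * math.log(p) for p in probs if p > 0)

# Hypothetical next-token distributions under short vs. long context:
short_ctx = [0.25, 0.25, 0.25, 0.25]   # diffuse: maximal entropy over 4 outcomes
long_ctx  = [0.85, 0.05, 0.05, 0.05]   # peaked: context has contracted the search space

print(output_entropy(short_ctx))  # log(4), the 4-outcome maximum
print(output_entropy(long_ctx))   # strictly smaller
```

Under this diagnostic, search-space contraction shows up as the entropy gap between the two conditions.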
Category: Artificial Intelligence

[4] ai.viXra.org:2602.0122 [pdf] submitted on 2026-02-26 10:02:56

Reasoning Trace Length and Accuracy in Large Language Models: A Structured Meta-Analysis of Published Benchmarks

Authors: Sif Almaghrabi
Comments: 22 Pages.

We present a structured meta-analysis examining the relationship between chain-of-thought (CoT) reasoning trace length and task accuracy across 22 large language models spanning five provider families and 14 benchmarks covering mathematics, code generation, scientific reasoning, and general knowledge. All results are drawn from published technical reports, system cards, and peer-reviewed evaluations; no new experiments are conducted. We aggregate over 300 model-benchmark data points, though we note that cross-source comparisons are subject to protocol heterogeneity that limits strict commensurability. We document five principal observational patterns: (1) Reasoning-augmented models consistently outperform their standard counterparts on hard multi-step tasks, with reported accuracy differences of 40-81 pp on competition mathematics, though these differences confound reasoning-specific gains with concurrent architecture and training improvements; (2) Within the single controlled setting where token-budget data are available (Claude 3.7 Sonnet on AIME 2024, n = 30 test items), the accuracy-token relationship is well-described by a logarithmic fit (R² = 0.97, n = 7 reconstructed data points), though this fit cannot be statistically distinguished from several alternative functional forms given the small sample and measurement uncertainty; (3) The observed accuracy differences are strongly domain-dependent, ranging from large positive gains on competition math to negative effects on factual recall; (4) Estimated per-query costs increase nonlinearly near the accuracy frontier, though cost estimates carry substantial uncertainty from token accounting and pricing volatility; and (5) Published faithfulness studies report that visible CoT reflects actual model reasoning in only 25-39% of probed cases. We propose formal efficiency metrics, discuss their limitations, and provide a practitioner-oriented deployment framework. All data tables are released. We classify our conclusions as observational rather than causal, and discuss the confounds that prevent stronger inference.
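A logarithmic accuracy-token fit of the kind reported in pattern (2) can be sketched with ordinary least squares on log-transformed budgets. The data points below are invented placeholders, not the paper's seven reconstructed values:

```python
import numpy as np

# Hypothetical (token_budget, accuracy) pairs, illustrative only.
tokens = np.array([1_000, 2_000, 4_000, 8_000, 16_000, 32_000, 64_000])
acc    = np.array([0.20, 0.31, 0.40, 0.52, 0.61, 0.70, 0.80])

# Fit accuracy = a * ln(tokens) + b by least squares on the log scale.
a, b = np.polyfit(np.log(tokens), acc, 1)

# Coefficient of determination for the fitted curve.
pred = a * np.log(tokens) + b
ss_res = np.sum((acc - pred) ** 2)
ss_tot = np.sum((acc - acc.mean()) ** 2)
r2 = 1 - ss_res / ss_tot
print(f"a={a:.4f}, b={b:.4f}, R^2={r2:.3f}")
```

As the abstract cautions, a high R² on seven points does not rule out alternative functional forms; comparing against, e.g., a power-law fit on the same data would be the natural next check.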
Category: Artificial Intelligence

[3] ai.viXra.org:2602.0101 [pdf] submitted on 2026-02-21 19:14:58

Conversation Fragility is Heavy-Tailed: Quantile Reliability Curves for Multi-Turn LLM Evaluation

Authors: Michael Zot
Comments: 8 Pages. (Note by ai.viXra.org Admin: Please cite all listed scientific references)

Multi-turn dialogue is where large language models (LLMs) are most useful, and also where they most often "get lost". Prior work reports that average performance drops substantially from single-turn to multi-turn settings, and argues that the dominant driver is increased unreliability rather than a large loss of peak capability. We replicate and extend this picture using a quantile-based analysis over thousands of stochastic generations, with an emphasis on distribution shape rather than averages. Across seven jobs we analyze N = 5,100 scored generations: 30 instructions per job, 10 stochastic runs per instruction, and 1 to 3 turns per run. For each instruction and turn we compute (i) aptitude A90, the 90th percentile of score across runs, and (ii) unreliability U90-10, the 90th-to-10th percentile spread. Our core result is a heavy-tailed fragility surface: most instructions remain perfectly stable with U = 0, while a small minority contribute most of the unreliability at later turns. Across multi-turn replications, the top 3 most fragile instructions at turn 2 explain 54% to 91% of total unreliability. This yields a practical taxonomy of dialogue dynamics (stable, monotone degradation, and instability then recovery) and suggests new training and evaluation targets: recovery and variance control, not just average accuracy.
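The A90 and U90-10 metrics can be sketched per instruction from its runs' scores. The quantile estimator below (linear interpolation) and the example score vectors are assumptions for illustration; the abstract does not specify the paper's exact estimator:

```python
def quantile(xs, q):
    """Linear-interpolation quantile (an assumed estimator; the paper's
    exact choice is not specified in the abstract)."""
    xs = sorted(xs)
    pos = q * (len(xs) - 1)
    lo = int(pos)
    hi = min(lo + 1, len(xs) - 1)
    frac = pos - lo
    return xs[lo] * (1 - frac) + xs[hi] * frac

def aptitude_and_unreliability(scores):
    """A90 = 90th percentile of score across runs;
    U90-10 = 90th-to-10th percentile spread."""
    a90 = quantile(scores, 0.90)
    u90_10 = a90 - quantile(scores, 0.10)
    return a90, u90_10

# Hypothetical per-run scores for one stable and one fragile instruction:
stable  = [1.0] * 10                                            # U = 0
fragile = [1.0, 1.0, 0.9, 0.2, 0.0, 1.0, 0.1, 1.0, 0.0, 0.8]   # wide spread

print(aptitude_and_unreliability(stable))
print(aptitude_and_unreliability(fragile))
```

Note that both instructions have the same A90 (peak capability is intact); only U90-10 separates them, which is exactly the heavy-tailed fragility picture the paper describes.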
Category: Artificial Intelligence

[2] ai.viXra.org:2602.0066 [pdf] submitted on 2026-02-13 20:21:39

Bounded Symbolic Observability: A Cross-Domain Constraint in Computational Dynamics

Authors: David Taylor
Comments: 28 Pages. (Note by viXra Admin: Please cite and list scientific references)

Finite local symbolic observation exhibits bounded vocabularies across diverse computational domains despite systematic increases in observational scale. We apply a fixed local symbolic encoding framework to 13 systems spanning quantum mechanics, fluid dynamics, thermodynamics, electromagnetism, chaos theory, number theory, combinatorial logic, and stochastic processes. Across all domains, observed symbolic vocabularies saturate, with a median final growth of 0.0% despite 100-1,000× increases in data volume, temporal extent, or problem size. Prime gap dynamics provides the strongest validation: an infinite, deterministic mathematical sequence with no physical dynamics saturates at 837 symbolic configurations across a 10,000× scale increase (100,000 to 1,000,000,000 primes, identical vocabulary), eliminating physical mechanisms as explanations. At one billion primes, each of the 837 patterns is reused approximately 1.2 million times. Ten domains achieve perfect saturation (0.0%), two near-perfect (<1%), and one strong (<20%). Symbolic space occupancy ranges from 0.08% (Schrödinger equation) to 92.35% (electromagnetic waves); both regimes nonetheless exhibit saturation. Saturation manifests independently of physical validity (thermodynamically invalid antidiffusion saturates identically to correct heat diffusion), determinism (chaotic and stochastic systems both saturate), and computational complexity (NP-complete 3-SAT collapses to eight symbolic patterns). These results indicate that bounded symbolic observability reflects properties of finite local observation applied to locally-constrained dynamics rather than intrinsic system complexity: a constraint on measurement, not nature. Quantitative vocabularies are specific to the observational architecture employed; the empirical claim concerns the cross-domain emergence of vocabulary saturation under fixed local symbolic observation.
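The general shape of the measurement (a fixed local symbolic encoding whose distinct-pattern count is bounded by construction) can be sketched as follows. The quantize-and-window encoding and the logistic-map test system are assumptions for illustration; the paper's actual observational architecture is not specified in the abstract:

```python
def symbolic_vocabulary(seq, window=3, n_bins=4):
    """Fixed local symbolic encoding (illustrative): quantize each value
    into n_bins levels, then collect the distinct length-`window` symbol
    tuples. The vocabulary is bounded by n_bins**window regardless of
    how long the observed sequence grows."""
    lo, hi = min(seq), max(seq)
    span = (hi - lo) or 1.0
    symbols = [min(int((x - lo) / span * n_bins), n_bins - 1) for x in seq]
    return {tuple(symbols[i:i + window]) for i in range(len(symbols) - window + 1)}

def logistic_orbit(n, r=3.9, x0=0.5):
    """Chaotic logistic-map orbit, a stand-in locally-constrained system."""
    xs, x = [], x0
    for _ in range(n):
        x = r * x * (1 - x)
        xs.append(x)
    return xs

v_small = symbolic_vocabulary(logistic_orbit(1_000))
v_large = symbolic_vocabulary(logistic_orbit(100_000))
print(len(v_small), len(v_large))  # both capped at 4**3 = 64 patterns
```

The cap here is a property of the observation scheme, not of the orbit; the paper's empirical claim is that observed vocabularies saturate well before such bounds force them to.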
Category: Artificial Intelligence

[1] ai.viXra.org:2602.0039 [pdf] submitted on 2026-02-08 04:40:53

Role-Based Multi-Agent Reasoning Frameworks

Authors: Isaiah Nwukor
Comments: 15 Pages.

Individual artificial intelligence systems face an inherent trade-off between plasticity and stability under resource constraints. I propose that general intelligence emerges from networks of specialized agents applying a structured reasoning cycle to answer four fundamental questions. Agents ground abstract patterns through affective valence embeddings and coordinate via a shared database of credibility-weighted knowledge packages. I formalize a five-stage reasoning engine (Salience Detection → Hypothesis Generation → Experimentation → Structural Correspondence → Generalization) where agents at different stages specialize in different questions, enabling zero-shot cross-domain transfer. Using ARC-AGI task "as66" as demonstration, I show 276 generations of evolutionary learning where complementary specialization yields a current maximum of Level 4 performance across agents [20]. This framework provides testable predictions for performance scaling, transfer capability, and behavioral signatures of reasoning integration.
Category: Artificial Intelligence