Agentic Discovery with Active Hypothesis Exploration for Visual Recognition
preprint

Jaywon Koo, Jefferson Hernandez, Ruozhen He, Hanjie Chen, Chen Wei, Vicente Ordonez
arXiv:2604.12999
Lab News Desk

News Release Summary

This section is intentionally written in a reporter-style news release voice for general readers.

Researchers at Rice University have developed a system called HypoExplore that automates the process of designing neural network architectures for image recognition by treating the search as a structured scientific experiment rather than blind trial and error. The core problem the system addresses is that finding good neural architectures for specialized tasks, such as medical imaging, still typically requires significant human expertise and repeated manual iteration. Instead of starting from an existing network and tweaking it, HypoExplore begins from scratch with only a high-level research direction, using a large language model to generate architectural ideas framed as explicit testable hypotheses. The system tracks every experiment in a branching tree structure and maintains a memory bank that records how much evidence has accumulated for or against each hypothesis. It uses those confidence scores to guide what to try next, balancing exploitation of ideas that have worked against exploration of uncertain ones. Running on CIFAR-10, the system evolved from a starting accuracy of 18.91% to 94.11% over 50 iterations, ultimately discovering a compact 0.9-million-parameter architecture called the Global Shape Token Network that matched or outperformed several well-known manually engineered networks while using far fewer parameters. The system also achieved state-of-the-art results on medical imaging benchmarks when run independently on that domain. Notably, the researchers showed that the hypothesis confidence scores became genuinely predictive over time: high-confidence hypotheses correctly forecast experimental outcomes 80% of the time, suggesting the system was building real transferable knowledge about architecture design rather than just stumbling onto good solutions.

abstract

We introduce HypoExplore, an agentic framework that formulates neural architecture discovery for visual recognition as a hypothesis-driven scientific inquiry. Given a human-specified high-level research direction, HypoExplore ideates, implements, evaluates, and improves neural architectures through evolutionary branching. New hypotheses are created using a large language model by selecting a parent hypothesis to build upon, guided by a dual strategy that balances exploiting validated principles with resolving uncertain ones. Our proposed framework maintains a Trajectory Tree that records the lineage of all proposed architectures, and a Hypothesis Memory Bank that actively tracks confidence scores acquired through experimental evidence. After each experiment, multiple feedback agents analyze the results from different perspectives and consolidate their findings into hypothesis confidence updates. Our framework is tested on discovering lightweight vision architectures on CIFAR-10, with the best achieving 94.11% accuracy evolved from a root node baseline that starts at 18.91%, and generalizes to CIFAR-100 and Tiny-ImageNet. We further demonstrate applicability to a specialized domain by conducting independent architecture discovery runs on MedMNIST, which yield a state-of-the-art performance. We show that hypothesis confidence scores grow increasingly predictive as evidence accumulates, and that the learned principles transfer across independent evolutionary lineages, suggesting that HypoExplore not only discovers stronger architectures, but can help build a genuine understanding of the design space.

citation

@article{kooagentic,
  title = {Agentic Discovery with Active Hypothesis Exploration for Visual Recognition},
  author = {Koo, Jaywon and Hernandez, Jefferson and He, Ruozhen and Chen, Hanjie and Wei, Chen and Ordonez, Vicente},
  journal = {arXiv preprint arXiv:2604.12999},
  url = {https://arxiv.org/abs/2604.12999},
}

automatically generated questions, main contributions and limitations of this paper

Questions this paper helps answer

  • What is HypoExplore and what problem does it address? HypoExplore is a multi-agent LLM-based framework for automated neural architecture discovery that frames design exploration as hypothesis-driven scientific inquiry, aiming to reduce redundancy and myopia compared to prior architecture search systems.
  • What accuracy did HypoExplore achieve on CIFAR-10 and how does it compare to baselines? The best architecture discovered, GSTN with 0.9M parameters, reached 94.11% top-1 accuracy on CIFAR-10, surpassing ShuffleNet V2 at 90.1% and SqueezeNet at 91.1% with fewer parameters, though it fell short of MobileNet V3 at 95.5% and ResNet-18 at 95.4%.
  • How does HypoExplore select which architecture to develop next? It uses a two-stage selection strategy: a parent-node selector scores branches by combining validation accuracy and training efficiency with a measure of remaining untested hypotheses, and a hypothesis selector balances exploitation via Thompson sampling with exploration via an epistemic uncertainty score.
  • Does the hypothesis confidence scoring system produce meaningful predictions? Yes, the paper reports that prediction accuracy increases monotonically with confidence bin: 58% for the 0.25 to 0.5 confidence range, 65% for 0.5 to 0.75, and 80% for 0.75 to 1.0, all above the 50% chance baseline.
  • Can principles discovered in one architectural lineage transfer to others? The paper reports that cross-lineage hypothesis applications succeeded 65% of the time across 171 cases, comparable to within-lineage success at 57% across 93 cases, suggesting the learned principles are not lineage-specific.
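
Illustrative sketch of the hypothesis selector

The hypothesis-selection answer above can be made concrete with a small sketch. The summary only says that exploitation uses Thompson sampling and exploration uses an epistemic uncertainty score; everything else here is an assumption made for illustration: the Beta(1, 1) prior, the use of posterior variance as the uncertainty score, the `explore_weight` mixing parameter, and the names `Hypothesis` and `select_hypothesis`. It should not be read as the paper's actual scoring rule.

```python
import random
from dataclasses import dataclass


@dataclass
class Hypothesis:
    name: str
    successes: float = 0.0  # accumulated evidence for the hypothesis (assumed bookkeeping)
    failures: float = 0.0   # accumulated evidence against it

    def posterior(self):
        # Assumption: Beta(1, 1) prior plus the accumulated evidence counts.
        return 1.0 + self.successes, 1.0 + self.failures

    def uncertainty(self):
        # Assumption: epistemic uncertainty is the Beta posterior variance,
        # rescaled so that an untested hypothesis (Beta(1, 1)) scores 1.0.
        a, b = self.posterior()
        variance = (a * b) / ((a + b) ** 2 * (a + b + 1.0))
        return variance / (1.0 / 12.0)


def select_hypothesis(hypotheses, explore_weight=0.5, rng=None):
    """Pick one hypothesis by mixing a Thompson sample with an uncertainty bonus."""
    rng = rng or random.Random()

    def score(h):
        a, b = h.posterior()
        exploit = rng.betavariate(a, b)  # Thompson sampling: one draw from the posterior
        explore = h.uncertainty()        # larger when little evidence has been collected
        return (1.0 - explore_weight) * exploit + explore_weight * explore

    return max(hypotheses, key=score)
```

With `explore_weight=0.0` the rule reduces to plain Thompson sampling, and with `explore_weight=1.0` it always picks the least-tested hypothesis; intermediate values trade the two off.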

Main contributions

  • HypoExplore introduces a Trajectory Tree that records the full lineage of architectural experiments and a Hypothesis Memory Bank that tracks confidence scores updated with weighted evidence after each experiment.
  • The system discovered GSTN, a 0.9M-parameter architecture that reaches 94.11% on CIFAR-10 and, without further architecture changes, 72.6% on CIFAR-100 and 58.1% on Tiny-ImageNet.
  • An independent discovery run on DermaMNIST produced an architecture achieving 82.1% on DermaMNIST and 73.9% on TissueMNIST, which the authors report as state-of-the-art on those two tasks among the methods compared.
  • Ablation experiments show that removing any one of hypothesis-driven search, multi-agent feedback, hypothesis selection, or parent selection causes the system to plateau below the full system's 94.1% ceiling.
  • The paper demonstrates that hypothesis confidence scores become increasingly calibrated to actual experimental outcomes as evidence accumulates, and that validated hypothesis counts co-move with accuracy gains over the 50-iteration search.
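
Illustrative sketch of the two data structures

The Trajectory Tree and Hypothesis Memory Bank named in the contributions can be pictured as two small data structures. This is a minimal sketch under stated assumptions, not the authors' implementation: the field names, the 0.5 prior, and the weighted-average confidence formula are all hypothetical, chosen only to show how lineage tracking and weighted evidence updates might fit together.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Tuple


@dataclass(eq=False)
class Node:
    node_id: int
    hypothesis: str   # the hypothesis this architecture was built to test
    accuracy: float   # validation accuracy measured for this architecture
    parent: Optional["Node"] = field(default=None, repr=False)
    children: List["Node"] = field(default_factory=list)

    def lineage(self) -> List[int]:
        """Return node ids from the root down to this node."""
        path, current = [], self
        while current is not None:
            path.append(current.node_id)
            current = current.parent
        return path[::-1]


class TrajectoryTree:
    """Records every proposed architecture together with its parent."""

    def __init__(self):
        self.nodes: Dict[int, Node] = {}

    def add(self, hypothesis, accuracy, parent_id=None) -> Node:
        parent = self.nodes[parent_id] if parent_id is not None else None
        node = Node(len(self.nodes), hypothesis, accuracy, parent)
        if parent is not None:
            parent.children.append(node)
        self.nodes[node.node_id] = node
        return node


class HypothesisMemoryBank:
    """Keeps weighted evidence per hypothesis (assumed scheme)."""

    def __init__(self, prior=0.5, prior_weight=1.0):
        self.prior = prior
        self.prior_weight = prior_weight
        self.evidence: Dict[str, List[Tuple[float, float]]] = {}

    def update(self, hypothesis, supported, weight=1.0):
        # Each experiment adds one (outcome, weight) pair for the hypothesis.
        outcome = 1.0 if supported else 0.0
        self.evidence.setdefault(hypothesis, []).append((outcome, weight))

    def confidence(self, hypothesis) -> float:
        # Assumption: confidence is a weighted mean of outcomes, pulled toward the prior.
        items = self.evidence.get(hypothesis, [])
        numerator = self.prior * self.prior_weight + sum(o * w for o, w in items)
        denominator = self.prior_weight + sum(w for _, w in items)
        return numerator / denominator
```

In this sketch an untested hypothesis sits at the prior of 0.5, and each call to `update` moves its confidence toward 1.0 or 0.0 in proportion to the weight given to that piece of evidence.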

Limitations and cautions

  • The current evaluation focuses on CIFAR-10, CIFAR-100, Tiny-ImageNet, and MedMNIST rather than full ImageNet-scale training; this leaves room for future work to test whether the same hypothesis-driven search advantages carry over to larger visual recognition settings.
  • The framework uses GPT-4o-mini for all agent roles, so reproducibility and deployment cost depend partly on access to capable LLM APIs; at the same time, the paper's explicit Trajectory Tree and Hypothesis Memory Bank make the reasoning process more inspectable than many black-box search pipelines.
  • The search budget is 50 iterations from 5 root architectures, so additional experiments would be needed to map the method's scaling behavior; the strong gains achieved within this modest budget are nevertheless a useful signal that the search strategy is efficient.
  • The MedMNIST comparison is not perfectly uniform because several baselines report only some tasks, but the independent discovery run still provides encouraging evidence that HypoExplore can adapt beyond CIFAR-style natural image benchmarks.
  • The paper demonstrates image classification rather than detection, segmentation, or non-vision domains, so those applications remain open; the transferable hypothesis evidence across lineages makes that extension plausible and worth investigating.

How to read this result

This paper is best read as a promising and unusually interpretable step toward agentic scientific discovery for visual recognition: its limitations are real, especially around larger-scale validation, but the reported accuracy gains, transferable hypothesis evidence, and compact discovered architectures make the work a strong positive contribution.