
Commit af2b412

Add AION-Search blog post (#6)
* Fixed walrus splash link
* Added AION-Search blog
* Added cover image credit
* Updated AION-Search blog
* Converted .png to .jpg
1 parent 5a9721d commit af2b412

7 files changed

Lines changed: 87 additions & 2 deletions

File tree

‎_posts/2025-11-27-walrus-steering.md‎

Lines changed: 2 additions & 2 deletions
@@ -4,8 +4,8 @@ title: "Steerable Representations of Abstract Physics in Walrus"
 authors: Rio Alexa Fear, Payel Mukhopadhyay, Michael McCabe, Alberto Bietti, Miles Cranmer, The PolymathicAI Collaboration
 shorttitle: "Steerable Representations of Abstract Physics in Walrus"
 date: 2025-11-29 11:00
-smallimage: physics-steering-splash.jpg
-image: physics-steering-splash.jpg
+smallimage: walrus_steering/paper-schematic-W.png
+image: walrus_steering/paper-schematic-W.png
 blurb: Discovering that physics foundation models can learn steerable, domain-general representations of physical concepts.
 shortblurb: Discovering that physics foundation models can learn steerable, domain-general representations of physical concepts.
 splashimage: /images/blog/walrus_steering/paper-schematic-W.png

‎_posts/2025-12-16-aion-search.md‎

Lines changed: 85 additions & 0 deletions
@@ -0,0 +1,85 @@
1+
---
2+
layout: post
3+
title: "AION-Search: Semantic search for 100M+ galaxy images using AI-generated captions"
4+
authors: Nolan Koblischke, Liam Parker, Francois Lanusse, Irina Espejo Morales, Jo Bovy, Shirley Ho
5+
shorttitle: "AION-Search: Semantic Search for Galaxy Images"
6+
date: 2025-12-16 11:00
7+
image: aion-search/aionsearchsplash.jpg
8+
smallimage: aion-search/aionsearchsplash.jpg
9+
blurb: The first system to enable meaning-based search across 140 million galaxy images with no human annotation required.
10+
shortblurb: Semantic search across 140 million galaxy images using AI-generated captions.
11+
splashimage: /images/blog/aion-search/aionsearchsplash.jpg
12+
link: https://arxiv.org/abs/2512.11982
13+
github_link: https://github.com/NolanKoblischke/AION-Search
14+
permalink: /blog/aion-search/
15+
---
16+
17+
How could we best leverage "a country of geniuses in a datacenter" ([1](#fn1)) to explore massive scientific datasets and unveil discoveries?
18+
19+
When it comes to astrophysics, we are producing imaging data at a scale that makes manual interpretation impossible. These datasets contain hundreds of millions of galaxy images, and upcoming telescopes will increase this to billions. Extracting scientific value from survey datasets has traditionally required human annotation, even in the age of machine learning. However, human labelling is often limited to predefined categories and requires substantial time and coordination. We need semantic search: the ability to search based on meaning.
20+
21+
AION-Search uses large language models (LLMs) that can process image data, such as GPT-4, to generate captions for unlabeled galaxy images. It is the first system to enable meaning-based search across galaxy images with no human annotation required. It allows researchers to search by scientific intent rather than label availability, an essential capability when exploring massive datasets for rare phenomena, where the majority of observed objects may not be cataloged or classified at all.
22+
23+
---
24+
25+
Under the hood, AION-Search works in three steps:
26+
#### 1. Caption generation
27+
28+
First, an image-capable language model (such as GPT-4.1-mini) is shown a galaxy image and asked to describe its observable features in scientific terms. The model produces short descriptions (e.g., "face-on spiral with two arms and a central bar"), which are then converted into numerical representations that encode the meaning of the description. These captions serve as the semantic reference that later allows the system to search by concept rather than by visual similarity.
29+
30+
At this point, one might ask: if captions already provide a searchable semantic representation, why not simply generate captions for all galaxies?
31+
32+
The answer is cost.
33+
34+
Generating high-quality scientific descriptions for every image using vision-language models would be computationally and financially prohibitive. We need a way to obtain these semantic representations directly from images, without having to caption each one individually.
35+
36+
<p align="center">
37+
<img src="/images/blog/aion-search/fig1.jpg" alt="Caption generation process" width="65%">
38+
</p>
39+
40+
#### 2. Contrastive alignment
41+
42+
To address this, we train the model so that images and their corresponding descriptions end up close to each other in the same representation space. The image embedding from AION-1 and the meaning embedding from the caption are pulled together, while mismatched image-text pairs are pushed apart. This process is referred to as contrastive learning. After alignment, the model can predict the semantic embedding directly from an image, eliminating the need to generate captions for every sample. We use AION-1 here because its representations are already physically meaningful and well-suited for capturing galaxy morphology (learn more about AION-1 [here](/blog/aion-1/)).
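
As a rough illustration of the "pull together, push apart" objective described above, here is a minimal NumPy sketch of a symmetric contrastive loss. It assumes a CLIP-style formulation with a fixed temperature. The function names, the temperature value, and the random toy embeddings are all illustrative assumptions and are not taken from the AION-Search code.

```python
import numpy as np

def l2_normalize(x):
    # Scale each row to unit length so dot products become cosine similarities.
    return x / np.linalg.norm(x, axis=1, keepdims=True)

def log_softmax(logits, axis):
    # Numerically stable log-softmax along the given axis.
    shifted = logits - logits.max(axis=axis, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=axis, keepdims=True))

def contrastive_loss(image_emb, text_emb, temperature=0.07):
    # Row i of image_emb and row i of text_emb are assumed to be a matched pair.
    img = l2_normalize(image_emb)
    txt = l2_normalize(text_emb)
    logits = img @ txt.T / temperature  # pairwise similarities, shape (N, N)
    idx = np.arange(len(img))
    # Matched pairs sit on the diagonal; every other entry acts as a negative.
    loss_i2t = -log_softmax(logits, axis=1)[idx, idx].mean()
    loss_t2i = -log_softmax(logits, axis=0)[idx, idx].mean()
    return 0.5 * (loss_i2t + loss_t2i)

# Toy data (hypothetical): random "caption" embeddings and two sets of "image" embeddings.
rng = np.random.default_rng(0)
text_emb = rng.normal(size=(8, 16))
aligned = text_emb + 0.01 * rng.normal(size=(8, 16))  # images close to their own captions
shuffled = np.roll(text_emb, 1, axis=0)               # images paired with the wrong captions

loss_aligned = contrastive_loss(aligned, text_emb)
loss_shuffled = contrastive_loss(shuffled, text_emb)
```

In this toy setup the loss is low when each image embedding sits next to its own caption embedding and high when the pairs are mismatched, which is the signal a training loop would minimize.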
43+
44+
This is where we get the ability to search massive datasets with language queries such as "visible spiral arms":
45+
46+
<p align="center">
47+
<img src="/images/blog/aion-search/fig2.jpg" alt="Contrastive alignment" width="90%">
48+
</p>
49+
50+
#### 3. Improving discovery with re-ranking
51+
52+
After a semantic query is made, the system retrieves images whose embeddings are closest to the query in semantic space. For rare or subtle phenomena, only a small fraction of these candidates may be true matches. In a traditional workflow, a human expert would now manually examine the top few hundred images to determine which ones actually contain the feature of interest.
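
The retrieval step can be sketched as a nearest-neighbour lookup. The snippet below assumes cosine similarity as the notion of "closest" and uses a tiny hand-made gallery. The function name and the 2-D vectors are hypothetical, and a production system over millions of images would likely use an approximate nearest-neighbour index rather than a dense matrix product.

```python
import numpy as np

def top_k_by_cosine(query_emb, gallery_emb, k=3):
    # Normalize so that the dot product equals cosine similarity.
    q = query_emb / np.linalg.norm(query_emb)
    g = gallery_emb / np.linalg.norm(gallery_emb, axis=1, keepdims=True)
    scores = g @ q
    # Sort by descending similarity and keep the k best candidates.
    order = np.argsort(-scores)[:k]
    return order, scores[order]

# Toy gallery of four "image" embeddings (hypothetical values).
gallery = np.array([[1.0, 0.0], [0.0, 1.0], [0.7, 0.7], [-1.0, 0.0]])
# Toy embedding standing in for a text query.
query = np.array([1.0, 0.1])

indices, scores = top_k_by_cosine(query, gallery, k=2)
```

Here `indices` holds the gallery rows ranked most similar to the query, and `scores` holds their similarities in the same order.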
53+
54+
Instead, AION-Search delegates this review step to a more capable model, which evaluates each candidate and assigns a relevance score based on how well it matches the query. The results are then reordered according to these scores so that the targeted phenomena rise to the top of the list, which is especially useful when searching for rare phenomena such as strong gravitational lenses.
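
The re-ranking idea reduces to "score each retrieved candidate, then sort by score". The sketch below shows only that control flow. `score_fn` is a placeholder for whatever model assigns relevance, and `toy_score`, the candidate IDs, and the score values are invented for illustration; none of them come from AION-Search itself.

```python
from typing import Callable, List, Tuple

def rerank(
    candidates: List[str],
    query: str,
    score_fn: Callable[[str, str], float],
) -> List[Tuple[str, float]]:
    # Ask the scorer for a relevance score per candidate, then sort best-first.
    scored = [(candidate, score_fn(candidate, query)) for candidate in candidates]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)

def toy_score(candidate_id: str, query: str) -> float:
    # Hypothetical stand-in for a model call: fixed scores keyed by candidate ID.
    toy_scores = {"galaxy_a": 0.1, "galaxy_b": 0.9, "galaxy_c": 0.4}
    return toy_scores[candidate_id]

# Candidates in the order returned by the embedding search (hypothetical IDs).
retrieved = ["galaxy_a", "galaxy_b", "galaxy_c"]
reranked = rerank(retrieved, "strong gravitational lens", toy_score)
```

Keeping the scorer behind a plain callable means the same loop works whether the relevance judgment comes from a large vision-language model, a smaller classifier, or a human-in-the-loop tool.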
55+
56+
<p align="center">
57+
<img src="/images/blog/aion-search/fig3.jpg" alt="Re-ranking process" width="60%">
58+
</p>
59+
60+
---
61+
62+
#### Implications
63+
64+
For the first time, astronomers can run free-form searches over datasets containing millions of images using only a search engine. Researchers will not only be able to find the objects they have in mind, but may also land on new, serendipitous discoveries, or unknown unknowns! We believe AION-Search is a flexible way to explore large, image-based datasets of this kind, and that similar technology applied to other domains of science could change how researchers interact with data.
65+
66+
---
67+
68+
#### Try out AION-Search!
69+
70+
We have a public app to enable search over a ~20 million galaxy subset of the full dataset.
71+
72+
<p align="left">
73+
<a href="https://huggingface.co/spaces/astronolan/AION-Search" target="_blank" class="button-post">Try AION-Search</a>
74+
</p>
75+
76+
<p align="center">
77+
<img src="/images/blog/aion-search/fig4.jpg" alt="AION-Search demo" width="90%">
78+
</p>
79+
80+
*-- Sophie Barstein, Nolan Koblischke*
81+
82+
References.
83+
84+
<p id="fn1">(1) Dario Amodei, <a href="https://darioamodei.com/machines-of-loving-grace">"Machines of Loving Grace"</a>, 2024.</p>
85+
Cover image credit: DESI Legacy Imaging Surveys
228 KB

‎images/blog/aion-search/fig1.jpg‎

149 KB

‎images/blog/aion-search/fig2.jpg‎

912 KB

‎images/blog/aion-search/fig3.jpg‎

373 KB

‎images/blog/aion-search/fig4.jpg‎

116 KB
