Making centuries of Buddhist iconographic knowledge accessible through purpose-built computer vision

Buddhist art is one of the richest visual traditions on earth: thousands of deities, bodhisattvas, protectors, and teachers, each depicted according to precise iconographic rules developed over centuries. A single thangka painting can contain dozens of figures, each identifiable by its body color, hand gestures (mudras), sacred objects, posture, and companion figures. This knowledge lives in the minds of trained scholars and monks, but it is inaccessible to most people who encounter Buddhist art.
TermaVision is not a generic AI or a fine-tuned large language model. It is a small, purpose-built vision model with a unique architecture designed from the ground up for one task: identifying sacred figures in Buddhist artwork. It knows nothing else — it only speaks Buddhist art.
What makes it different is its architecture. The model detects individual figures in a complex composition, then classifies each one against 93 trained classes. But it doesn't stop there — it cross-references every identification against a hand-built iconography database of 557 deities, checking body color, hand gestures, sacred objects, and posture. Then it applies compositional reasoning: knowledge of traditional Buddhist groupings like the Rigsum Gonpo (Avalokiteshvara, Manjushri, Vajrapani) or the Five Dhyani Buddhas, using the presence of one figure to confirm or correct the identification of others.
General-purpose AI systems are not built for this. The task requires a domain-specific architecture that encodes centuries of iconographic knowledge into the system itself. TermaVision serves scholars, museums, practitioners, and anyone who encounters Buddhist art and wants to understand what they are seeing.
Not a large language model: a small, specialized vision model with a custom architecture built from the ground up for Buddhist art. It knows nothing else.
Multi-figure detection: finds every individual figure in a complex thangka containing dozens of deities, even in crowded compositions with overlapping figures
93-class deity classification: identifies deities, bodhisattvas, dharma kings, teachers, arhats, and protectors across Tibetan, Himalayan, and broader Buddhist traditions
Iconography database: a hand-built knowledge base of 557 deities with body colors, mudras, sacred objects, postures, and lineage information, used to verify every identification
Compositional reasoning: encodes knowledge of 8 traditional Buddhist groupings (Rigsum Gonpo, Tse Lha Nam Sum, Five Dhyani Buddhas, Eight Great Bodhisattvas, and others) so that the presence of one figure can confirm or correct the identification of others
Iconographic output: returns Tibetan name, Sanskrit name, lineage, category, known aliases, and associated symbolism for every identified figure
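The database verification described in the list above can be sketched in a few lines of Python. This is a minimal illustration, not TermaVision's actual code: the `IconographyEntry` record layout, the `verify` function, the two simplified sample entries, and the equal weighting of the four attribute checks are all assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class IconographyEntry:
    """Hypothetical record layout; the real database schema may differ."""
    body_color: str
    mudra: str
    objects: frozenset
    posture: str


# Two simplified, illustrative entries (many deities have several forms).
DATABASE = {
    "Avalokiteshvara": IconographyEntry(
        "white", "anjali", frozenset({"lotus", "mala"}), "seated"
    ),
    "Vajrapani": IconographyEntry(
        "blue", "tarjani", frozenset({"vajra"}), "standing"
    ),
}


def verify(label, observed):
    """Compare observed attributes with the database entry for `label`.

    Returns (score, mismatches), where score is the fraction of the four
    attribute checks that passed. Equal weights are an assumption.
    """
    entry = DATABASE[label]
    mismatches = []
    for attr in ("body_color", "mudra", "posture"):
        if observed.get(attr) != getattr(entry, attr):
            mismatches.append(attr)
    # The object check passes if at least one expected object was observed.
    if not entry.objects & set(observed.get("objects", ())):
        mismatches.append("objects")
    return 1 - len(mismatches) / 4, mismatches


# A figure classified as Avalokiteshvara but observed as blue and holding a vajra.
observed = {
    "body_color": "blue",
    "mudra": "anjali",
    "objects": ["vajra"],
    "posture": "seated",
}
score, mismatches = verify("Avalokiteshvara", observed)
print(score, mismatches)
```

In this sketch the blue body and the vajra are flagged as mismatches, which is the kind of signal a verification stage could use to question the classifier's label.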
Locates every individual figure in the artwork, even in complex multi-figure thangkas. Adapts automatically — no retraining needed when new figure types are added.
Each detected figure is isolated and classified independently against 93 trained classes. The model is purpose-built and lightweight — not a general AI repurposed for this task.
Every identification is cross-checked against a hand-built database of 557 deities — verifying body color, hand gestures, sacred objects, and posture. This is where domain knowledge is encoded directly into the system.
The system understands how Buddhist figures appear together. Knowledge of 8 traditional groupings allows it to use context — if Avalokiteshvara is present, it knows to look for Manjushri and Vajrapani nearby.
The system also runs color mismatch detection, duplicate detection, and confidence calibration. Each identification receives a confidence level: confident, likely, ambiguous, or uncertain.
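The grouping and confidence steps above can be illustrated with a short Python sketch. Everything in it is an assumption for illustration: the `GROUPS` table lists only the Rigsum Gonpo triad named in the text, and the boost size, the score thresholds, and the function names are hypothetical rather than TermaVision's calibrated values.

```python
# Hypothetical grouping table; only the Rigsum Gonpo triad is listed here.
GROUPS = {
    "Rigsum Gonpo": {"Avalokiteshvara", "Manjushri", "Vajrapani"},
}

CONTEXT_BOOST = 0.10  # assumed score increase per companion figure present


def apply_group_context(detections):
    """Raise the score of figures whose traditional companions are present.

    `detections` is a list of (label, score) pairs for one artwork.
    """
    labels = {label for label, _ in detections}
    adjusted = []
    for label, score in detections:
        companions = 0
        for members in GROUPS.values():
            if label in members:
                # Count other members of the same group found in the image.
                present = (members & labels) - {label}
                companions = max(companions, len(present))
        adjusted.append((label, min(1.0, score + CONTEXT_BOOST * companions)))
    return adjusted


def confidence_level(score):
    """Map a score to one of the four levels (thresholds are assumed)."""
    if score >= 0.90:
        return "confident"
    if score >= 0.70:
        return "likely"
    if score >= 0.50:
        return "ambiguous"
    return "uncertain"


detections = [("Avalokiteshvara", 0.95), ("Manjushri", 0.62), ("Vajrapani", 0.55)]
for label, score in apply_group_context(detections):
    print(label, confidence_level(score))
```

With these assumed values, Manjushri alone at 0.62 would be labeled ambiguous, but the presence of both companions lifts it to likely. A production system would presumably also handle the "correct" case, where context suggests a different label, which this sketch omits.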
Thupten N. Chakrishar · 2025
Abstract
This paper presents TermaVision, an automated multi-stage pipeline that combines frozen vision-language features, a lightweight classifier, zero-shot attribute verification, and a structured iconography knowledge graph to identify Buddhist figures in thangka paintings, statues, and murals. The system achieves 94.4% top-1 accuracy across 93 classes, processing images in 1.3–4 seconds on a consumer GPU. Unlike generic large language models, TermaVision employs a purpose-built architecture with a zero-shot subject filter, compositional reasoning encoding canonical Buddhist groupings, and a 557-deity knowledge graph with structured iconographic attributes. The pipeline also provides zero-shot provenance classification, iconometric proportion assessment based on the traditional Tibetan Angula system, and regional artistic style classification.
Key Findings
94.4% top-1 accuracy across 93 deity classes using a purpose-built vision model
557-deity iconography knowledge graph with 9 attribute categories for explainable verification
Compositional reasoning module encoding 8 canonical Buddhist figure groupings
Zero-shot provenance, iconometric, and regional style analysis without additional training data
Processes images in 1.3–4 seconds vs. 8–35 seconds for the prior system