Thupten N. Chakrishar
Terma Heritage Foundation, Inc., New York, NY, USA
Correspondence: info@termafoundation.org
Buddhist art identification — the task of determining which deity, bodhisattva, or sacred figure is depicted in a thangka painting, statue, or mural — remains a specialized skill requiring years of training in Tibetan iconography. This paper presents TermaVision, an automated multi-stage pipeline that combines frozen vision-language features, a lightweight classifier, zero-shot attribute verification, and a structured iconography knowledge graph to identify Buddhist figures with 94.4% top-1 accuracy across 93 classes. The system employs SigLIP-so400m as a frozen feature extractor, feeding 1152-dimensional embeddings into a three-layer MLP classifier trained on approximately 5,500 images with 5× data augmentation. A zero-shot subject filter using SigLIP’s text encoder distinguishes Buddhist artwork from photographs of real people, preventing confident misclassification on out-of-distribution inputs. For multi-figure compositions, Grounding DINO provides zero-shot figure detection, and a compositional reasoning module encodes canonical Buddhist groupings (e.g., Rigsum Gonpo, Five Dhyani Buddhas) to correct cross-figure misidentifications. Attribute verification via SigLIP text-image similarity cross-checks predicted identities against a knowledge graph of 557 deities with structured iconographic attributes (body color, number of arms, held objects, mudras). Beyond identification, the pipeline includes zero-shot provenance classification, iconometric proportion assessment based on the traditional Tibetan Angula system, regional artistic style classification, and nearest-neighbor visual memory recall for uncertainty reduction. The complete pipeline processes a single image in 1.3–4 seconds on a consumer GPU.
Tibetan Buddhist art encompasses one of the world’s most complex iconographic traditions. A single thangka painting may contain dozens of distinct deity figures, each identified by a precise combination of body color, number of arms and faces, hand gestures (mudras), held objects (such as vajras, lotuses, or skull cups), crown type, posture, and accompanying consort or mount. Correct identification requires extensive training in Tibetan Buddhist iconography — knowledge that is concentrated among a diminishing number of scholars and practitioners worldwide.
While several digital repositories of Buddhist art exist, expert curation remains a bottleneck: the vast majority of Buddhist artworks in private collections, monasteries, and museums remain unidentified or misidentified. This gap in expert identification represents a significant barrier to cultural heritage preservation, provenance research, and scholarly analysis.
Recent advances in vision-language models have demonstrated remarkable zero-shot capabilities across diverse visual domains. However, Buddhist iconography presents unique challenges that distinguish it from standard image classification: (1) extreme intra-class variation, as the same deity can appear in peaceful, semi-wrathful, and wrathful forms across painting, statue, and mural media; (2) subtle inter-class distinctions, where deities are differentiated by specific combinations of attributes rather than gross visual differences; (3) compositional context dependencies, where the identity of one figure constrains the identity of adjacent figures in known groupings; and (4) the critical requirement for explainability, as scholars require not merely a label but evidence supporting the identification.
Several prior works have explored deep learning for Thangka analysis. Ma et al. constructed a Tibetan Thangka dataset and addressed tasks including figure detection and segmentation. Tang and Xie proposed a contrastive learning approach for classifying Thangka cultural elements. Xue et al. applied DenseNet with squeeze-and-excitation networks for Yidam (meditational deity) classification. However, these works focus on narrow subsets of the iconographic tradition and do not address the full pipeline of detection, classification, attribute verification, and compositional reasoning required for practical deployment.
In this paper, we present TermaVision, a comprehensive multi-stage pipeline designed for production use by the Terma Heritage Foundation. Our contributions are a frozen-feature classification architecture (SigLIP-so400m with a lightweight MLP head), a zero-shot subject filter for out-of-distribution inputs, compositional reasoning over canonical Buddhist groupings, attribute verification against a 557-deity iconography knowledge graph, and zero-shot art-historical analysis modules (provenance, iconometry, regional style).
The CLIP model introduced contrastive language-image pre-training, demonstrating that vision-language models trained on large-scale web data acquire transferable visual representations applicable to diverse downstream tasks. SigLIP replaced CLIP’s softmax-based contrastive loss with a sigmoid loss, enabling more efficient batch construction and improved performance at equivalent model scales. The SigLIP-so400m variant, which we employ, uses a SoViT-400m architecture with 400 million parameters and processes images at 384×384 resolution, producing 1152-dimensional embeddings.
These models enable two capabilities critical to our pipeline: (1) high-quality frozen feature extraction, where the vision encoder produces discriminative embeddings without task-specific fine-tuning, and (2) zero-shot text-image similarity scoring, where arbitrary textual descriptions can be compared against image regions for attribute verification.
Grounding DINO combines the DINO self-supervised vision transformer with grounded pre-training to enable open-set object detection via textual prompts. Given an input image and a text description such as “buddhist deity . figure . statue,” the model localizes all matching regions with bounding boxes and confidence scores. This capability is essential for multi-figure thangka analysis, where the number and arrangement of figures varies across compositions.
Within Buddhist art specifically, the Tibetan Thangka dataset of Ma et al. established benchmarks for figure detection and segmentation. Tang and Xie achieved improved classification of Thangka cultural elements using self-supervised contrastive learning with multi-scale triplet attention. Castellano et al. demonstrated the value of combining knowledge graphs with deep learning for automated art analysis, an approach conceptually related to our iconography database integration.
The ICON ontology formalized artistic iconographic interpretation as a structured knowledge representation problem, establishing that visual symbols in art carry culturally-determined meanings that can be encoded as ontological relationships. Our iconography knowledge graph follows a similar principle, encoding the structured relationships between Buddhist deity identities and their defining visual attributes (e.g., Vajrapani is always blue, always holds a vajra, and displays a wrathful expression).
TermaVision processes an input image through multiple sequential stages. Each stage produces structured output that is presented to the user as a transparent process log, ensuring that every identification decision is explainable and auditable.
A critical practical challenge arises from the closed-set nature of the MLP classifier: when presented with an out-of-distribution input (e.g., a photograph of a person), the softmax output still sums to 1.0, often producing high-confidence predictions for incorrect classes. We address this with a zero-shot pre-filter leveraging SigLIP’s text-image alignment.
Six text prompts are encoded via the SigLIP text encoder and cached: four art prompts (“a thangka painting of a Buddhist deity,” “a Buddhist statue or sculpture,” “a Buddhist mural or wall painting,” “a drawing or illustration of a Buddhist figure”) and two person prompts (“a photograph of a real person,” “a photograph of a person in robes”). If the person score exceeds the art score, the pipeline returns “Unidentified” immediately, with the process log transparently indicating that the subject was classified as a real person rather than Buddhist artwork.
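The filter's decision rule can be sketched as follows. The six prompts are those listed above; aggregating each group by its maximum score is an assumption, since the paper does not specify how the per-prompt similarities are combined.

```python
# Zero-shot art-versus-person pre-filter (sketch). In the real pipeline the
# scores come from SigLIP text-image similarity against cached text embeddings.
ART_PROMPTS = [
    "a thangka painting of a Buddhist deity",
    "a Buddhist statue or sculpture",
    "a Buddhist mural or wall painting",
    "a drawing or illustration of a Buddhist figure",
]
PERSON_PROMPTS = [
    "a photograph of a real person",
    "a photograph of a person in robes",
]

def subject_filter(similarities):
    """Decide 'art' vs 'person' from per-prompt SigLIP similarity scores.

    `similarities` maps each cached prompt string to its similarity with
    the input image. Max-over-prompts aggregation is an assumption.
    """
    art_score = max(similarities[p] for p in ART_PROMPTS)
    person_score = max(similarities[p] for p in PERSON_PROMPTS)
    return "person" if person_score > art_score else "art"
```

When the result is "person", the pipeline returns "Unidentified" without invoking the classifier at all.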
For complex compositions containing multiple deity figures, Grounding DINO provides zero-shot bounding box detection using the text prompt “buddhist deity . figure . buddha . person . statue . wrathful deity . demon . protector” with a box threshold of 0.25 and text threshold of 0.20.
The decision to use multi-figure mode versus whole-image classification is governed by three conditions that must all be satisfied: (1) at least two confident detections (score ≥ 0.30), (2) at least one high-confidence detection (score ≥ 0.45), and (3) detections collectively cover more than 40% of the image area. When these conditions are not met, the system falls back to whole-image classification, which empirically achieves higher accuracy on images dominated by a single figure.
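The three-condition gate can be expressed compactly. One caveat: summing box areas is an approximation of "collectively cover", since overlapping boxes would require a union-area computation; the paper does not specify which is used.

```python
def use_multifigure_mode(detections, image_area):
    """Gate between multi-figure and whole-image classification.

    `detections` is a list of (score, box_area) pairs from the detector.
    All three conditions from the paper must hold.
    """
    confident = [(s, a) for s, a in detections if s >= 0.30]
    if len(confident) < 2:                        # (1) >= 2 confident boxes
        return False
    if not any(s >= 0.45 for s, _ in confident):  # (2) >= 1 high-confidence box
        return False
    coverage = sum(a for _, a in confident)       # (3) > 40% of image covered
    return coverage > 0.40 * image_area
```

Otherwise the system falls back to whole-image classification.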
Each image (or detected crop) is resized to 384×384 pixels and processed through the frozen SigLIP-so400m vision encoder to obtain a 1152-dimensional embedding e. This embedding is then classified by a three-layer MLP:

  z = W₃ · GELU(W₂ · GELU(W₁ e + b₁) + b₂) + b₃,  z ∈ ℝ^C,

where C = 93 is the number of output classes. Dropout is applied after each GELU activation (rates of 0.3 and 0.15 for the first and second hidden layers, respectively). The classifier contains approximately 1.2 million parameters, roughly 330× fewer than the 400-million-parameter frozen feature extractor.
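A shape-level sketch of the head follows. The hidden widths (768 and 384) are hypothetical choices that reproduce the stated ~1.2 M parameter budget; the paper fixes only the input (1152) and output (93) dimensions.

```python
import numpy as np

# Hypothetical hidden widths; only D_IN and C are given in the paper.
D_IN, H1, H2, C = 1152, 768, 384, 93

def gelu(x):
    """Tanh approximation of the GELU activation."""
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x**3)))

def mlp_forward(e, params):
    """Three-layer MLP over a SigLIP embedding `e` of shape (D_IN,)."""
    W1, b1, W2, b2, W3, b3 = params
    h = gelu(e @ W1 + b1)   # dropout p=0.30 would apply here at train time
    h = gelu(h @ W2 + b2)   # dropout p=0.15 would apply here at train time
    return h @ W3 + b3      # logits over the C = 93 classes

def n_params():
    """Parameter count for the head with these (assumed) widths."""
    return D_IN * H1 + H1 + H1 * H2 + H2 + H2 * C + C
```

With these widths the head totals 1,216,605 parameters, consistent with the ~1.2 M figure above.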
Training Data. The training set comprises approximately 5,500 images sourced from curated digital repositories of Buddhist art and locally collected images from museum catalogs and web sources. Images are filtered by classification type (deity, bodhisattva, dharma king, teacher, arhat, protector). A label normalization pipeline maps diverse naming conventions (e.g., “Guru Rinpoche,” “Padmasambhava,” “Padma Jungne”) to canonical class names via a manually curated mapping of over 200 entries.
Data Augmentation. Each training image produces five feature vectors through the following augmentation pipeline: (1) clean resize, (2) grayscale conversion, (3) color jitter (brightness ±0.5, contrast ±0.5, saturation ±0.7, hue ±0.15), (4) random resized crop (scale 0.6–1.0) with rotation (±20°) and random erasing, and (5) a combined harsh augmentation applying grayscale, color jitter, and horizontal flip. This 5× augmentation produces approximately 27,500 training vectors from 5,500 source images.
Class Weighting. To address severe class imbalance (the largest class contains 2,035 vectors while the smallest contains 20), we apply log-inverse-document-frequency weighting, where each class weight is computed as the log of the ratio of total samples to class samples, normalized to unit mean.
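The weighting scheme as described reduces to a few lines:

```python
import math

def log_idf_weights(class_counts):
    """Per-class loss weights w_c = log(N / n_c), normalized to unit mean.

    `class_counts` maps class name -> number of training vectors. A class
    with 2,035 vectors thus receives a much smaller weight than one with 20.
    """
    total = sum(class_counts.values())
    raw = {c: math.log(total / n) for c, n in class_counts.items()}
    mean = sum(raw.values()) / len(raw)
    return {c: w / mean for c, w in raw.items()}
```

These weights would typically be passed to the cross-entropy loss during MLP training.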
Training Procedure. The MLP is trained for 60 epochs using AdamW optimization with a learning rate of 1×10⁻³, weight decay of 0.01, and a linear warmup over the first 5 epochs. The training/validation split is 85%/15% with stratification to ensure every class is represented in both splits. The best model is selected by validation top-1 accuracy.
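The warmup schedule can be sketched as a per-epoch learning-rate function. The paper specifies only the linear warmup; holding the rate constant afterward is an assumption (a decay schedule may have been used in practice).

```python
def learning_rate(epoch, base_lr=1e-3, warmup_epochs=5):
    """Linear warmup over the first 5 epochs, then the base rate.

    Post-warmup behavior (constant here) is an assumption; the paper
    states only the warmup and the base learning rate.
    """
    if epoch < warmup_epochs:
        return base_lr * (epoch + 1) / warmup_epochs
    return base_lr
```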
After the MLP produces top-k predictions, the attribute verification module cross-checks each candidate against the iconography knowledge graph. For a predicted deity with known attributes (e.g., Vajrapani: blue body, wrathful expression, holds vajra), targeted text prompts are generated and scored against the image via SigLIP text-image similarity.
Prompt templates are designed for specific attribute categories: body color (“a Buddhist deity with [color] colored skin/body”), held objects (“a Buddhist deity holding a [object]”), crown (“a Buddhist deity wearing a [crown type]”), expression (“a [wrathful/peaceful] Buddhist deity”), and negative constraints (“a Buddhist deity without a [attribute]”). Each prompt produces a similarity score. Matched attributes accumulate verification evidence. When the verification evidence favors a lower-ranked MLP prediction over the top-1 prediction, reranking occurs.
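Prompt generation from a knowledge-graph entry might look like the sketch below. The template wording is paraphrased from the categories above; the exact strings passed to SigLIP's text encoder are not given in the paper.

```python
def verification_prompts(attrs):
    """Build SigLIP text prompts from a knowledge-graph entry's attributes.

    `attrs` is a dict of structured attributes for one deity, e.g. the
    Vajrapani entry: blue body, wrathful expression, holds a vajra.
    """
    prompts = []
    if "body_color" in attrs:
        prompts.append(f"a Buddhist deity with {attrs['body_color']} colored skin")
    for obj in attrs.get("held_objects", []):
        prompts.append(f"a Buddhist deity holding a {obj}")
    if "crown" in attrs:
        prompts.append(f"a Buddhist deity wearing a {attrs['crown']}")
    if "expression" in attrs:
        prompts.append(f"a {attrs['expression']} Buddhist deity")
    return prompts
```

Each returned prompt is scored against the image; matched attributes accumulate evidence for (or against) the candidate identity.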
Buddhist art frequently depicts figures in canonical groupings whose membership is doctrinally determined. We encode eight such groupings:
| Grouping | Members |
|---|---|
| Rigsum Gonpo (Three Protectors) | Avalokiteshvara, Manjushri, Vajrapani |
| Tse Lha Nam Sum (Three Long-Life Deities) | Amitayus, White Tara, Ushnishavijaya |
| Je Yab Sey Sum (Tsongkhapa Triad) | Tsongkhapa, Manjushri, Vajrapani |
| Khen-Lop-Cho Sum | Shantarakshita, Padmasambhava, Trisong Detsen |
| Eight Great Bodhisattvas | Avalokiteshvara, Manjushri, Vajrapani, Kshitigarbha, Maitreya, Samantabhadra, Akashagarbha, Sarvanirvarana-Vishkambhin |
| Six Ornaments of India | Nagarjuna, Aryadeva, Asanga, Vasubandhu, Dignaga, Dharmakirti |
| Five Dhyani Buddhas | Vairocana, Akshobhya, Ratnasambhava, Amitabha, Amoghasiddhi |
| Three Nyingma Protectors | Ekajati, Rahula, Dorje Legpa |
Table 1. Eight canonical Buddhist groupings encoded in the compositional reasoning module.
When the multi-figure pipeline detects an anchor deity (confidence > 70%) that belongs to a known grouping, companion classes receive a logit-space boost while known confusions are suppressed. For example, if Avalokiteshvara is detected at high confidence and an adjacent figure is ambiguously classified as either Manjushri or Maitreya, the Rigsum Gonpo grouping boosts Manjushri (the expected companion) relative to Maitreya.
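The companion boost can be sketched as below. The boost magnitude (1.5 logits) is a hypothetical value; the paper states only that companions are boosted and known confusions suppressed, not by how much.

```python
# Two of the eight canonical groupings from Table 1 (remainder elided).
GROUPINGS = {
    "Rigsum Gonpo": {"Avalokiteshvara", "Manjushri", "Vajrapani"},
    "Tse Lha Nam Sum": {"Amitayus", "White Tara", "Ushnishavijaya"},
}

def boost_companions(anchor, logits, class_index, boost=1.5):
    """Add a logit-space bonus to companion classes of a confident anchor.

    `anchor` is a deity identified above the 70% confidence threshold;
    `class_index` maps class names to logit positions. `boost` is a
    hypothetical magnitude.
    """
    for members in GROUPINGS.values():
        if anchor in members:
            for name in members - {anchor}:
                idx = class_index.get(name)
                if idx is not None:
                    logits[idx] += boost
    return logits
```

In the Rigsum Gonpo example above, this nudges an ambiguous Manjushri/Maitreya figure toward Manjushri once Avalokiteshvara is confidently anchored.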
The knowledge graph contains 557 deity entries organized by category with structured attributes at high coverage rates.
| Category | Count |
|---|---|
| Buddhas | 70 |
| Yidams (Meditational Deities) | 130 |
| Bodhisattvas | 39 |
| Protectors (Dharmapalas) | 72 |
| Masters / Mahasiddhas | 175 |
| Arhats | 18 |
| Dakinis | 8 |
| Stupas | 10 |
| Other (Wealth Deities, Guardians, etc.) | 35 |
| Total | 557 |
Table 2. Distribution of deity entries in the iconography knowledge graph by category.
| Attribute | Coverage |
|---|---|
| Body color | 530/557 (95%) |
| Number of arms | 484/557 (87%) |
| Posture | 481/557 (86%) |
| Number of faces | 468/557 (84%) |
| Expression | 454/557 (82%) |
| Crown type | 422/557 (76%) |
| Held objects | 409/557 (73%) |
| Mudras | 261/557 (47%) |
| Hard identifiers | 20/557 (4%) |
Table 3. Attribute coverage in the iconography knowledge graph.
When the MLP classifier produces ambiguous or low-confidence predictions, a nearest-neighbor visual memory module provides a secondary identification signal. The module loads the full training feature cache (~27,500 normalized 1152-dimensional vectors) and computes cosine similarity between the query embedding and all cached embeddings. If the maximum similarity exceeds a threshold of 0.88, the top-3 nearest training examples are returned as advisory matches. Memory recall operates strictly as a secondary signal — it does not override confident pipeline predictions but provides additional evidence when the primary classifier is uncertain.
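Because the cached vectors are normalized, the recall step reduces to a matrix-vector product and a threshold check:

```python
import numpy as np

def memory_recall(query, bank, labels, threshold=0.88, k=3):
    """Advisory nearest-neighbor lookup over the cached training features.

    `query` and the rows of `bank` are assumed L2-normalized SigLIP
    embeddings, so cosine similarity reduces to a dot product. Returns
    an empty list when the best match falls below the 0.88 threshold.
    """
    sims = bank @ query
    if sims.max() < threshold:
        return []
    top = np.argsort(sims)[::-1][:k]
    return [(labels[i], float(sims[i])) for i in top]
```

In production the bank holds ~27,500 rows of 1152 dimensions; the toy example below uses 2-D vectors for illustration only.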
Beyond deity identification, TermaVision provides art-historical context through zero-shot provenance analysis. Using SigLIP text-image similarity with curated prompt sets, the provenance module classifies three facets of the artwork.
The iconometry module evaluates the proportional quality of the depicted figure based on the canonical Tibetan measurement system codified in texts such as Gega Lama’s Principles of Tibetan Art and Desi Sangye Gyatso’s 17th-century Handbook of Tibetan Iconometry. The Angula system prescribes distinct proportion models for different classes of figures:
| Figure Type | Face-Lengths | Angulas | Typical Subjects |
|---|---|---|---|
| Buddha | 10 | 120 | Shakyamuni, Amitabha, Medicine Buddha |
| Bodhisattva | 9 | 108 | Avalokiteshvara, Manjushri, Tara |
| Goddess / Dakini | 8 | 96 | Vajrayogini, Saraswati, Kurukulla |
| Wrathful Deity | 6 | 72 | Mahakala, Vajrabhairava, Hayagriva |
Table 3a. Canonical Angula proportion models used in iconometric assessment.
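A minimal sketch of the proportion check follows. The canonical values come from Table 3a (note that every model assigns 12 angulas per face-length, e.g. 120/10 and 72/6); the ±0.5 face-length acceptance tolerance is a hypothetical parameter, not stated in the paper.

```python
# Canonical (face-lengths, angulas) models from Table 3a.
ANGULA_MODELS = {
    "buddha": (10, 120),
    "bodhisattva": (9, 108),
    "goddess": (8, 96),
    "wrathful": (6, 72),
}

def assess_proportions(figure_type, measured_face_lengths, tolerance=0.5):
    """Compare measured body height (in face-lengths) to the canonical model.

    `tolerance` is a hypothetical acceptance band. Returns a pair of
    (within_canon, deviation_in_face_lengths).
    """
    canonical_faces, _ = ANGULA_MODELS[figure_type]
    deviation = abs(measured_face_lengths - canonical_faces)
    return deviation <= tolerance, deviation
```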
The regional style module classifies the artistic school and approximate era of the artwork, again using zero-shot SigLIP text-image similarity. Seven painting traditions are distinguished.
Era classification spans four periods: pre-15th century (Pala Indian influence), 15th–17th century (classical golden age), 18th–19th century (mature Karma Gadri with Qing influence), and 20th century–modern (synthetic pigments, revival styles).
The training dataset comprises images from two sources: (1) curated digital repositories (approximately 12,000 images, filtered to approximately 4,800 images across 93 classes after label normalization) and (2) locally collected images (approximately 700 images from museum catalogs, web sources, and direct photography). The combined dataset of approximately 5,500 images yields 27,500 feature vectors after 5× augmentation. The class distribution is heavily imbalanced: the largest class contains 407 source images while the smallest contains 4, with a mean of 59 images per class.
| Metric | Score |
|---|---|
| Top-1 Accuracy | 94.4% |
| Top-3 Accuracy | 98.3% |
| Top-5 Accuracy | 99.5% |
| Number of Classes | 93 |
| Training Vectors | ~23,400 (85%) |
| Validation Vectors | ~4,100 (15%) |
Table 4. Classification performance on the validation set.
The training curve shows rapid convergence in the first 10 epochs followed by gradual improvement, reaching peak validation accuracy at epoch 56.
Prior to developing TermaVision v4, the Terma Heritage Foundation deployed a GPT-4o-based identification system (v3) that processes images through the Azure OpenAI API with a detailed prompt containing iconographic instructions.
| Metric | GPT-4o (v3) | TermaVision (v4) |
|---|---|---|
| Inference time | 8–35 seconds | 1.3–4 seconds |
| Cost per image | $0.01–0.03 | Free (local GPU) |
| Top-1 accuracy (common deities) | Higher | Lower |
| Top-1 accuracy (rare/specific figures) | Lower | Higher |
| Multi-figure detection | Descriptive (no boxes) | Bounding boxes + per-figure ID |
| Offline capable | No | Yes |
| Explainability | Natural language | Structured attributes + scores |
| Scalability | Rate-limited, costly | Unlimited, free |
Table 5. Comparison between the GPT-4o baseline (v3) and the proposed TermaVision pipeline (v4).
The GPT-4o system excels at common deities with extensive internet documentation (Shakyamuni Buddha, Green Tara, Padmasambhava) but demonstrates systematic failures on specific Tibetan historical figures. For example, GPT-4o consistently misidentifies Marpa as Milarepa — a confusion that the MLP classifier, trained on curated iconographic data, avoids entirely. Conversely, the GPT-4o system better handles deities not present in the 93-class training set, benefiting from its broad pre-training knowledge.
The zero-shot art-versus-person filter was evaluated on a qualitative test set comprising photographs of Buddhist artwork (thangkas, statues, murals) and photographs of Buddhist monks and teachers. Without the filter, the MLP classifier assigns high-confidence labels to photographs of real people (e.g., classifying a photograph of a Tibetan lama as “Stupa – Descent From Heaven” at 98% confidence). The zero-shot filter correctly distinguishes photographs of artwork from photographs of people.
| Configuration | Effect |
|---|---|
| SigLIP + MLP only | 94.4% base accuracy |
| + Zero-shot verification | Reranking on ~3% of images; corrects attribute mismatches |
| + Compositional reasoning | Targeted corrections in multi-figure scenes |
| + Subject filter | Eliminates false positives on non-art inputs |
| + Memory recall | Advisory signal for uncertain predictions (cosine ≥ 0.88) |
| + Knowledge graph enrichment | Adds explainability metadata; no accuracy change |
| + Provenance / Iconometry / Style | Adds art-historical context (zero-shot); no accuracy change |
Table 6. Ablation study showing the contribution of each pipeline component.
The primary strength of TermaVision lies in its combination of high accuracy (94.4% top-1) with full explainability and holistic art-historical analysis. Each identification is accompanied by a structured process log showing the SigLIP embedding, MLP classification scores, zero-shot verification results, knowledge graph attributes, confidence calibration, provenance assessment, iconometric analysis, and regional style classification. This transparency is essential for scholarly applications where users must evaluate both the identity of a figure and the art-historical context of the work.
The system faces several limitations. First, the 93-class training set covers only a fraction of the estimated 557+ distinct iconographic forms in the Tibetan Buddhist tradition. The zero-shot fallback partially addresses this gap by comparing against text embeddings of all 557 deities, but text-only embeddings cannot capture the visual subtleties that distinguish closely related forms. Second, multi-figure detection relies on Grounding DINO, which struggles with wrathful deities whose visual features (flames, multiple heads, distorted proportions) diverge significantly from the “person” and “figure” concepts in the model’s training data. Third, the system was designed for Buddhist art and should not be applied to identify living persons — a constraint enforced by the subject filter but not guaranteed against adversarial inputs.
An instructive failure case arose when we attempted to add historical Buddhist masters (the 14th Dalai Lama, the 5th Dalai Lama, Situ Panchen Chokyi Jungne) to the training set. With only 3–4 master classes trained on 20–40 images each, the model exhibited a “magnet effect”: every photograph of any Tibetan monk or lama was confidently classified as one of the few known masters. This occurred because: (1) photographs of monks share strong visual features (maroon robes, yellow shirt, seated posture) that dominate over the subtle facial differences distinguishing individuals, and (2) the softmax classifier must distribute probability mass across known classes, creating false confidence when the true class is absent from the training set.
This experience informed two design decisions: (1) historical masters were removed from the training set until a critical mass of diverse master classes (estimated at 15+ individuals with 20+ images each) can be collected, and (2) the zero-shot subject filter was implemented to prevent photographs of real people from reaching the classifier at all.
TermaVision represents a step toward democratizing access to Buddhist iconographic expertise. By automating the initial identification of deity figures — with transparent evidence supporting each determination — the system can assist museum curators in cataloging uncategorized collections, support scholars in analyzing compositional programs across multiple artworks, and enable practitioners to identify figures in monastery murals and personal shrines. The structured knowledge graph format facilitates integration with existing cultural heritage databases and linked data initiatives.
We presented TermaVision, a multi-stage deep learning pipeline for automated Buddhist iconography identification that achieves 94.4% top-1 accuracy across 93 classes while providing structured explainability through a 557-deity knowledge graph. The system processes images in 1.3–4 seconds on a consumer GPU, operates entirely offline, and costs nothing per inference — representing a significant practical improvement over the prior GPT-4o-based approach.
Key architectural decisions — frozen SigLIP features with a lightweight MLP head, zero-shot attribute verification, compositional reasoning, a subject detection pre-filter, and zero-shot artifact analysis (provenance, iconometry, regional style) — address the unique challenges of Buddhist iconography: extreme intra-class variation, subtle inter-class distinctions, compositional context dependencies, and the requirement for scholarly explainability. The zero-shot artifact analysis modules provide art-historical context — medium classification, proportional assessment via the traditional Angula system, and regional school attribution — without requiring any additional training data.
Future work will focus on expanding the training set beyond 93 classes toward comprehensive coverage of the 557-deity knowledge graph, improving wrathful deity detection through specialized prompting or fine-tuning of the detection model, and developing a collaborative annotation platform to engage the scholarly community in curating training data.
Conceptualization, methodology, software, validation, data curation, and writing by Thupten N. Chakrishar. This research received no external funding. Development was supported by the Terma Heritage Foundation, Inc.