Terma Heritage Foundation
April 11, 2026 · Program Updates

TermaVision: A Multi-Stage Deep Learning Pipeline for Automated Buddhist Iconography Identification

Thupten N. Chakrishar

Terma Heritage Foundation, Inc., New York, NY, USA

Correspondence: info@termafoundation.org

Abstract

Buddhist art identification — the task of determining which deity, bodhisattva, or sacred figure is depicted in a thangka painting, statue, or mural — remains a specialized skill requiring years of training in Tibetan iconography. This paper presents TermaVision, an automated multi-stage pipeline that combines frozen vision-language features, a lightweight classifier, zero-shot attribute verification, and a structured iconography knowledge graph to identify Buddhist figures with 94.4% top-1 accuracy across 93 classes. The system employs SigLIP-so400m as a frozen feature extractor, feeding 1152-dimensional embeddings into a three-layer MLP classifier trained on approximately 5,500 images with 5× data augmentation. A zero-shot subject filter using SigLIP’s text encoder distinguishes Buddhist artwork from photographs of real people, preventing confident misclassification on out-of-distribution inputs. For multi-figure compositions, Grounding DINO provides zero-shot figure detection, and a compositional reasoning module encodes canonical Buddhist groupings (e.g., Rigsum Gonpo, Five Dhyani Buddhas) to correct cross-figure misidentifications. Attribute verification via SigLIP text-image similarity cross-checks predicted identities against a knowledge graph of 557 deities with structured iconographic attributes (body color, number of arms, held objects, mudras). Beyond identification, the pipeline includes zero-shot provenance classification, iconometric proportion assessment based on the traditional Tibetan Angula system, regional artistic style classification, and nearest-neighbor visual memory recall for uncertainty reduction. The complete pipeline processes a single image in 1.3–4 seconds on a consumer GPU.

Keywords: Buddhist art identification; deep learning; zero-shot classification; knowledge graph; iconography; cultural heritage; vision-language models; thangka

1. Introduction

Tibetan Buddhist art encompasses one of the world’s most complex iconographic traditions. A single thangka painting may contain dozens of distinct deity figures, each identified by a precise combination of body color, number of arms and faces, hand gestures (mudras), held objects (such as vajras, lotuses, or skull cups), crown type, posture, and accompanying consort or mount. Correct identification requires extensive training in Tibetan Buddhist iconography — knowledge that is concentrated among a diminishing number of scholars and practitioners worldwide.

While several digital repositories of Buddhist art exist, expert curation remains a bottleneck: the vast majority of Buddhist artworks in private collections, monasteries, and museums remain unidentified or misidentified. This gap in expert identification represents a significant barrier to cultural heritage preservation, provenance research, and scholarly analysis.

Recent advances in vision-language models have demonstrated remarkable zero-shot capabilities across diverse visual domains. However, Buddhist iconography presents unique challenges that distinguish it from standard image classification: (1) extreme intra-class variation, as the same deity can appear in peaceful, semi-wrathful, and wrathful forms across painting, statue, and mural media; (2) subtle inter-class distinctions, where deities are differentiated by specific combinations of attributes rather than gross visual differences; (3) compositional context dependencies, where the identity of one figure constrains the identity of adjacent figures in known groupings; and (4) the critical requirement for explainability, as scholars require not merely a label but evidence supporting the identification.

Several prior works have explored deep learning for Thangka analysis. Ma et al. constructed a Tibetan Thangka dataset and addressed tasks including figure detection and segmentation. Tang and Xie proposed a contrastive learning approach for classifying Thangka cultural elements. Xue et al. applied DenseNet with squeeze-and-excitation networks for Yidam (meditational deity) classification. However, these works focus on narrow subsets of the iconographic tradition and do not address the full pipeline of detection, classification, attribute verification, and compositional reasoning required for practical deployment.

In this paper, we present TermaVision, a comprehensive multi-stage pipeline designed for production use by the Terma Heritage Foundation. Our contributions are:

  1. A multi-stage identification pipeline combining frozen vision-language features (SigLIP), zero-shot object detection (Grounding DINO), a trained MLP classifier, zero-shot attribute verification, and compositional reasoning, achieving 94.4% top-1 accuracy on 93 classes spanning deities, bodhisattvas, protectors, arhats, historical masters, and architectural forms (stupas).
  2. A structured iconography knowledge graph containing 557 deity entries with nine attribute categories (body color, expression, number of arms, number of faces, posture, crown type, held objects, mudras, and hard identifiers), enabling explainable verification of predicted identities.
  3. A zero-shot art-versus-person subject filter that prevents confident misclassification of out-of-distribution inputs (photographs of real people) by leveraging SigLIP’s text-image alignment capabilities.
  4. A compositional reasoning module encoding canonical Buddhist figure groupings that applies logit-space adjustments to correct cross-figure misidentifications in multi-figure compositions.
  5. Zero-shot artifact analysis modules — provenance classification (medium, age, quality tier), iconometric proportion assessment based on the traditional Tibetan Angula measurement system, and regional artistic style classification — providing holistic art-historical context beyond identity alone.
  6. A nearest-neighbor visual memory recall mechanism that queries the training feature cache for uncertainty reduction on ambiguous classifications.
  7. An analysis of the practical trade-offs between a large language model baseline (GPT-4o) and the proposed specialized pipeline across speed, cost, accuracy, and explainability dimensions.

2. Related Work

2.1. Vision-Language Models for Visual Recognition

The CLIP model introduced contrastive language-image pre-training, demonstrating that vision-language models trained on large-scale web data acquire transferable visual representations applicable to diverse downstream tasks. SigLIP replaced CLIP’s softmax-based contrastive loss with a sigmoid loss, enabling more efficient batch construction and improved performance at equivalent model scales. The SigLIP-so400m variant, which we employ, uses a SoViT-400m architecture with 400 million parameters and processes images at 384×384 resolution, producing 1152-dimensional embeddings.

These models enable two capabilities critical to our pipeline: (1) high-quality frozen feature extraction, where the vision encoder produces discriminative embeddings without task-specific fine-tuning, and (2) zero-shot text-image similarity scoring, where arbitrary textual descriptions can be compared against image regions for attribute verification.

2.2. Zero-Shot Object Detection

Grounding DINO combines the DINO self-supervised vision transformer with grounded pre-training to enable open-set object detection via textual prompts. Given an input image and a text description such as “buddhist deity . figure . statue,” the model localizes all matching regions with bounding boxes and confidence scores. This capability is essential for multi-figure thangka analysis, where the number and arrangement of figures varies across compositions.

2.3. Deep Learning for Cultural Heritage

Within Buddhist art specifically, the Tibetan Thangka dataset of Ma et al. established benchmarks for figure detection and segmentation. Tang and Xie achieved improved classification of Thangka cultural elements using self-supervised contrastive learning with multi-scale triplet attention. Castellano et al. demonstrated the value of combining knowledge graphs with deep learning for automated art analysis, an approach conceptually related to our iconography database integration.

2.4. Knowledge Graphs for Iconography

The ICON ontology formalized artistic iconographic interpretation as a structured knowledge representation problem, establishing that visual symbols in art carry culturally-determined meanings that can be encoded as ontological relationships. Our iconography knowledge graph follows a similar principle, encoding the structured relationships between Buddhist deity identities and their defining visual attributes (e.g., Vajrapani is always blue, always holds a vajra, and displays a wrathful expression).

3. Methods

3.1. System Overview

TermaVision processes an input image through multiple sequential stages. Each stage produces structured output that is presented to the user as a transparent process log, ensuring that every identification decision is explainable and auditable.

Figure 1. System Architecture
Stage 0
Subject Filter (Zero-Shot)
SigLIP text-image similarity — "Buddhist art" vs "photograph of a real person"
↓
Stage 1
Figure Detection
Grounding DINO zero-shot object detection → bounding boxes + confidence scores
↓
Stage 2
Feature Extraction + Classification
SigLIP-so400m frozen encoder → 1152-dim embeddings → 3-layer MLP (1152 → 768 → 384 → 93)
↓
Stage 3
Zero-Shot Verification
Attribute verification via SigLIP text encoder + compositional reasoning (8 canonical groupings)
↓
Stage 4
Validation + Knowledge Graph
Confidence calibration, memory recall, iconography DB lookup (557 deities, 9 attribute categories)
↓
Stage 5
Artifact Analysis (Zero-Shot)
Provenance classification, iconometric assessment (Angula system), regional style (7 traditions)

3.2. Subject Detection Pre-Filter

A critical practical challenge arises from the closed-set nature of the MLP classifier: when presented with an out-of-distribution input (e.g., a photograph of a person), the softmax output still sums to 1.0, often producing high-confidence predictions for incorrect classes. We address this with a zero-shot pre-filter leveraging SigLIP’s text-image alignment.

Six text prompts are encoded via the SigLIP text encoder and cached: four art prompts (“a thangka painting of a Buddhist deity,” “a Buddhist statue or sculpture,” “a Buddhist mural or wall painting,” “a drawing or illustration of a Buddhist figure”) and two person prompts (“a photograph of a real person,” “a photograph of a person in robes”). If the person score exceeds the art score, the pipeline returns “Unidentified” immediately, with the process log transparently indicating that the subject was classified as a real person rather than Buddhist artwork.
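The filter's decision rule reduces to a comparison of maximum prompt similarities. A minimal NumPy sketch, assuming the six prompt embeddings have been precomputed once with the SigLIP text encoder and L2-normalized; the `subject_filter` helper and placeholder embeddings are illustrative, not the production code:

```python
import numpy as np

# Prompt texts follow Section 3.2; embeddings are assumed precomputed.
ART_PROMPTS = [
    "a thangka painting of a Buddhist deity",
    "a Buddhist statue or sculpture",
    "a Buddhist mural or wall painting",
    "a drawing or illustration of a Buddhist figure",
]
PERSON_PROMPTS = [
    "a photograph of a real person",
    "a photograph of a person in robes",
]

def subject_filter(image_emb: np.ndarray,
                   art_embs: np.ndarray,
                   person_embs: np.ndarray) -> str:
    """Return 'person' or 'art' by comparing the maximum cosine
    similarity against each prompt group (all inputs unit-normalized,
    so the dot product is the cosine similarity)."""
    art_score = float(np.max(art_embs @ image_emb))
    person_score = float(np.max(person_embs @ image_emb))
    return "person" if person_score > art_score else "art"
```

When `subject_filter` returns "person", the pipeline short-circuits to "Unidentified" before the closed-set classifier ever runs.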

3.3. Multi-Figure Detection

For complex compositions containing multiple deity figures, Grounding DINO provides zero-shot bounding box detection using the text prompt “buddhist deity . figure . buddha . person . statue . wrathful deity . demon . protector” with a box threshold of 0.25 and text threshold of 0.20.

The decision to use multi-figure mode versus whole-image classification is governed by three conditions that must all be satisfied: (1) at least two confident detections (score ≥ 0.30), (2) at least one high-confidence detection (score ≥ 0.45), and (3) detections collectively cover more than 40% of the image area. When these conditions are not met, the system falls back to whole-image classification, which empirically achieves higher accuracy on images dominated by a single figure.
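The gating logic can be sketched as a pure function. `Detection` is a hypothetical container for Grounding DINO outputs, and summing per-box area fractions approximates coverage without handling box overlap:

```python
from dataclasses import dataclass

@dataclass
class Detection:
    score: float      # Grounding DINO confidence
    area_frac: float  # box area as a fraction of total image area

def use_multi_figure_mode(dets,
                          min_conf: float = 0.30,
                          high_conf: float = 0.45,
                          min_coverage: float = 0.40) -> bool:
    """All three conditions from Section 3.3 must hold; otherwise the
    pipeline falls back to whole-image classification."""
    confident = [d for d in dets if d.score >= min_conf]
    if len(confident) < 2:                                # condition (1)
        return False
    if not any(d.score >= high_conf for d in confident):  # condition (2)
        return False
    coverage = sum(d.area_frac for d in confident)        # condition (3)
    return coverage > min_coverage
```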

3.4. Feature Extraction and Classification

Each image (or detected crop) is resized to 384×384 pixels and processed through the frozen SigLIP-so400m vision encoder to obtain a 1152-dimensional embedding. This embedding is then classified by a three-layer MLP:

f(x) = Linear_{384→C}( GELU( Linear_{768→384}( GELU( Linear_{1152→768}( LayerNorm(x) ) ) ) ) )

where C = 93 is the number of output classes. Dropout is applied after each GELU activation (rates of 0.3 and 0.15 for the first and second hidden layers, respectively). The classifier contains approximately 1.2 million parameters — three orders of magnitude fewer than the frozen feature extractor.
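A NumPy sketch of the classifier head at inference time (dropout inactive), using random placeholder weights where a trained model would load real ones; the parameter arithmetic reproduces the ~1.2 million figure:

```python
import numpy as np

rng = np.random.default_rng(0)
DIMS = [1152, 768, 384, 93]  # SigLIP embedding -> hidden -> hidden -> classes

def gelu(x):
    # tanh approximation of GELU
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2 / np.pi) * (x + 0.044715 * x**3)))

def layer_norm(x, eps=1e-6):
    return (x - x.mean()) / np.sqrt(x.var() + eps)

# Placeholder weights; shapes follow the 1152 -> 768 -> 384 -> 93 stack.
Ws = [rng.normal(0, 0.02, (o, i)) for i, o in zip(DIMS[:-1], DIMS[1:])]
bs = [np.zeros(o) for o in DIMS[1:]]

def classify(embedding):
    """Forward pass: LayerNorm -> Linear+GELU -> Linear+GELU -> Linear.
    Dropout (0.3 / 0.15 after the GELUs) is active only during training."""
    h = layer_norm(embedding)
    h = gelu(Ws[0] @ h + bs[0])
    h = gelu(Ws[1] @ h + bs[1])
    return Ws[2] @ h + bs[2]  # logits over the 93 classes

# Weight + bias parameters plus the LayerNorm scale and shift vectors.
n_params = sum(W.size + b.size for W, b in zip(Ws, bs)) + 2 * DIMS[0]
```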

Training Data. The training set comprises approximately 5,500 images sourced from curated digital repositories of Buddhist art and locally collected images from museum catalogs and web sources. Images are filtered by classification type (deity, bodhisattva, dharma king, teacher, arhat, protector). A label normalization pipeline maps diverse naming conventions (e.g., “Guru Rinpoche,” “Padmasambhava,” “Padma Jungne”) to canonical class names via a manually curated mapping of over 200 entries.

Data Augmentation. Each training image produces five feature vectors through the following augmentation pipeline: (1) clean resize, (2) grayscale conversion, (3) color jitter (brightness ±0.5, contrast ±0.5, saturation ±0.7, hue ±0.15), (4) random resized crop (scale 0.6–1.0) with rotation (±20°) and random erasing, and (5) a combined harsh augmentation applying grayscale, color jitter, and horizontal flip. This 5× augmentation produces approximately 27,500 training vectors from 5,500 source images.

Class Weighting. To address severe class imbalance (the largest class contains 2,035 vectors while the smallest contains 20), we apply log-inverse-document-frequency weighting, where each class weight is computed as the log of the ratio of total samples to class samples, normalized to unit mean.
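The described weighting reduces to a few lines; a minimal sketch:

```python
import numpy as np

def log_idf_weights(counts):
    """Per-class weight = log(total_samples / class_samples),
    normalized so the weights have unit mean, as used to handle
    the 2,035-vs-20 vector imbalance."""
    counts = np.asarray(counts, dtype=float)
    w = np.log(counts.sum() / counts)
    return w / w.mean()
```

Rarer classes receive larger loss weights, but the log damps the ratio so a 100:1 imbalance does not translate into a 100:1 weight.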

Training Procedure. The MLP is trained for 60 epochs using AdamW optimization with a learning rate of 1×10⁻³, weight decay of 0.01, and a linear warmup over the first 5 epochs. The training/validation split is 85%/15% with stratification to ensure every class is represented in both splits. The best model is selected by validation top-1 accuracy.
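The warmup schedule can be sketched as follows; the text specifies only a linear warmup over the first 5 of 60 epochs, so the post-warmup rate is assumed constant here:

```python
def learning_rate(epoch: int, base_lr: float = 1e-3, warmup: int = 5) -> float:
    """Linear warmup to base_lr over the first `warmup` epochs, then
    the base rate (any post-warmup decay is not specified in the text)."""
    if epoch < warmup:
        return base_lr * (epoch + 1) / warmup
    return base_lr
```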

3.5. Zero-Shot Attribute Verification

After the MLP produces top-k predictions, the attribute verification module cross-checks each candidate against the iconography knowledge graph. For a predicted deity with known attributes (e.g., Vajrapani: blue body, wrathful expression, holds vajra), targeted text prompts are generated and scored against the image via SigLIP text-image similarity.

Prompt templates are designed for specific attribute categories: body color (“a Buddhist deity with [color] colored skin/body”), held objects (“a Buddhist deity holding a [object]”), crown (“a Buddhist deity wearing a [crown type]”), expression (“a [wrathful/peaceful] Buddhist deity”), and negative constraints (“a Buddhist deity without a [attribute]”). Each prompt produces a similarity score. Matched attributes accumulate verification evidence. When the verification evidence favors a lower-ranked MLP prediction over the top-1 prediction, reranking occurs.
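The verification-and-rerank step might look like the following sketch, where `score_fn` stands in for SigLIP text-image similarity against the fixed query image; the template strings, evidence averaging, and 0.05 rerank margin are illustrative assumptions:

```python
# Hypothetical prompt templates in the style of Section 3.5.
TEMPLATES = {
    "body_color": "a Buddhist deity with {} colored skin",
    "held_object": "a Buddhist deity holding a {}",
    "crown": "a Buddhist deity wearing a {}",
    "expression": "a {} Buddhist deity",
}

def verify(candidates, knowledge_graph, score_fn, margin=0.05):
    """Score each top-k candidate's known attributes against the image
    and rerank when a lower-ranked candidate accumulates clearly more
    verification evidence than the top-1 prediction."""
    evidence = []
    for name, _mlp_prob in candidates:
        attrs = knowledge_graph.get(name, {})
        scores = [score_fn(TEMPLATES[k].format(v))
                  for k, v in attrs.items() if k in TEMPLATES]
        evidence.append(sum(scores) / len(scores) if scores else 0.0)
    best = max(range(len(candidates)), key=lambda i: evidence[i])
    if best != 0 and evidence[best] > evidence[0] + margin:
        return [candidates[best]] + [c for i, c in enumerate(candidates)
                                     if i != best]
    return candidates
```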

3.6. Compositional Reasoning

Buddhist art frequently depicts figures in canonical groupings whose membership is doctrinally determined. We encode eight such groupings:

| Grouping | Members |
| --- | --- |
| Rigsum Gonpo (Three Protectors) | Avalokiteshvara, Manjushri, Vajrapani |
| Tse Lha Nam Sum (Three Long-Life Deities) | Amitayus, White Tara, Ushnishavijaya |
| Je Yab Sey Sum (Tsongkhapa Triad) | Tsongkhapa, Manjushri, Vajrapani |
| Khen-Lop-Cho Sum | Shantarakshita, Padmasambhava, Trisong Detsen |
| Eight Great Bodhisattvas | Avalokiteshvara, Manjushri, Vajrapani, Kshitigarbha, Maitreya, Samantabhadra, Akashagarbha, Sarvanirvarana-Vishkambhin |
| Six Ornaments of India | Nagarjuna, Aryadeva, Asanga, Vasubandhu, Dignaga, Dharmakirti |
| Five Dhyani Buddhas | Vairocana, Akshobhya, Ratnasambhava, Amitabha, Amoghasiddhi |
| Three Nyingma Protectors | Ekajati, Rahula, Dorje Legpa |

Table 1. Eight canonical Buddhist groupings encoded in the compositional reasoning module.

When the multi-figure pipeline detects an anchor deity (confidence > 70%) that belongs to a known grouping, companion classes receive a logit-space boost while known confusions are suppressed. For example, if Avalokiteshvara is detected at high confidence and an adjacent figure is ambiguously classified as either Manjushri or Maitreya, the Rigsum Gonpo grouping boosts Manjushri (the expected companion) relative to Maitreya.
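A minimal sketch of the logit-space adjustment, using two of the Table 1 groupings; the boost value is illustrative, not the production setting:

```python
import numpy as np

# Subset of the eight canonical groupings from Table 1.
GROUPINGS = {
    "Rigsum Gonpo": {"Avalokiteshvara", "Manjushri", "Vajrapani"},
    "Five Dhyani Buddhas": {"Vairocana", "Akshobhya", "Ratnasambhava",
                            "Amitabha", "Amoghasiddhi"},
}

def boost_companions(logits, class_names, anchor, anchor_conf,
                     boost=1.5, min_anchor_conf=0.70):
    """If a confidently detected anchor belongs to a known grouping,
    add `boost` to the logits of its doctrinal companions in the
    adjacent figure's prediction."""
    if anchor_conf <= min_anchor_conf:
        return logits
    out = logits.copy()
    for members in GROUPINGS.values():
        if anchor in members:
            for i, name in enumerate(class_names):
                if name in members and name != anchor:
                    out[i] += boost
    return out
```

In the Rigsum Gonpo example from the text, a confident Avalokiteshvara anchor lifts Manjushri's logit while leaving Maitreya untouched.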

3.7. Iconography Knowledge Graph

The knowledge graph contains 557 deity entries organized by category with structured attributes at high coverage rates.

| Category | Count |
| --- | --- |
| Buddhas | 70 |
| Yidams (Meditational Deities) | 130 |
| Bodhisattvas | 39 |
| Protectors (Dharmapalas) | 72 |
| Masters / Mahasiddhas | 175 |
| Arhats | 18 |
| Dakinis | 8 |
| Stupas | 10 |
| Other (Wealth Deities, Guardians, etc.) | 35 |
| Total | 557 |

Table 2. Distribution of deity entries in the iconography knowledge graph by category.

| Attribute | Coverage |
| --- | --- |
| Body color | 530/557 (95%) |
| Number of arms | 484/557 (87%) |
| Posture | 481/557 (86%) |
| Number of faces | 468/557 (84%) |
| Expression | 454/557 (82%) |
| Crown type | 422/557 (76%) |
| Held objects | 409/557 (73%) |
| Mudras | 261/557 (47%) |
| Hard identifiers | 20/557 (4%) |
Table 3. Attribute coverage in the iconography knowledge graph.

3.8. Visual Memory Recall

When the MLP classifier produces ambiguous or low-confidence predictions, a nearest-neighbor visual memory module provides a secondary identification signal. The module loads the full training feature cache (~27,500 normalized 1152-dimensional vectors) and computes cosine similarity between the query embedding and all cached embeddings. If the maximum similarity exceeds a threshold of 0.88, the top-3 nearest training examples are returned as advisory matches. Memory recall operates strictly as a secondary signal — it does not override confident pipeline predictions but provides additional evidence when the primary classifier is uncertain.
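The recall step is a thresholded nearest-neighbor lookup; a sketch assuming the cache rows and the query are L2-normalized:

```python
import numpy as np

def memory_recall(query, cache, labels, threshold=0.88, k=3):
    """Cosine similarity of a unit-normalized query against the
    normalized training cache. Return the top-k (label, similarity)
    pairs only if the best match clears the threshold; otherwise
    return None, since recall is strictly an advisory signal."""
    sims = cache @ query  # rows of `cache` are unit vectors
    if sims.max() < threshold:
        return None
    top = np.argsort(sims)[::-1][:k]
    return [(labels[i], float(sims[i])) for i in top]
```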

3.9. Zero-Shot Provenance Classification

Beyond deity identification, TermaVision provides art-historical context through zero-shot provenance analysis. Using SigLIP text-image similarity with curated prompt sets, the provenance module classifies three facets of the artwork:

  1. Medium: painting (thangka), statue (bronze, gilt metal), stone relief, mural, print (modern reproduction), textile (silk appliqué), or woodblock print.
  2. Provenance: antique (pre-20th century original), modern print (mechanical reproduction), hand-painted (contemporary traditional), or undetermined.
  3. Quality tier (Thigse): Following the Tibetan classification system, artworks are assessed as Thig-zang (court quality, with fine gold line work and precise proportions) or Thig-ngen (folk art, with simpler execution and regional character).

3.10. Iconometric Assessment

The iconometry module evaluates the proportional quality of the depicted figure based on the canonical Tibetan measurement system codified in texts such as Gega Lama’s Principles of Tibetan Art and Desi Sangye Gyatso’s 17th-century Handbook of Tibetan Iconometry. The Angula system prescribes distinct proportion models for different classes of figures:

| Figure Type | Face-Lengths | Angulas | Typical Subjects |
| --- | --- | --- | --- |
| Buddha | 10 | 120 | Shakyamuni, Amitabha, Medicine Buddha |
| Bodhisattva | 9 | 108 | Avalokiteshvara, Manjushri, Tara |
| Goddess / Dakini | 8 | 96 | Vajrayogini, Saraswati, Kurukulla |
| Wrathful Deity | 6 | 72 | Mahakala, Vajrabhairava, Hayagriva |

Table 3a. Canonical Angula proportion models used in iconometric assessment.
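The canonical models in Table 3a can be encoded directly; `assess_proportions` and its 5% tolerance are hypothetical illustrations of how a measured figure might be graded against its canon (12 angulas per face-length):

```python
# Canonical Angula proportion models from Table 3a.
ANGULA_MODELS = {
    "buddha":         {"face_lengths": 10, "angulas": 120},
    "bodhisattva":    {"face_lengths": 9,  "angulas": 108},
    "goddess_dakini": {"face_lengths": 8,  "angulas": 96},
    "wrathful":       {"face_lengths": 6,  "angulas": 72},
}

def assess_proportions(figure_type, measured_face_lengths, tolerance=0.05):
    """Compare a measured figure height (in face-lengths) against the
    canonical model; relative deviation within `tolerance` counts as
    canonical. The tolerance is an illustrative assumption."""
    canon = ANGULA_MODELS[figure_type]["face_lengths"]
    deviation = abs(measured_face_lengths - canon) / canon
    return {"canonical": canon,
            "deviation": deviation,
            "within_canon": deviation <= tolerance}
```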

3.11. Regional Style Classification

The regional style module classifies the artistic school and approximate era of the artwork, again using zero-shot SigLIP text-image similarity. Seven painting traditions are distinguished:

  1. Menri — strong mineral colors, elaborate rocky landscapes, Tsang region
  2. Karma Gadri — misty atmospheric landscapes, muted palette, Eastern Tibet
  3. U-Tsang — deep blue backgrounds, extensive gold line work, Central Tibet
  4. Newari — warm red-orange palette, ornate decorative borders, Kathmandu Valley
  5. Chinese-influenced — Han Chinese aesthetic, Qing dynasty motifs, silk brocade borders
  6. Khyenri — transitional Sino-Tibetan hybrid, 15th-century origins
  7. Eastern Tibetan — bold colors, folk art elements, Amdo and Kham regions

Era classification spans four periods: pre-15th century (Pala Indian influence), 15th–17th century (classical golden age), 18th–19th century (mature Karma Gadri with Qing influence), and 20th century–modern (synthetic pigments, revival styles).

4. Experiments

4.1. Dataset

The training dataset comprises images from two sources: (1) curated digital repositories (approximately 12,000 images, filtered to approximately 4,800 images across 93 classes after label normalization) and (2) locally collected images (approximately 700 images from museum catalogs, web sources, and direct photography). The combined dataset of approximately 5,500 images yields 27,500 feature vectors after 5× augmentation. The class distribution is heavily imbalanced: the largest class contains 407 source images while the smallest contains 4, with a mean of 59 images per class.

4.2. Classification Results

| Metric | Value |
| --- | --- |
| Top-1 Accuracy | 94.4% |
| Top-3 Accuracy | 98.3% |
| Top-5 Accuracy | 99.5% |
| Number of Classes | 93 |
| Training Vectors | ~23,400 (85%) |
| Validation Vectors | ~4,100 (15%) |

Table 4. Classification performance on the validation set.

The training curve shows rapid convergence in the first 10 epochs followed by gradual improvement, reaching peak validation accuracy at epoch 56.

4.3. Comparison with GPT-4o Baseline

Prior to developing TermaVision v4, the Terma Heritage Foundation deployed a GPT-4o-based identification system (v3) that processes images through the Azure OpenAI API with a detailed prompt containing iconographic instructions.

| Metric | GPT-4o (v3) | TermaVision (v4) |
| --- | --- | --- |
| Inference time | 8–35 seconds | 1.3–4 seconds |
| Cost per image | $0.01–0.03 | Free (local GPU) |
| Top-1 accuracy (common deities) | Higher | Lower |
| Top-1 accuracy (rare/specific figures) | Lower | Higher |
| Multi-figure detection | Descriptive (no boxes) | Bounding boxes + per-figure ID |
| Offline capable | No | Yes |
| Explainability | Natural language | Structured attributes + scores |
| Scalability | Rate-limited, costly | Unlimited, free |

Table 5. Comparison between the GPT-4o baseline (v3) and the proposed TermaVision pipeline (v4).

The GPT-4o system excels at common deities with extensive internet documentation (Shakyamuni Buddha, Green Tara, Padmasambhava) but demonstrates systematic failures on specific Tibetan historical figures. For example, GPT-4o consistently misidentifies Marpa as Milarepa — a confusion that the MLP classifier, trained on curated iconographic data, avoids entirely. Conversely, the GPT-4o system better handles deities not present in the 93-class training set, benefiting from its broad pre-training knowledge.

4.4. Subject Filter Evaluation

The zero-shot art-versus-person filter was evaluated on a qualitative test set comprising photographs of Buddhist artwork (thangkas, statues, murals) and photographs of Buddhist monks and teachers. Without the filter, the MLP classifier assigns high-confidence labels to photographs of real people (e.g., classifying a photograph of a Tibetan lama as “Stupa – Descent From Heaven” at 98% confidence). The zero-shot filter correctly distinguishes photographs of artwork from photographs of people.

4.5. Ablation Study

| Configuration | Effect |
| --- | --- |
| SigLIP + MLP only | 94.4% base accuracy |
| + Zero-shot verification | Reranking on ~3% of images; corrects attribute mismatches |
| + Compositional reasoning | Targeted corrections in multi-figure scenes |
| + Subject filter | Eliminates false positives on non-art inputs |
| + Memory recall | Advisory signal for uncertain predictions (cosine ≥ 0.88) |
| + Knowledge graph enrichment | Adds explainability metadata; no accuracy change |
| + Provenance / Iconometry / Style | Adds art-historical context (zero-shot); no accuracy change |

Table 6. Ablation study showing the contribution of each pipeline component.

5. Discussion

5.1. Strengths and Limitations

The primary strength of TermaVision lies in its combination of high accuracy (94.4% top-1) with full explainability and holistic art-historical analysis. Each identification is accompanied by a structured process log showing the SigLIP embedding, MLP classification scores, zero-shot verification results, knowledge graph attributes, confidence calibration, provenance assessment, iconometric analysis, and regional style classification. This transparency is essential for scholarly applications where users must evaluate both the identity of a figure and the art-historical context of the work.

The system faces several limitations. First, the 93-class training set covers only a fraction of the estimated 557+ distinct iconographic forms in the Tibetan Buddhist tradition. The zero-shot fallback partially addresses this gap by comparing against text embeddings of all 557 deities, but text-only embeddings cannot capture the visual subtleties that distinguish closely related forms. Second, multi-figure detection relies on Grounding DINO, which struggles with wrathful deities whose visual features (flames, multiple heads, distorted proportions) diverge significantly from the “person” and “figure” concepts in the model’s training data. Third, the system was designed for Buddhist art and should not be applied to identify living persons — a constraint enforced by the subject filter but not guaranteed against adversarial inputs.

5.2. The Master Identification Problem

An instructive failure case arose when we attempted to add historical Buddhist masters (the 14th Dalai Lama, the 5th Dalai Lama, Situ Panchen Chokyi Jungne) to the training set. With only 3–4 master classes trained on 20–40 images each, the model exhibited a “magnet effect”: every photograph of any Tibetan monk or lama was confidently classified as one of the few known masters. This occurred because: (1) photographs of monks share strong visual features (maroon robes, yellow shirt, seated posture) that dominate over the subtle facial differences distinguishing individuals, and (2) the softmax classifier must distribute probability mass across known classes, creating false confidence when the true class is absent from the training set.

This experience informed two design decisions: (1) historical masters were removed from the training set until a critical mass of diverse master classes (estimated at 15+ individuals with 20+ images each) can be collected, and (2) the zero-shot subject filter was implemented to prevent photographs of real people from reaching the classifier at all.

5.3. Implications for Cultural Heritage Preservation

TermaVision represents a step toward democratizing access to Buddhist iconographic expertise. By automating the initial identification of deity figures — with transparent evidence supporting each determination — the system can assist museum curators in cataloging uncategorized collections, support scholars in analyzing compositional programs across multiple artworks, and enable practitioners to identify figures in monastery murals and personal shrines. The structured knowledge graph format facilitates integration with existing cultural heritage databases and linked data initiatives.

6. Conclusions

We presented TermaVision, a multi-stage deep learning pipeline for automated Buddhist iconography identification that achieves 94.4% top-1 accuracy across 93 classes while providing structured explainability through a 557-deity knowledge graph. The system processes images in 1.3–4 seconds on a consumer GPU, operates entirely offline, and costs nothing per inference — representing a significant practical improvement over the prior GPT-4o-based approach.

Key architectural decisions — frozen SigLIP features with a lightweight MLP head, zero-shot attribute verification, compositional reasoning, a subject detection pre-filter, and zero-shot artifact analysis (provenance, iconometry, regional style) — address the unique challenges of Buddhist iconography: extreme intra-class variation, subtle inter-class distinctions, compositional context dependencies, and the requirement for scholarly explainability. The zero-shot artifact analysis modules provide art-historical context — medium classification, proportional assessment via the traditional Angula system, and regional school attribution — without requiring any additional training data.

Future work will focus on expanding the training set beyond 93 classes toward comprehensive coverage of the 557-deity knowledge graph, improving wrathful deity detection through specialized prompting or fine-tuning of the detection model, and developing a collaborative annotation platform to engage the scholarly community in curating training data.

Author & Funding

Conceptualization, methodology, software, validation, data curation, and writing by Thupten N. Chakrishar. This research received no external funding. Development was supported by the Terma Heritage Foundation, Inc.

References

  1. Beer, R. The Handbook of Tibetan Buddhist Symbols; Serindia Publications: Chicago, IL, 2003.
  2. Zhai, X.; Mustafa, B.; Kolesnikov, A.; Beyer, L. Sigmoid Loss for Language Image Pre-Training. ICCV, 2023.
  3. Radford, A.; Kim, J.W.; Hallacy, C. et al. Learning Transferable Visual Models From Natural Language Supervision. ICML, 2021.
  4. Ma, Y.; Liu, Y.; Xie, Q.; Xiong, S. et al. A Tibetan Thangka Data Set and Relative Tasks. Image Vis. Comput. 2021, 108, 104125.
  5. Tang, W.; Xie, Q. A Thangka Cultural Element Classification Model Based on Self-Supervised Contrastive Learning and MS Triplet Attention. Vis. Comput. 2024, 40, 3919–3935.
  6. Xue, P. et al. Thangka Yidam Classification Based on DenseNet and SENet. J. Electron. Imaging 2022, 31(4), 043039.
  7. Liu, S.; Zeng, Z.; Ren, T. et al. Grounding DINO: Marrying DINO with Grounded Pre-Training for Open-Set Object Detection. ECCV, 2024.
  8. Fiorucci, M.; Khoroshiltseva, M.; Pontil, M. et al. Machine Learning for Cultural Heritage: A Survey. Pattern Recognit. Lett. 2020, 133, 102–108.
  9. Castellano, G.; Digeno, V.; Sansaro, G.; Vessio, G. Leveraging Knowledge Graphs and Deep Learning for Automatic Art Analysis. Knowl.-Based Syst. 2022, 248, 108859.
  10. Sartini, B.; Baroncini, S.; van Erp, M. et al. ICON: An Ontology for Comprehensive Artistic Interpretations. ACM J. Comput. Cult. Herit. 2023, 16(3), Article 59.