
Semalith v1.5: A Purpose-Built Safety Classifier for AI in Financial Services

By Dr. Tejasvi Addagada


Every major bank deploying a GenAI system faces a version of the same problem. The AI assistant is capable. The business case is clear. But before any compliance officer signs off on a production rollout, someone in the room asks the question that stops most projects cold: how do we know it is safe?

Not safe in the abstract. Safe in the specific sense that the AI will not be manipulated into disclosing confidential data, generating unlicensed investment advice, or being commandeered by a prompt embedded in a customer document to act against the institution's interest. Safe in the sense that a regulator could inspect the system, review its access controls, and confirm that prompt injection — the single most well-documented attack vector against production AI systems — has been accounted for.

Most guardrail products available today cannot answer that question for a financial services context. They were built for a different problem.



Why General-Purpose Guardrails Are Not Enough

The dominant safety classifiers in the market — LlamaGuard from Meta, Granite Guardian from IBM, WildGuard from Allen AI, ShieldGemma from Google — are built around a general harm taxonomy. They excel at identifying toxic content, harmful instructions, and policy violations across a broad range of consumer and enterprise applications. For many use cases, that is sufficient.

For BFSI deployments, it creates a structural mismatch.

A gold loan chatbot, a credit advisory assistant, a regulatory reporting agent — these systems operate in a domain where the boundary between legitimate and harmful is defined by regulation, not by general harm categories. The prompt "What threshold applies to cash transactions under FATF Recommendation 10?" is a legitimate compliance query when a compliance officer asks it and a potential AML bypass probe when an adversary does. General-purpose classifiers have no way to make that distinction.

The result is one of two failure modes: excessive blocking of legitimate financial queries — high false positive rates that destroy the user experience — or insufficient detection of domain-specific attacks. Zhang and Ren documented this in 2025: domain shift from general to financial-services prompts raises false positive rates by 15 to 40 percentage points across LlamaGuard, WildGuard, and comparable models. That is not a minor calibration problem. It is a structural gap between what these systems were built for and what regulated financial institutions need.


Building Semalith

Semalith v1.5 is a 184-million-parameter safety classifier built on DeBERTa-v3-base, trained on 70,500 real-world examples across a 22-class taxonomy that simultaneously covers three safety axes: prompt injection detection, general harm classification, and financial-services regulatory compliance. The library and model card are available at https://huggingface.co/Tejasvi-addagada/semalith-v1.5


The 22 labels span nine prompt injection sub-types — from system override and jailbreak to agentic injection and indirect injection — alongside eleven BFSI-specific domains covering banking, cards, payments, loans, insurance, fraud, AML/sanctions, unlicensed financial advice, and regulatory enquiries. A 4-class auxiliary super-category head runs in parallel, trained under jointly weighted loss to prevent sub-class collapse in low-data label regions.

The design philosophy is deliberate: one model, one forward pass, 11.6 milliseconds. No external API calls. No data leaving the institution's network. Deployable within the security perimeter of any regulated entity without modification.

The training corpus is exclusively real-world data — no synthetic prompts, no templated augmentation. Every row is SHA-1 deduplicated against the 22 held-out evaluation benchmarks used to measure performance, providing a contamination audit trail that most published safety classifiers do not document.
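The SHA-1 deduplication step described above can be sketched in a few lines. This is an illustrative reconstruction, not Semalith's actual pipeline: the light normalisation applied before hashing (strip and lowercase) is an assumption, since the article does not specify how rows are canonicalised.

```python
import hashlib

def sha1_fingerprint(text: str) -> str:
    # Normalise lightly before hashing so trivial whitespace or casing
    # differences do not defeat the dedup (normalisation is an assumption).
    return hashlib.sha1(text.strip().lower().encode("utf-8")).hexdigest()

def decontaminate(train_rows: list[str], eval_rows: list[str]):
    """Drop any training row whose fingerprint appears in the eval set."""
    eval_hashes = {sha1_fingerprint(r) for r in eval_rows}
    kept = [r for r in train_rows if sha1_fingerprint(r) not in eval_hashes]
    dropped = len(train_rows) - len(kept)
    return kept, dropped
```

Running every training row through this filter against the held-out benchmarks is what produces the contamination audit trail: the count of dropped rows per benchmark is itself a publishable artifact.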


What the Benchmarks Show

On prompt injection detection — the primary design target — Semalith v1.5 achieves results that exceed those of models 43 times its size by wide margins.

On HackaPrompt, a competition dataset of 1,501 real-world prompt injection attacks, Semalith achieves F1 0.997. LlamaGuard-3-8B scores approximately 0.09 on the same benchmark. Granite Guardian 3.3 scores approximately 0.18. Both are eight-billion-parameter models.

The pattern holds across the Mosscap benchmark, which tests increasingly sophisticated defended-LLM extraction attacks across three difficulty tiers. Semalith achieves F1 0.945 on Mosscap L6, 0.934 on L7, and 0.887 on L8. Granite Guardian scores 0.111, 0.113, and 0.085 respectively on the same tiers. This is not marginal. It reflects the fundamental training objective mismatch described earlier.
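For readers comparing these numbers across benchmarks, the F1 scores quoted throughout are the standard harmonic mean of precision and recall. A minimal sketch, with hypothetical detection counts chosen only to illustrate the formula:

```python
def f1(tp: int, fp: int, fn: int) -> float:
    """F1 = harmonic mean of precision and recall.

    tp: attacks correctly flagged, fp: benign prompts wrongly flagged,
    fn: attacks missed. Counts here are illustrative, not benchmark data.
    """
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)
```

Because F1 penalises both missed attacks and false alarms, a score near 0.1 on HackaPrompt or Mosscap means the classifier is failing on one or both axes, not merely underperforming at the margin.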

On agentic false positive rate — arguably the most operationally critical metric for financial services deployments — Semalith produces a 0.5% false positive rate on 208 benign agentic tasks from the AgentHarm benchmark. LlamaGuard-3-8B produces a 6.3% false positive rate on the same set. A 12× difference in false positive rate is the difference between a system that operates invisibly and one that generates constant intervention queues.
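The false positive rates above follow from simple counting over the 208 benign tasks. A sketch, assuming the rate is computed as flagged-benign over total-benign (the specific flagged counts below are back-of-envelope illustrations, not published figures):

```python
def false_positive_rate(flagged_benign: int, total_benign: int) -> float:
    """Fraction of benign inputs wrongly flagged as attacks."""
    return flagged_benign / total_benign

# On a 208-task benign set, roughly 1 flagged task yields ~0.5%,
# while roughly 13 flagged tasks yields ~6.3%.
```

At production traffic volumes, that per-benchmark difference compounds: a 6% rate on thousands of daily agentic tasks means a standing review queue, while a sub-1% rate can be handled as exceptions.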

On the FINPROOF adversarial benchmark — an open evaluation standard covering seven regulatory harm categories across 6,283 domain-specific prompts — Semalith ranks third among five evaluated models with a category recall of 0.527. Granite Guardian leads at 0.695, followed by ShieldGemma at 0.578. LlamaGuard and WildGuard trail at 0.397 and 0.209.

That ranking is worth sitting with. The 8B models lead on BFSI domain classification. At 184 million parameters, Semalith cannot match their capacity for fine-grained regulatory domain categorisation. That is an honest limitation and it is documented as such. But the comparison framework matters. The operational question is which combination of detection capability, false positive rate, latency, and deployment footprint fits the environment — and for most BFSI production AI deployments, that profile favours Semalith's operating point.


Regulatory Grounding

Semalith's 22-class taxonomy was designed with regulatory mapping as a first principle, not an afterthought.

The EU AI Act Article 9(4) mandates adversarial testing for high-risk AI systems in credit scoring, insurance underwriting, and AML compliance before deployment. SR 11-7 requires model validation to be documented, reproducible, and independently verifiable — the contamination audit, training manifest, and evaluation harness are all publicly released to support that standard. ISO/IEC 42001:2023 requires documented risk controls for AI management systems; Semalith's per-label regulatory anchors map directly to supervisory reporting categories.

The eleven B-series labels are not just classification categories — they are audit primitives. A query classified as B-11 can trigger an AML review flag without additional post-processing. A B-09 output can route to a compliance escalation queue. The audit trail is structural, not retrofitted.

This is the distinction between a general safety classifier adapted for financial services and a classifier built from the ground up for the regulatory obligations that financial institutions actually face.


Access and What Comes Next

Semalith v1.5 is available on HuggingFace at huggingface.co/Tejasvi-addagada/semalith-v1.5 under the Semalith Research Access License. Model weights are gated for research access — academic researchers, independent safety evaluators, and non-commercial practitioners can request access directly through the page. The evaluation harness, held-out benchmark suite, and contamination audit are publicly available for independent replication without access approval.

Version 1.6 is in active development, targeting improved FINPROOF category recall through domain-specific corpus expansion and a B-11 sub-label split separating AML compliance queries from wealth management queries.

The accompanying research paper is being submitted to arXiv and will be linked from the model page on publication.



A Note on What Semalith Is Not

Semalith detects adversarial inputs. It does not evaluate model outputs. It does not reason about context across multi-turn conversations beyond a 512-token window. It does not cover non-English languages. It is not a complete AI governance stack — it is one component of one, the input-layer guard that catches the attack before it reaches the model.

Deploying Semalith without an output-layer classifier, a response audit system, and a human escalation path for edge cases would still leave meaningful exposure. The right use of a tool like this is as part of a layered defence.

The institutions that deploy AI safely in financial services will be the ones that treat safety classification as an engineering discipline — with the same rigour they apply to credit model validation, fraud detection thresholds, and data lineage documentation. Semalith is built to support that discipline.


Semalith v1.5 is available for research access at huggingface.co/Tejasvi-addagada/semalith-v1.5. For questions or collaboration, contact tejasvi@tejasviaddagada.com.

 
 
 
