Overview
SNIP is a compact text classifier for pasted text, snippets, logs, configuration files, and text-like source files. The release model predicts 28 source and text labels from bounded text samples using hashed character n-grams and a sparse linear classifier. It is intended for pastebin sites, editors, and developer tools that need a useful syntax suggestion immediately after text is pasted without sending the pasted content to a server. The model is embedded in a dependency-free TypeScript runtime and released under the BSD 3-Clause license.
Problem
Pastebin sites, editors, and developer tools often need a useful syntax suggestion immediately after text is pasted. SNIP is designed for that narrow workflow: fast browser-local inference without sending pasted content to a server. It predicts a likely syntax or text label for pasted text, snippets, logs, configuration files, and text-like source files. Binary file detection, MIME sniffing, and full file-type forensics are outside the current scope.
Method
The release model is a multiclass linear classifier trained with passive-aggressive updates. Input text is sampled, converted to hashed character n-gram features, L2-normalized, scored by sparse per-label weights, and ranked with a softmax transform. The final package exposes both the accepted label and the raw top prediction so applications can decide how much ambiguity to show to users.
Inputs up to 16 KiB are classified whole. Longer inputs are represented by windows from the start, middle, and end, which bounds runtime for large pastes while preserving common file-level signals.
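The windowing rule above can be sketched as follows. This is a minimal illustration, not the package's sampleText implementation: the function name sampleWindows, the 4,096-character window length, and the newline joiner are all assumptions, and only the 16 KiB whole-input limit comes from the text. Lengths are counted in UTF-16 code units.

```typescript
// Only the 16 KiB limit is stated in the document; the other two constants are assumptions.
const WHOLE_INPUT_LIMIT = 16 * 1024; // inputs up to this length are classified whole
const WINDOW_CHARS = 4096;           // assumed window length
const WINDOW_SEPARATOR = "\n";       // assumed joiner between windows

// Returns the input unchanged when it is short enough, otherwise a bounded
// sample built from the start, middle, and end of the text.
function sampleWindows(text: string): string {
  if (text.length <= WHOLE_INPUT_LIMIT) {
    return text;
  }
  const head = text.slice(0, WINDOW_CHARS);
  const midStart = Math.floor((text.length - WINDOW_CHARS) / 2);
  const middle = text.slice(midStart, midStart + WINDOW_CHARS);
  const tail = text.slice(text.length - WINDOW_CHARS);
  return [head, middle, tail].join(WINDOW_SEPARATOR);
}
```

Because the sample size is fixed once the limit is exceeded, the work done per classification no longer grows with the size of the paste.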
| Component | Release setting |
|---|---|
| n-gram range | 1 through 5 characters |
| Hash buckets | 32,768 |
| Feature value | log1p(count), L2-normalized |
| Retained weights | 1,500 per label |
| Serialized precision | 4 decimal places |
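A rough sketch of how the settings in this table could combine into a feature vector is shown below. The n-gram range, bucket count, log1p(count) value, and L2 normalization come from the table. The hash function is an assumption (an FNV-1a-style hash over UTF-16 code units), since the document does not say which hash the release model uses, and the names fnv1a and hashedNgramFeatures are hypothetical.

```typescript
const NGRAM_MIN = 1;
const NGRAM_MAX = 5;
const HASH_BUCKETS = 32768;

// FNV-1a-style hash over UTF-16 code units. An assumed stand-in: the hash
// used by the release model is not specified in the document.
function fnv1a(s: string): number {
  let h = 0x811c9dc5;
  for (let i = 0; i < s.length; i++) {
    h ^= s.charCodeAt(i);
    h = Math.imul(h, 0x01000193);
  }
  return h >>> 0; // force an unsigned 32-bit result
}

// Builds a sparse feature vector: bucket index -> L2-normalized log1p(count).
function hashedNgramFeatures(text: string): Map<number, number> {
  const counts = new Map<number, number>();
  for (let n = NGRAM_MIN; n <= NGRAM_MAX; n++) {
    for (let i = 0; i + n <= text.length; i++) {
      const bucket = fnv1a(text.slice(i, i + n)) % HASH_BUCKETS;
      counts.set(bucket, (counts.get(bucket) ?? 0) + 1);
    }
  }

  const features = new Map<number, number>();
  let sumSquares = 0;
  counts.forEach((count, bucket) => {
    const value = Math.log1p(count);
    features.set(bucket, value);
    sumSquares += value * value;
  });

  const norm = Math.sqrt(sumSquares);
  if (norm > 0) {
    // Updating existing keys during iteration is safe; no keys are added.
    features.forEach((value, bucket) => features.set(bucket, value / norm));
  }
  return features;
}
```

Hashing keeps the feature space fixed at 32,768 buckets regardless of how many distinct n-grams appear, at the cost of occasional collisions between unrelated n-grams.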
Data
The dataset was built iteratively from generated structured text, curated programming examples, realistic project fragments, targeted hard-neighbor cases, and plain-text fallback examples. Candidate splits were checked for exact and normalized duplicate overlap.
For programming languages, examples were written as complete files or realistic project fragments: CLIs, web handlers, tests, services, configuration loaders, migrations, and framework code. For structured formats such as JSON, CSV, and logs, examples varied shape, length, field names, nesting, and value distributions rather than repeating one template.
| Split | Rows | Labels | Median chars | P90 chars |
|---|---|---|---|---|
| Train | 2,618 | 28 | 210.5 | 1,124 |
| Validation | 487 | 28 | 213 | 1,192 |
| Test | 532 | 28 | 180 | 1,049 |
The final training data emphasizes realistic short snippets, source-grouped splits, and explicit hard cases for ambiguous neighboring labels such as HTML/XML, INI/TOML, JSON/log, Markdown/plain text, and CSV/plain text.
Training Loop
A major objective of training the n-gram model was to test how much of the training process can be performed semi-autonomously by an LLM agent. N-gram models are very fast to train, which gave the agent many opportunities to run experiments and build up a training dataset over time that specifically targeted failures in the trained models. A key part of getting that loop to work was ensuring the data stayed representative of a real-world text distribution while keeping the model from overfitting to the generated dataset. The agent was instructed to keep notes, an experiments log, and a data quality log. It did require some nudging along the way, such as prompting it to manually inspect some of the generated data and note the quality issues. For example, early C examples were almost all duplicates of 4-5 lines of a main() printing a few words.
Each round started with a candidate model trained on the current corpus. The agent evaluated it against the normal validation/test split, the targeted hard cases for confused labels, and the held-out evaluation suites. When the model failed, a subagent was used to describe the failures as broader missing shapes: short prose-heavy Markdown, TOML that looks like INI, JSON-looking application logs, small diffs, short language snippets without imports, and so on.
Those descriptions were then used to create new train-only examples. For programming languages, the examples were usually generated as realistic files or fragments rather than templates. For structured formats, the generation could be more programmatic, but still had to vary field names, lengths, nesting, and punctuation. Before a new training split was used, the added rows were checked against the rest of the corpus for exact or near-duplicate overlap.
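The overlap check described above might look something like the sketch below. The document does not specify the normalization or the near-duplicate measure, so everything here is an assumption: lowercasing plus whitespace collapsing for the normalized comparison, Jaccard similarity over 5-character shingles for near-duplicates, and a 0.9 threshold. All function names are hypothetical.

```typescript
// Assumed normalization: lowercase and collapse whitespace runs.
function normalizeForDedup(text: string): string {
  return text.toLowerCase().replace(/\s+/g, " ").trim();
}

// Set of overlapping character shingles; short texts become a single shingle.
function shingles(text: string, size: number = 5): Set<string> {
  const out = new Set<string>();
  for (let i = 0; i + size <= text.length; i++) {
    out.add(text.slice(i, i + size));
  }
  if (out.size === 0 && text.length > 0) out.add(text);
  return out;
}

function jaccard(a: Set<string>, b: Set<string>): number {
  if (a.size === 0 && b.size === 0) return 1;
  let shared = 0;
  a.forEach((s) => { if (b.has(s)) shared++; });
  return shared / (a.size + b.size - shared);
}

// Returns the indices of added rows that duplicate, or nearly duplicate,
// any row already in the corpus.
function findOverlaps(added: string[], corpus: string[], threshold: number = 0.9): number[] {
  const entries = corpus.map((row) => {
    const norm = normalizeForDedup(row);
    return { norm, sh: shingles(norm) };
  });
  const flagged: number[] = [];
  added.forEach((row, index) => {
    const norm = normalizeForDedup(row);
    const sh = shingles(norm);
    const hit = entries.some((e) => e.norm === norm || jaccard(sh, e.sh) >= threshold);
    if (hit) flagged.push(index);
  });
  return flagged;
}
```

This pairwise comparison is quadratic in the number of rows, which is workable for a corpus of a few thousand examples; a larger corpus would likely need an index such as MinHash.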
A separate LLM subagent was used to create new held-out evaluation suites between rounds. This mattered for the overall quality of the training dataset: the held-out examples need to differ enough from the training data to show whether the model is actually robust. The held-out suites were generated from label-level requirements and broad scenario prompts, then locked before the next candidate was evaluated. Once a held-out suite exposed an error category and influenced the next data pass, it was no longer treated as a final benchmark. It became a regression suite.
The loop looked like this:
- train a candidate on the current corpus
- evaluate against validation, test, hard cases, and held-out evaluation suites
- inspect failures as categories and patterns
- generate train-only coverage for those categories
- check duplicate and near-duplicate overlap
- train the next candidate
- create fresh held-out coverage when a release-quality claim is needed
This process let the dataset grow in the directions the n-gram model actually needed while keeping a clean separation between training data and held-out evaluation suites.
| Evaluation role | Rows | Purpose |
|---|---|---|
| Validation + test | 1,019 | Main split evaluation |
| Held-out suites | 599 | Independent checks generated across training rounds |
| Hard-neighbor cases | 148 | Targeted stress cases for commonly confused labels |
Results
| Evaluation set | Examples | Accuracy | Macro F1 |
|---|---|---|---|
| Validation split | 487 | 1.0000 | 1.0000 |
| Test split | 532 | 0.9962 | 0.9926 |
| Held-out evaluation suites | 599 | 0.9816 | 0.9819 |
| Hard-neighbor cases | 148 | 0.9932 | 0.6905 |
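For readers comparing the two metric columns, the sketch below shows one common way to compute accuracy and macro F1 from label pairs. It is an illustration only: the document does not state how the reported figures were averaged, and this version averages per-label F1 over every label that appears in either the true or the predicted labels. Because macro F1 weights every label equally, a single error on a small evaluation set can move it much more than it moves accuracy.

```typescript
interface LabelPair {
  truth: string;
  predicted: string;
}

// Accuracy is the fraction of exact matches. Macro F1 is the unweighted mean
// of per-label F1, where F1 = 2*TP / (2*TP + FP + FN).
function accuracyAndMacroF1(pairs: LabelPair[]): { accuracy: number; macroF1: number } {
  const labels = new Set<string>();
  const tp = new Map<string, number>();
  const fp = new Map<string, number>();
  const fn = new Map<string, number>();
  const bump = (m: Map<string, number>, key: string): void => {
    m.set(key, (m.get(key) ?? 0) + 1);
  };

  let correct = 0;
  pairs.forEach(({ truth, predicted }) => {
    labels.add(truth);
    labels.add(predicted);
    if (truth === predicted) {
      correct++;
      bump(tp, truth);
    } else {
      bump(fn, truth);     // the true label was missed
      bump(fp, predicted); // the predicted label was wrong
    }
  });

  let f1Sum = 0;
  labels.forEach((label) => {
    const t = tp.get(label) ?? 0;
    const denom = 2 * t + (fp.get(label) ?? 0) + (fn.get(label) ?? 0);
    f1Sum += denom === 0 ? 0 : (2 * t) / denom;
  });

  return {
    accuracy: pairs.length > 0 ? correct / pairs.length : 0,
    macroF1: labels.size > 0 ? f1Sum / labels.size : 0,
  };
}
```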
SNIP scores every label and selects the highest-scoring label. The margin shows how far that winner is ahead of the runner-up.
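The ranking step can be sketched as a softmax over per-label scores followed by a sort. This is an assumption-laden illustration: the names rankLabels, LabelScore, and RankedLabel are hypothetical, and the margin is computed here as the gap between the top two softmax probabilities, whereas the document does not say whether SNIP measures the margin on probabilities or on raw scores.

```typescript
interface LabelScore {
  label: string;
  score: number; // raw linear score for this label
}

interface RankedLabel {
  label: string;
  probability: number;
}

// Converts raw scores to softmax probabilities, sorts them in descending
// order, and reports the gap between the winner and the runner-up.
function rankLabels(scored: LabelScore[]): { ranked: RankedLabel[]; margin: number } {
  // Subtracting the maximum score keeps Math.exp from overflowing.
  const maxScore = Math.max(...scored.map((s) => s.score));
  const exps = scored.map((s) => ({ label: s.label, e: Math.exp(s.score - maxScore) }));
  const total = exps.reduce((sum, x) => sum + x.e, 0);

  const ranked: RankedLabel[] = exps
    .map((x) => ({ label: x.label, probability: x.e / total }))
    .sort((a, b) => b.probability - a.probability);

  const [first, second] = ranked;
  const margin = first && second ? first.probability - second.probability : 0;
  return { ranked, margin };
}
```

A small margin signals that the top two labels are nearly tied, which is the situation described in the Limitations section for pairs such as TypeScript and JavaScript.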
Runtime and Size
The runtime is authored in TypeScript and published as plain JavaScript with generated declaration files. It has no runtime dependencies. The raw model JSON is 626,595 bytes, and the gzip-compressed model is 204,593 bytes.
| Input size | Sampled chars | P50 ms | P95 ms |
|---|---|---|---|
| 1 KB | 1,024 | 1.490 | 1.548 |
| 16 KB | 16,384 | 6.580 | 6.730 |
| 100 KB | 12,292 | 5.180 | 5.310 |
| 1 MB | 12,292 | 5.170 | 5.310 |
| 5 MB | 12,292 | 5.210 | 5.380 |
Limitations
- Very short snippets may lack enough evidence for a specific syntax label.
- TypeScript and JavaScript can be close when a snippet has no type syntax.
- JSON-lines application logs can be close to JSON.
- Markdown/plain-text separation can be weak on very short prose-like Markdown.
- Large JSON manifests with embedded language-labeled snippets can look like source code.
Labels
bash, c, cpp, csharp, css, csv,
diff, dockerfile, go, html, ini, java,
javascript, json, log, lua, markdown, php,
plain_text, powershell, python, ruby, rust, sql,
toml, typescript, xml, yaml
Install and API
The package exports classifyText and classifyTextAsync, and embeds the SNIP model directly
in the runtime. There is no model fetch step for browser applications.
| Export | Use |
|---|---|
| classifyText(text) | Synchronous classification with the embedded SNIP model. |
| classifyTextAsync(text) | Yields once, then classifies. Useful for UI code that prefers an awaitable API. |
| sampleText(text) | Returns the bounded sample SNIP will classify for a given input. |
Results include label, predicted_label, confidence, margin, and
the top five alternatives. SNIP scores every label and returns the highest-scoring label as the
prediction. margin is the gap between the winning label and the runner-up.
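One way an application might act on these fields is sketched below. The SnipResultLike interface is an assumed shape built only from the field names listed above; the field holding the top five alternatives is left out because its name is not given. The decideSyntax helper and the 0.2 margin threshold are hypothetical, and the import of classifyText is omitted because the package name is not stated here.

```typescript
// Assumed result shape, based only on the documented field names.
interface SnipResultLike {
  label: string;           // accepted label
  predicted_label: string; // raw top prediction
  confidence: number;
  margin: number;          // gap between the winner and the runner-up
}

type SyntaxDecision =
  | { kind: "apply"; syntax: string }    // switch highlighting automatically
  | { kind: "suggest"; syntax: string }; // offer the label but let the user confirm

// Applies the label automatically only when the accepted label matches the raw
// prediction and the margin clears a (hypothetical) threshold.
function decideSyntax(result: SnipResultLike, minMargin: number = 0.2): SyntaxDecision {
  const agreed = result.label === result.predicted_label;
  if (agreed && result.margin >= minMargin) {
    return { kind: "apply", syntax: result.label };
  }
  return { kind: "suggest", syntax: result.predicted_label };
}
```

In an editor, a result from classifyText(text) could be passed straight to a helper like this, with the threshold tuned to how intrusive an automatic syntax switch would be for users.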