SNIP: Small N-gram Identifier for Pastes

Overview

SNIP is a compact text classifier for pasted text, snippets, logs, configuration files, and text-like source files. The release model predicts 28 source and text labels from bounded text samples using hashed character n-grams and a sparse linear classifier. It is intended for pastebin sites, editors, and developer tools that need a useful syntax suggestion immediately after text is pasted without sending the pasted content to a server. The model is embedded in a dependency-free TypeScript runtime and released under the BSD 3-Clause license.

Problem

Pastebin sites, editors, and developer tools often need a useful syntax suggestion immediately after text is pasted. SNIP is designed for that narrow workflow: fast, browser-local inference that returns a likely syntax or text label without sending pasted content to a server. Binary file detection, MIME sniffing, and full file-type forensics are outside the current scope.

Method

The release model is a multiclass linear classifier trained with passive-aggressive updates. Input text is sampled, converted to hashed character n-gram features, L2-normalized, scored by sparse per-label weights, and ranked with a softmax transform. The final package exposes both the accepted label and the raw top prediction so applications can decide how much ambiguity to show to users.
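The feature step can be sketched as follows. This is a minimal illustration, not SNIP's actual code: the FNV-1a hash, the modulo bucket mapping, and the names `fnv1a` and `extractFeatures` are assumptions. Only the n-gram range, the bucket count, and the `log1p(count)` value with L2 normalization come from the release settings.

```typescript
// Illustrative sketch only: the hash function and bucket mapping are assumptions.
const NUM_BUCKETS = 32768; // release setting: hash buckets
const MIN_N = 1; // release setting: shortest n-gram
const MAX_N = 5; // release setting: longest n-gram

// FNV-1a over UTF-16 code units of text[start, end). Assumed hash, chosen for brevity.
function fnv1a(text: string, start: number, end: number): number {
  let hash = 0x811c9dc5;
  for (let i = start; i < end; i++) {
    hash ^= text.charCodeAt(i);
    hash = Math.imul(hash, 0x01000193);
  }
  return hash >>> 0; // force an unsigned 32-bit value
}

// Returns a sparse vector: bucket index -> L2-normalized log1p(count).
function extractFeatures(text: string): Map<number, number> {
  // Count how many n-grams land in each hash bucket.
  const counts = new Map<number, number>();
  for (let n = MIN_N; n <= MAX_N; n++) {
    for (let i = 0; i + n <= text.length; i++) {
      const bucket = fnv1a(text, i, i + n) % NUM_BUCKETS;
      counts.set(bucket, (counts.get(bucket) || 0) + 1);
    }
  }

  // Dampen counts with log1p and accumulate the squared norm.
  const damped = new Map<number, number>();
  let sumSquares = 0;
  counts.forEach((count, bucket) => {
    const value = Math.log1p(count);
    damped.set(bucket, value);
    sumSquares += value * value;
  });

  // Scale to unit L2 length; an empty input yields an empty vector.
  const norm = Math.sqrt(sumSquares);
  const features = new Map<number, number>();
  damped.forEach((value, bucket) => {
    features.set(bucket, value / norm);
  });
  return features;
}
```

Because the vector has unit length, a long paste and a short snippet with the same character profile produce comparable scores against the per-label weights.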

Inputs up to 16 KiB are classified whole. Longer inputs are represented by windows from the start, middle, and end, which bounds runtime for large pastes while preserving common file-level signals.
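A rough sketch of this kind of bounded sampling is shown below. The 4,096-character window, the newline separator, measuring the limit in characters rather than bytes, and the name `sampleWindows` are all assumptions made for illustration. The package's own `sampleText` export is the authoritative implementation, and its output length may differ from this sketch.

```typescript
// Illustrative sketch only: window size and separator are assumptions.
const WHOLE_LIMIT = 16 * 1024; // inputs up to this length are used whole
const WINDOW = 4096; // assumed window length, not stated in the source

function sampleWindows(
  text: string,
  wholeLimit: number = WHOLE_LIMIT,
  windowSize: number = WINDOW
): string {
  // Short inputs are classified as-is.
  if (text.length <= wholeLimit) return text;

  // Long inputs contribute one window each from the start, middle, and end.
  const midStart = Math.floor((text.length - windowSize) / 2);
  const start = text.slice(0, windowSize);
  const middle = text.slice(midStart, midStart + windowSize);
  const end = text.slice(text.length - windowSize);
  return [start, middle, end].join("\n");
}
```

The sampled length is constant for any input above the limit, which is why classification time stops growing with paste size.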

Component	Release setting
n-gram range	1 through 5 characters
Hash buckets	32,768
Feature value	`log1p(count)`, L2-normalized
Retained weights	1,500 per label
Serialized precision	4 decimal places
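The last two settings describe how the weights are shrunk for shipping. One plausible way to do that is sketched below. Selecting by absolute magnitude is an assumption (the source states only the retained count and the precision), and `pruneWeights` is a hypothetical helper name.

```typescript
// Illustrative sketch only: magnitude-based selection is an assumption.
// Keeps the k largest-magnitude weights for one label and rounds them
// to a fixed number of decimal places.
function pruneWeights(
  weights: Map<number, number>,
  k: number,
  decimals: number
): Map<number, number> {
  const scale = Math.pow(10, decimals);

  // Flatten to [bucket, weight] pairs and sort by descending magnitude.
  const entries: Array<[number, number]> = [];
  weights.forEach((weight, bucket) => entries.push([bucket, weight]));
  entries.sort((a, b) => Math.abs(b[1]) - Math.abs(a[1]));

  const kept = new Map<number, number>();
  for (const [bucket, weight] of entries.slice(0, k)) {
    const rounded = Math.round(weight * scale) / scale;
    // A weight that rounds to zero carries no signal, so it is dropped.
    if (rounded !== 0) kept.set(bucket, rounded);
  }
  return kept;
}

// Example with the release settings: pruneWeights(labelWeights, 1500, 4)
```

Fewer weights and shorter decimal strings both reduce the serialized JSON size, at the cost of a small approximation in each score.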

Data

The dataset was built iteratively from generated structured text, curated programming examples, realistic project fragments, targeted hard-neighbor cases, and plain-text fallback examples. Candidate splits were checked for exact and normalized duplicate overlap.
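A check of this kind might look like the sketch below. The normalization rule (lowercasing and collapsing whitespace) and the names `normalizeForDedup` and `findOverlap` are assumptions; the source does not say how normalized duplicates were defined.

```typescript
// Illustrative sketch only: the normalization rule is an assumption.
function normalizeForDedup(text: string): string {
  // Lowercase, collapse every whitespace run to one space, trim the ends.
  return text.toLowerCase().replace(/\s+/g, " ").trim();
}

// Returns the indices of candidate rows whose exact or normalized text
// already appears in the existing rows (for example, another split).
function findOverlap(existing: string[], candidates: string[]): number[] {
  const exact = new Set<string>(existing);
  const normalized = new Set<string>(existing.map(normalizeForDedup));

  const overlapping: number[] = [];
  candidates.forEach((row, index) => {
    if (exact.has(row) || normalized.has(normalizeForDedup(row))) {
      overlapping.push(index);
    }
  });
  return overlapping;
}
```

Any index returned by `findOverlap` marks a row to drop or rewrite before the split is used.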

For programming languages, examples were written as complete files or realistic project fragments: CLIs, web handlers, tests, services, configuration loaders, migrations, and framework code. For structured formats such as JSON, CSV, and logs, examples varied shape, length, field names, nesting, and value distributions rather than repeating one template.

Split	Rows	Labels	Median chars	P90 chars
Train	2,618	28	210.5	1,124
Validation	487	28	213	1,192
Test	532	28	180	1,049

The final training data emphasizes realistic short snippets, source-grouped splits, and explicit hard cases for ambiguous neighboring labels such as HTML/XML, INI/TOML, JSON/log, Markdown/plain text, and CSV/plain text.

Training Loop

A major objective of training SNIP was to test how much of the model-training process an LLM agent can carry out semi-autonomously. N-gram models are very fast to train, which gave the agent many opportunities to run experiments and to build up, over time, a training dataset that specifically targeted failures in the trained models. A key part of making that loop work was keeping the data representative of real-world text while preventing the model from overfitting to the generated dataset. The agent was instructed to keep notes, an experiments log, and a data quality log. It did need some nudging along the way, such as prompts to manually inspect some of the generated data and note quality issues. For example, early C examples were almost all duplicates of four or five lines of a main() printing a few words.

Each round started with a candidate model trained on the current corpus. The agent evaluated it against the normal validation/test split, targeted hard cases for confused labels, and the held-out evaluation suites. When the model failed, a subagent described the failures as broader missing shapes: short prose-heavy Markdown, TOML that looks like INI, JSON-looking application logs, small diffs, short language snippets without imports, and so on.

Those descriptions were then used to create new train-only examples. For programming languages, the examples were usually generated as realistic files or fragments rather than templates. For structured formats, the generation could be more programmatic, but still had to vary field names, lengths, nesting, and punctuation. Before a new training split was used, the added rows were checked against the rest of the corpus for exact or near-duplicate overlap.

A separate LLM subagent was used to create new held-out evaluation suites between rounds. This was important for the overall quality of the training dataset, because the held-out examples need to differ enough from the training data to show whether the model stays robust. The held-out suites were generated from label-level requirements and broad scenario prompts, then locked before the next candidate was evaluated. Once a held-out suite exposed an error category and influenced the next data pass, it was no longer treated as a final benchmark; it became a regression suite.

The loop looked like this:

1. Train a candidate on the current corpus.
2. Evaluate against validation, test, hard cases, and held-out evaluation suites.
3. Inspect failures as categories and patterns.
4. Generate train-only coverage for those categories.
5. Check duplicate and near-duplicate overlap.
6. Train the next candidate.
7. Create fresh held-out coverage when a release-quality claim is needed.

This process let the dataset grow in the directions the model actually needed while keeping a clean separation between training data and held-out evaluation suites.

Evaluation role	Rows	Purpose
Validation + test	1,019	Main split evaluation
Held-out suites	599	Independent checks generated across training rounds
Hard-neighbor cases	148	Targeted stress cases for commonly confused labels

Results

Evaluation set	Examples	Accuracy	Macro F1
Validation split	487	1.0000	1.0000
Test split	532	0.9962	0.9926
Held-out evaluation suites	599	0.9816	0.9819
Hard-neighbor cases	148	0.9932	0.6905

SNIP scores every label and selects the highest-scoring label. The margin shows how far that winner is ahead of the runner-up.
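The ranking and margin can be illustrated with a short sketch. It assumes the margin is the difference between the top two softmax probabilities; the source does not say whether the margin is taken on probabilities or on raw scores. `rankLabels` and `RankedLabel` are hypothetical names.

```typescript
// Illustrative sketch only: computing the margin on softmax probabilities
// (rather than raw scores) is an assumption.
interface RankedLabel {
  label: string;
  probability: number;
}

function rankLabels(scores: { [label: string]: number }): {
  ranked: RankedLabel[];
  margin: number;
} {
  const labels = Object.keys(scores);
  if (labels.length === 0) throw new Error("at least one label is required");

  // Subtract the maximum score before exponentiating for numerical stability.
  let maxScore = -Infinity;
  for (const label of labels) maxScore = Math.max(maxScore, scores[label]);
  const exps = labels.map((label) => Math.exp(scores[label] - maxScore));
  const total = exps.reduce((sum, value) => sum + value, 0);

  // Convert to probabilities and sort from most to least likely.
  const ranked = labels
    .map((label, i) => ({ label, probability: exps[i] / total }))
    .sort((a, b) => b.probability - a.probability);

  // With a single label there is no runner-up, so the winner's probability is used.
  const margin =
    ranked.length > 1
      ? ranked[0].probability - ranked[1].probability
      : ranked[0].probability;
  return { ranked, margin };
}
```

A small margin means the runner-up is nearly as likely as the winner, which is the typical signature of the HTML/XML or INI/TOML confusions discussed earlier.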

Runtime and Size

The runtime is authored in TypeScript and published as plain JavaScript with generated declaration files. It has no runtime dependencies. The raw model JSON is 626,595 bytes, and the gzip-compressed model is 204,593 bytes.

Input size	Sampled chars	P50 ms	P95 ms
1 KB	1,024	1.490	1.548
16 KB	16,384	6.580	6.730
100 KB	12,292	5.180	5.310
1 MB	12,292	5.170	5.310
5 MB	12,292	5.210	5.380
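Latency figures like these can be gathered with a small harness along the following lines. This is a generic sketch, not the benchmark script behind the table: the run counts, the nearest-rank percentile definition, and the names `percentile` and `benchmark` are assumptions.

```typescript
// Illustrative sketch only: not the harness used to produce the table above.
// Nearest-rank percentile over a list of samples.
function percentile(samples: number[], p: number): number {
  if (samples.length === 0) throw new Error("no samples");
  const sorted = samples.slice().sort((a, b) => a - b);
  const rank = Math.ceil((p / 100) * sorted.length);
  const index = Math.min(sorted.length - 1, Math.max(0, rank - 1));
  return sorted[index];
}

// Times fn repeatedly and reports P50 and P95 in milliseconds.
function benchmark(
  fn: () => void,
  runs: number = 200,
  warmup: number = 20
): { p50: number; p95: number } {
  // Warm-up runs give the JIT a chance to optimize before measuring.
  for (let i = 0; i < warmup; i++) fn();

  const timings: number[] = [];
  for (let i = 0; i < runs; i++) {
    const start = performance.now();
    fn();
    timings.push(performance.now() - start);
  }
  return { p50: percentile(timings, 50), p95: percentile(timings, 95) };
}

// Example: benchmark(() => classifyText(sample)) with classifyText imported from the package.
```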

Limitations

Very short snippets may lack enough evidence for a specific syntax label.
TypeScript and JavaScript can be close when a snippet has no type syntax.
JSON-lines application logs can be close to JSON.
Markdown/plain-text separation can be weak on very short prose-like Markdown.
Large JSON manifests with embedded language-labeled snippets can look like source code.

Labels

bash, c, cpp, csharp, css, csv, diff, dockerfile, go, html, ini, java, javascript, json, log, lua, markdown, php, plain_text, powershell, python, ruby, rust, sql, toml, typescript, xml, yaml

Install and API

npm install @wesr/snip

The package exports classifyText and classifyTextAsync, and embeds the SNIP model directly in the runtime. There is no model fetch step for browser applications.

import { classifyText } from "@wesr/snip";

const result = classifyText(text);
console.log(result.label);

import { classifyTextAsync } from "@wesr/snip";

const result = await classifyTextAsync(text);

Export	Use
`classifyText(text)`	Synchronous classification with the embedded SNIP model.
`classifyTextAsync(text)`	Yields once, then classifies. Useful for UI code that prefers an awaitable API.
`sampleText(text)`	Returns the bounded sample SNIP will classify for a given input.

Results include label, predicted_label, confidence, margin, and the top five alternatives. SNIP scores every label and returns the highest-scoring label as the prediction. The margin field is the gap between the winning label and the runner-up.
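An application might use these fields to decide how confidently to apply a suggestion. The sketch below is one hypothetical policy, not part of the package. The `SnipResultLike` interface lists only the four fields named above, and their types (strings and numbers) are assumed. The alternatives field is left out because its name is not given, and the 0.2 threshold is arbitrary.

```typescript
// Illustrative sketch only: field types and the threshold are assumptions.
interface SnipResultLike {
  label: string; // accepted label
  predicted_label: string; // raw top prediction
  confidence: number;
  margin: number;
}

// Hypothetical policy: apply the accepted label, but mark the suggestion as
// tentative when the margin is small or the accepted label differs from the
// raw top prediction.
function chooseSyntax(
  result: SnipResultLike,
  minMargin: number = 0.2
): { syntax: string; tentative: boolean } {
  const tentative =
    result.margin < minMargin || result.label !== result.predicted_label;
  return { syntax: result.label, tentative };
}
```

A UI could apply a non-tentative suggestion silently and show a tentative one as a dismissible hint.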