Token Factory
Serverless inference API for the best open-source models. Pay per token, scale instantly, no infrastructure to manage.
Start for free
Begin with $1 in free credits to explore our models through the Playground or API. Start building in minutes.
Playground
A web interface to try out and compare different AI models without writing any code. Test prompts, adjust parameters, see results instantly.
Two flavors
Choose the fast flavor for time-sensitive tasks or the base flavor for economical processing of larger workloads.
Text to text
Prices shown are per 1 million tokens. Batch inference is automatically billed at 50% of the base real-time model price.
| Model | Flavor | Input / 1M tokens | Output / 1M tokens |
|---|---|---|---|
| DeepSeek-R1-0528 | FAST | $2.00 | $6.00 |
| DeepSeek-R1-0528 | BASE | $0.80 | $2.40 |
| DeepSeek-V3-0324 | FAST | $0.75 | $2.25 |
| DeepSeek-V3-0324 | BASE | $0.50 | $1.50 |
| Llama-3.3-70B-Instruct | FAST | $0.25 | $0.75 |
| Llama-3.3-70B-Instruct | BASE | $0.13 | $0.40 |
| Llama-3.1-405B-Instruct | BASE | $1.00 | $3.00 |
| Llama-3.1-8B-Instruct | FAST | $0.03 | $0.09 |
| Llama-3.1-8B-Instruct | BASE | $0.02 | $0.06 |
| Qwen3-235B-A22B | BASE | $0.20 | $0.80 |
| Qwen3-32B | FAST | $0.20 | $0.60 |
| Qwen3-32B | BASE | $0.10 | $0.30 |
| QwQ-32B | FAST | $0.50 | $1.50 |
| QwQ-32B | BASE | $0.15 | $0.45 |
| Gemma-2-9b-it | BASE | $0.03 | $0.09 |
| Gemma-2-2b-it | BASE | $0.02 | $0.06 |
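As a sanity check on the per-token arithmetic, the cost of a request is simply tokens divided by one million times the listed price, halved for batch. A minimal sketch (the helper name is ours; prices are hard-coded from the table above):

```python
def cost_usd(input_tokens, output_tokens, input_price, output_price, batch=False):
    """Cost of one request, given per-1M-token prices from the pricing table.

    Batch inference is billed at 50% of the base real-time price.
    """
    cost = (input_tokens / 1_000_000) * input_price \
         + (output_tokens / 1_000_000) * output_price
    return cost * 0.5 if batch else cost

# Llama-3.1-8B-Instruct (BASE): $0.02 input / $0.06 output per 1M tokens.
# A request with 10,000 prompt tokens and 2,000 completion tokens:
print(round(cost_usd(10_000, 2_000, 0.02, 0.06), 6))  # 0.00032
```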
Vision
Multimodal models that accept both text and image inputs. Prices per 1 million tokens.
| Model | Flavor | Input / 1M tokens | Output / 1M tokens |
|---|---|---|---|
| Qwen2.5-VL-72B-Instruct | BASE | $0.30 | $0.90 |
| Llama-3.2-11B-Vision | BASE | $0.05 | $0.15 |
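Vision models take images through the standard OpenAI multimodal message shape, where a single user turn mixes `text` and `image_url` content parts. A sketch of building such a message (the helper and URL are illustrative, not part of any SDK):

```python
def vision_message(prompt: str, image_url: str) -> dict:
    """Build a user message mixing text and image content parts."""
    return {
        "role": "user",
        "content": [
            {"type": "text", "text": prompt},
            {"type": "image_url", "image_url": {"url": image_url}},
        ],
    }

msg = vision_message("What is in this image?", "https://example.com/photo.jpg")
# Pass [msg] as `messages` to client.chat.completions.create(
#     model="Qwen2.5-VL-72B-Instruct", ...)
```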
Embeddings
Convert text into high-dimensional vector representations for search, similarity, and retrieval.
| Model | Price / 1M tokens |
|---|---|
| BAAI/bge-en-icl | $0.02 |
| BAAI/bge-multilingual-gemma2 | $0.02 |
| intfloat/e5-mistral-7b-instruct | $0.02 |
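Once the API returns embedding vectors, search and similarity reduce to comparing those vectors, most commonly by cosine similarity. A minimal sketch with toy 3-d vectors standing in for real model output:

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: 1.0 means same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(x * x for x in b))
    return dot / norm

# Toy vectors standing in for embeddings returned by the API.
query = [0.1, 0.9, 0.2]
docs = {"doc_a": [0.1, 0.8, 0.3], "doc_b": [0.9, 0.1, 0.0]}
best = max(docs, key=lambda d: cosine_similarity(query, docs[d]))
print(best)  # doc_a
```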
How it works
Get your API key
Sign up and receive an API key instantly. Start with $1 in free credits — no credit card required.
Call the API
OpenAI-compatible API. Switch your base URL and you're running on {{COMPANY_NAME}} infrastructure. Drop-in replacement.
Scale automatically
No capacity planning. We handle auto-scaling, load balancing, and failover. You just send requests.
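Even with server-side failover, transient network errors are worth retrying on the client. A generic exponential-backoff wrapper, not part of any SDK, just a common pattern:

```python
import random
import time

def with_retries(fn, max_attempts=5, base_delay=0.5):
    """Call fn(), retrying on exceptions with exponential backoff plus jitter."""
    for attempt in range(max_attempts):
        try:
            return fn()
        except Exception:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the last error
            time.sleep(base_delay * (2 ** attempt) + random.uniform(0, 0.1))

# Usage: with_retries(lambda: client.chat.completions.create(...))
```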
OpenAI-compatible API
Switch your existing OpenAI code to {{COMPANY_NAME}} with a single line change. Our API is fully compatible — same request format, same response structure.
Supports streaming, function calling, JSON mode, and all standard chat completion parameters.
```python
from openai import OpenAI

# Point the OpenAI SDK at the Token Factory endpoint.
client = OpenAI(
    base_url="https://api.company.com/v1",
    api_key="your-api-key",
)

response = client.chat.completions.create(
    model="DeepSeek-R1-0528",
    messages=[{
        "role": "user",
        "content": "Explain transformers."
    }]
)
```
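With `stream=True`, the SDK yields chunks whose content deltas you concatenate as they arrive. The accumulation logic looks like this (simulated chunks stand in for a live stream; the helper name is ours):

```python
def collect_stream(chunks):
    """Concatenate content deltas from a chat-completions stream."""
    parts = []
    for chunk in chunks:
        delta = chunk.get("delta")
        if delta:
            parts.append(delta)
    return "".join(parts)

# Simulated deltas; a real stream comes from
# client.chat.completions.create(..., stream=True), where each chunk
# carries the text in chunk.choices[0].delta.content.
simulated = [{"delta": "Hello"}, {"delta": ", "}, {"delta": "world"}, {"delta": None}]
print(collect_stream(simulated))  # Hello, world
```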
Questions and answers
Which models are available?
We host the most popular open-source models including DeepSeek R1 & V3, Llama 3.3 & 3.1 (8B to 405B), Qwen3 (32B and 235B), QwQ-32B, Gemma 2, and embedding models. New models are added regularly.
What is the difference between the fast and base flavors?
Fast flavor uses more GPU resources per request for lower latency — ideal for real-time applications. Base flavor is optimized for throughput and cost — ideal for batch processing and async workloads.
Is the API compatible with the OpenAI SDK?
Yes. Our API follows the OpenAI chat completions format. You can use any OpenAI SDK by changing the base URL and API key. Supports streaming, function calling, and JSON mode.
How does batch inference work?
Batch inference is automatically billed at 50% of the base real-time model price. Submit requests in bulk and results are returned asynchronously — ideal for large-scale data processing.
What are the rate limits?
Default rate limits are generous and scale with your usage. For enterprise workloads requiring higher limits, contact our sales team for custom arrangements.
All prices exclude applicable taxes, including VAT. Prices are per 1 million tokens unless otherwise noted.