Token Factory
Serverless inference API for the best open-source models. Pay per token, scale instantly, no infrastructure to manage.
Start for free
Begin with $1 in free credits to explore our models through the Playground or API. Start building in minutes.
Playground
A web interface to try out and compare different AI models without writing any code. Test prompts, adjust parameters, see results instantly.
Two flavors
Choose the fast flavor for time-sensitive tasks or the base flavor for economical processing of larger workloads.
Text to text
Prices shown are per 1 million tokens. Batch inference is automatically billed at 50% of the base real-time model price.
| Model | Flavor | Input / 1M tokens | Output / 1M tokens |
|---|---|---|---|
| DeepSeek-R1-0528 | FAST | $2.00 | $6.00 |
| DeepSeek-R1-0528 | BASE | $0.80 | $2.40 |
| DeepSeek-V3-0324 | FAST | $0.75 | $2.25 |
| DeepSeek-V3-0324 | BASE | $0.50 | $1.50 |
| Llama-3.3-70B-Instruct | FAST | $0.25 | $0.75 |
| Llama-3.3-70B-Instruct | BASE | $0.13 | $0.40 |
| Llama-3.1-405B-Instruct | BASE | $1.00 | $3.00 |
| Llama-3.1-8B-Instruct | FAST | $0.03 | $0.09 |
| Llama-3.1-8B-Instruct | BASE | $0.02 | $0.06 |
| Qwen3-235B-A22B | BASE | $0.20 | $0.80 |
| Qwen3-32B | FAST | $0.20 | $0.60 |
| Qwen3-32B | BASE | $0.10 | $0.30 |
| QwQ-32B | FAST | $0.50 | $1.50 |
| QwQ-32B | BASE | $0.15 | $0.45 |
| Gemma-2-9b-it | BASE | $0.03 | $0.09 |
| Gemma-2-2b-it | BASE | $0.02 | $0.06 |
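As a sanity check on the per-token arithmetic, the cost of a request is simply tokens divided by one million times the listed price, halved for batch. A minimal sketch (the helper name is ours; prices are hard-coded from the table above):

```python
def cost_usd(input_tokens, output_tokens, input_price, output_price, batch=False):
    """Cost of one request, given per-1M-token prices from the pricing table.

    Batch inference is billed at 50% of the base real-time price.
    """
    cost = (input_tokens / 1_000_000) * input_price \
         + (output_tokens / 1_000_000) * output_price
    return cost * 0.5 if batch else cost

# Llama-3.1-8B-Instruct (BASE): $0.02 input / $0.06 output per 1M tokens.
# A request with 10,000 prompt tokens and 2,000 completion tokens:
print(round(cost_usd(10_000, 2_000, 0.02, 0.06), 6))  # 0.00032
```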
Vision
Multimodal models that accept both text and image inputs. Prices per 1 million tokens.
| Model | Flavor | Input / 1M tokens | Output / 1M tokens |
|---|---|---|---|
| Qwen2.5-VL-72B-Instruct | BASE | $0.30 | $0.90 |
| Llama-3.2-11B-Vision | BASE | $0.05 | $0.15 |
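Vision models take images through the standard OpenAI multimodal message shape, where a single user turn mixes `text` and `image_url` content parts. A sketch of building such a message (the helper and URL are illustrative, not part of any SDK):

```python
def vision_message(prompt: str, image_url: str) -> dict:
    """Build a user message mixing text and image content parts."""
    return {
        "role": "user",
        "content": [
            {"type": "text", "text": prompt},
            {"type": "image_url", "image_url": {"url": image_url}},
        ],
    }

msg = vision_message("What is in this image?", "https://example.com/photo.jpg")
# Pass [msg] as `messages` to client.chat.completions.create(
#     model="Qwen2.5-VL-72B-Instruct", ...)
```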
Embeddings
Convert text into high-dimensional vector representations for search, similarity, and retrieval.
| Model | Price / 1M tokens |
|---|---|
| BAAI/bge-en-icl | $0.02 |
| BAAI/bge-multilingual-gemma2 | $0.02 |
| intfloat/e5-mistral-7b-instruct | $0.02 |
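Once the API returns embedding vectors, search and similarity reduce to comparing those vectors, most commonly by cosine similarity. A minimal sketch with toy 3-d vectors standing in for real model output:

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: 1.0 means same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(x * x for x in b))
    return dot / norm

# Toy vectors standing in for embeddings returned by the API.
query = [0.1, 0.9, 0.2]
docs = {"doc_a": [0.1, 0.8, 0.3], "doc_b": [0.9, 0.1, 0.0]}
best = max(docs, key=lambda d: cosine_similarity(query, docs[d]))
print(best)  # doc_a
```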
How it works
Get your API key
Sign up and receive an API key instantly. Start with $1 in free credits — no credit card required.
Call the API
OpenAI-compatible API. Switch your base URL and you're running on {{COMPANY_NAME}} infrastructure. Drop-in replacement.
Scale automatically
No capacity planning. We handle auto-scaling, load balancing, and failover. You just send requests.
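Even with server-side failover, transient network errors are worth retrying on the client. A generic exponential-backoff wrapper, not part of any SDK, just a common pattern:

```python
import random
import time

def with_retries(fn, max_attempts=5, base_delay=0.5):
    """Call fn(), retrying on exceptions with exponential backoff plus jitter."""
    for attempt in range(max_attempts):
        try:
            return fn()
        except Exception:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the last error
            time.sleep(base_delay * (2 ** attempt) + random.uniform(0, 0.1))

# Usage: with_retries(lambda: client.chat.completions.create(...))
```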
OpenAI-compatible API
Switch your existing OpenAI code to {{COMPANY_NAME}} with a single line change. Our API is fully compatible — same request format, same response structure.
Supports streaming, function calling, JSON mode, and all standard chat completion parameters.
```python
from openai import OpenAI

# Point the OpenAI SDK at the Token Factory endpoint.
client = OpenAI(
    base_url="https://api.company.com/v1",
    api_key="your-api-key",
)

response = client.chat.completions.create(
    model="DeepSeek-R1-0528",
    messages=[{
        "role": "user",
        "content": "Explain transformers."
    }]
)
```
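With `stream=True`, the SDK yields chunks whose content deltas you concatenate as they arrive. The accumulation logic looks like this (simulated chunks stand in for a live stream; the helper name is ours):

```python
def collect_stream(chunks):
    """Concatenate content deltas from a chat-completions stream."""
    parts = []
    for chunk in chunks:
        delta = chunk.get("delta")
        if delta:
            parts.append(delta)
    return "".join(parts)

# Simulated deltas; a real stream comes from
# client.chat.completions.create(..., stream=True), where each chunk
# carries the text in chunk.choices[0].delta.content.
simulated = [{"delta": "Hello"}, {"delta": ", "}, {"delta": "world"}, {"delta": None}]
print(collect_stream(simulated))  # Hello, world
```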
Questions and answers
Which models are available?
We host the most popular open-source models including DeepSeek R1 & V3, Llama 3.3 & 3.1 (8B to 405B), Qwen3 (32B and 235B), QwQ-32B, Gemma 2, and embedding models. New models are added regularly.
What is the difference between the fast and base flavors?
Fast flavor uses more GPU resources per request for lower latency — ideal for real-time applications. Base flavor is optimized for throughput and cost — ideal for batch processing and async workloads.
Is the API compatible with the OpenAI SDK?
Yes. Our API follows the OpenAI chat completions format. You can use any OpenAI SDK by changing the base URL and API key. Supports streaming, function calling, and JSON mode.
How does batch inference work?
Batch inference is automatically billed at 50% of the base real-time model price. Submit requests in bulk and results are returned asynchronously — ideal for large-scale data processing.
What are the rate limits?
Default rate limits are generous and scale with your usage. For enterprise workloads requiring higher limits, contact our sales team for custom arrangements.
All prices exclude applicable taxes, including VAT. Prices are per 1 million tokens unless otherwise noted.