Rate Limits
TopRouter implements rate limits to ensure fair usage and service stability. This page explains how rate limits work and how to handle them.
How Rate Limits Work
Rate limits are applied on a per-API-key basis. Limits are measured in:
- RPM — Requests Per Minute
- TPM — Tokens Per Minute
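Because both limits apply at the same time, whichever runs out first caps your throughput. The following is a rough sketch of that arithmetic, not an official formula. The function name and the average request size are hypothetical, the 60 RPM and 100,000 TPM figures are illustrative sample values only, and this page does not specify how tokens are counted (prompt, completion, or both).

```python
def effective_requests_per_minute(
    rpm_limit: int, tpm_limit: int, avg_tokens_per_request: int
) -> int:
    """Estimate how many average-sized requests per minute fit under both limits."""
    if avg_tokens_per_request <= 0:
        raise ValueError("avg_tokens_per_request must be positive")
    # The token budget allows at most this many average-sized requests.
    allowed_by_tokens = tpm_limit // avg_tokens_per_request
    # The tighter of the two limits determines the usable rate.
    return min(rpm_limit, allowed_by_tokens)


# Illustrative numbers only; real limits depend on model and account tier.
small = effective_requests_per_minute(60, 100_000, 500)    # RPM is the bottleneck
large = effective_requests_per_minute(60, 100_000, 4_000)  # TPM is the bottleneck
print(small, large)
```

With small requests the request limit binds first; with large requests the token limit does, so fewer requests fit into each minute than the RPM value alone suggests.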
Rate Limit Headers
Each API response includes headers with rate limit information:
x-ratelimit-limit-requests: 60
x-ratelimit-remaining-requests: 55
x-ratelimit-reset-requests: 30s
x-ratelimit-limit-tokens: 100000
x-ratelimit-remaining-tokens: 95000
x-ratelimit-reset-tokens: 15s
Handling Rate Limits
Check Response Headers
python
response = client.chat.completions.create(
    model="google/gemini-3.5-flash",
    messages=[{"role": "user", "content": "Hello"}],
)
# Access rate limit info from the response headers
# Implement pacing based on the remaining quota
Exponential Backoff
When a request is rejected with HTTP status 429 (Too Many Requests), retry it with exponential backoff, doubling the wait after each failed attempt:
python
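import random
import time

def retry_with_backoff(func, max_retries=5):
    """Call func(), retrying with exponential backoff on 429 errors."""
    for attempt in range(max_retries):
        try:
            return func()
        except Exception as e:
            # Crude check: assumes the client puts the status code in the error text.
            if "429" not in str(e):
                raise
            # Wait 1s, 2s, 4s, ... plus up to 1s of random jitter.
            wait = (2 ** attempt) + random.uniform(0, 1)
            time.sleep(wait)
    raise Exception("Max retries exceeded")
Best Practices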
- Implement retry logic — Always handle 429 errors gracefully
- Add jitter — Use random delays to avoid thundering herd
- Batch requests — Combine multiple prompts when possible
- Cache responses — Store and reuse responses for identical queries
- Monitor usage — Track your request patterns in the Console
- Use streaming — Streaming doesn't reduce limits but improves perceived latency
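As one way to apply the "Cache responses" practice, the sketch below keeps responses in an in-memory dictionary keyed by a hash of the request. It is a minimal illustration under stated assumptions: `cache_key`, `cached_call`, and the `send` callback are hypothetical names, and `send` stands in for whatever function actually performs the API call in your application.

```python
import hashlib
import json
from typing import Any, Callable

Messages = list[dict[str, str]]

# Hypothetical in-memory store; a real deployment might use Redis or disk.
_cache: dict[str, Any] = {}


def cache_key(model: str, messages: Messages) -> str:
    """Build a stable key from the request parameters."""
    payload = json.dumps({"model": model, "messages": messages}, sort_keys=True)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def cached_call(
    model: str, messages: Messages, send: Callable[[str, Messages], Any]
) -> Any:
    """Return the stored response for an identical request, else call send()."""
    key = cache_key(model, messages)
    if key not in _cache:
        _cache[key] = send(model, messages)
    return _cache[key]


# Demonstration with a stand-in for the real API call.
calls = []

def fake_send(model: str, messages: Messages) -> str:
    calls.append(model)
    return f"reply #{len(calls)}"

msgs = [{"role": "user", "content": "Hello"}]
first = cached_call("example-model", msgs, fake_send)
second = cached_call("example-model", msgs, fake_send)  # served from the cache
```

The second call never reaches `fake_send`, so it consumes no request or token quota. This only makes sense when a repeated, identical answer is acceptable for the query.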
Tips for High-Volume Usage
- Use faster, more cost-effective models for high-volume tasks
- Implement request queuing in your application
- Distribute requests across multiple API keys if needed
- Contact support for custom rate limit increases
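Request queuing can be as simple as spacing calls evenly so they never exceed a chosen requests-per-minute rate. The class below is a single-threaded sketch of that idea, not a TopRouter feature: `RequestPacer` and its parameters are hypothetical, and the `clock` and `sleep` arguments exist only so the pacing can be tested without real waiting.

```python
import time
from collections import deque
from typing import Any, Callable


class RequestPacer:
    """Run queued jobs no faster than a fixed requests-per-minute rate."""

    def __init__(
        self,
        rpm: int,
        clock: Callable[[], float] = time.monotonic,
        sleep: Callable[[float], None] = time.sleep,
    ) -> None:
        if rpm <= 0:
            raise ValueError("rpm must be positive")
        self.min_interval = 60.0 / rpm  # seconds between consecutive jobs
        self.clock = clock
        self.sleep = sleep
        self.queue: deque[Callable[[], Any]] = deque()
        self.next_allowed = 0.0  # earliest time the next job may start

    def submit(self, job: Callable[[], Any]) -> None:
        """Add a zero-argument callable (for example, one API call) to the queue."""
        self.queue.append(job)

    def drain(self) -> list[Any]:
        """Run all queued jobs in order, sleeping as needed between them."""
        results = []
        while self.queue:
            now = self.clock()
            if now < self.next_allowed:
                self.sleep(self.next_allowed - now)
            job = self.queue.popleft()
            # Schedule the following job one interval after this one starts.
            self.next_allowed = max(self.clock(), self.next_allowed) + self.min_interval
            results.append(job())
        return results
```

A caller would create `RequestPacer(rpm=60)`, pass each API call to `submit` as a lambda, and then call `drain()`. Setting `rpm` slightly below the actual limit leaves headroom for retries.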
INFO
Rate limits may vary by model and account tier. Check the Console for your current limits.
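Since limits differ by model and tier, it can help to read them from the response headers at runtime instead of hard-coding them. The sketch below turns the header values shown earlier into a suggested wait time. It assumes that reset values are unit-suffixed durations such as `30s`; compound forms like `1m30s` are accepted as a guess, since only the `Ns` form appears on this page. `parse_reset` and `suggested_delay` are hypothetical helper names.

```python
import re
from typing import Mapping

_DURATION_PART = re.compile(r"(\d+(?:\.\d+)?)(ms|h|m|s)")
_UNIT_SECONDS = {"h": 3600.0, "m": 60.0, "s": 1.0, "ms": 0.001}


def parse_reset(value: str) -> float:
    """Convert a duration such as '30s' into seconds (format is an assumption)."""
    text = value.strip()
    parts = _DURATION_PART.findall(text)
    # Reject input that the pattern does not cover completely.
    if not parts or "".join(num + unit for num, unit in parts) != text:
        raise ValueError(f"unrecognized duration: {value!r}")
    return sum(float(num) * _UNIT_SECONDS[unit] for num, unit in parts)


def suggested_delay(headers: Mapping[str, str]) -> float:
    """Seconds to wait before the next request; 0.0 while quota remains."""
    delay = 0.0
    for kind in ("requests", "tokens"):
        # A missing header is treated as "quota available".
        remaining = int(headers.get(f"x-ratelimit-remaining-{kind}", "1"))
        if remaining <= 0:
            reset = headers.get(f"x-ratelimit-reset-{kind}", "0s")
            delay = max(delay, parse_reset(reset))
    return delay


# Sample values in the same shape as the headers listed on this page.
example_headers = {
    "x-ratelimit-remaining-requests": "0",
    "x-ratelimit-reset-requests": "30s",
    "x-ratelimit-remaining-tokens": "95000",
    "x-ratelimit-reset-tokens": "15s",
}
print(suggested_delay(example_headers))
```

The function expects lowercase header names in a plain mapping. How you obtain the headers depends on your HTTP client or SDK, which this page does not specify.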
