Tag: LLM serving

How to Choose Batch Sizes to Minimize Cost per Token in LLM Serving

How to Choose Batch Sizes to Minimize Cost per Token in LLM Serving

by Phillip Ramos

Learn how to choose optimal batch sizes for LLM serving to cut cost per token by up to 87%. Discover real-world results, batching types, hardware trade-offs, and proven techniques to reduce AI infrastructure costs.

Transformer Efficiency Tricks: KV Caching and Continuous Batching in LLM Serving

Transformer Efficiency Tricks: KV Caching and Continuous Batching in LLM Serving

by Phillip Ramos

KV caching and continuous batching are essential for fast, affordable LLM serving. Learn how they reduce memory use, boost throughput, and enable real-world deployment on consumer hardware.

Categories

Archives

Recent-posts

Calibration and Outlier Handling in Quantized LLMs: How to Keep Accuracy When Compressing Models

Calibration and Outlier Handling in Quantized LLMs: How to Keep Accuracy When Compressing Models

Jul, 6 2025

Latency Optimization for Large Language Models: Streaming, Batching, and Caching

Latency Optimization for Large Language Models: Streaming, Batching, and Caching

Aug, 1 2025

Fine-Tuned Models for Niche Stacks: When Specialization Beats General LLMs

Fine-Tuned Models for Niche Stacks: When Specialization Beats General LLMs

Jul, 5 2025

Team Size Compression: How to Deliver More with Smaller, Leaner Teams

Team Size Compression: How to Deliver More with Smaller, Leaner Teams

May, 8 2026

Code Generation with LLMs: Boosting Productivity and Managing the Limits

Code Generation with LLMs: Boosting Productivity and Managing the Limits

Apr, 21 2026