Tag: LLM inference

Learn how to choose between NVIDIA A100, H100, and CPU offloading for LLM inference in 2025. See real performance numbers, cost trade-offs, and which option actually works for production.
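
As a quick back-of-the-envelope companion to that post: the first filter is usually whether the model's weights even fit in GPU memory, which is roughly parameters × bytes per parameter. Below is a minimal, illustrative Python sketch (hypothetical model sizes; it ignores KV-cache and activation memory, which add to the total and depend on batch size and context length):

```python
# Approximate bytes per parameter for common precisions.
BYTES_PER_PARAM = {"fp16": 2, "int8": 1, "int4": 0.5}

def weight_memory_gb(num_params_billion: float, dtype: str) -> float:
    """Approximate memory needed just to hold the model weights, in GB."""
    return num_params_billion * 1e9 * BYTES_PER_PARAM[dtype] / 1e9

for size in (7, 13, 70):  # common open-model sizes, in billions of parameters
    for dtype in ("fp16", "int8", "int4"):
        gb = weight_memory_gb(size, dtype)
        verdict = "fits" if gb <= 80 else "needs multi-GPU or CPU offloading"
        print(f"{size}B @ {dtype}: ~{gb:.0f} GB of weights ({verdict} on one 80 GB card)")
```

For example, a 70B-parameter model in fp16 is roughly 140 GB of weights alone, which is why a single 80 GB A100 or H100 forces quantization, multi-GPU sharding, or CPU offloading.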

KV caching and continuous batching are essential for fast, affordable LLM serving. Learn how they reduce memory use, boost throughput, and enable real-world deployment on consumer hardware.
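
To make the KV-caching idea concrete, here is a minimal NumPy sketch (toy layer/head sizes; the `KVCache` class and `decode_step` function are hypothetical names for illustration, not the implementation used by any serving framework): each decode step appends exactly one new key/value pair per layer and reuses everything already cached, instead of recomputing keys and values for the whole prefix.

```python
import numpy as np

# Toy dimensions for illustration only; real models are far larger.
NUM_LAYERS = 4
NUM_HEADS = 8
HEAD_DIM = 64

class KVCache:
    """Append-only key/value cache for a single sequence."""

    def __init__(self):
        self.keys = [[] for _ in range(NUM_LAYERS)]
        self.values = [[] for _ in range(NUM_LAYERS)]

    def append(self, layer, k, v):
        self.keys[layer].append(k)      # k, v: (NUM_HEADS, HEAD_DIM)
        self.values[layer].append(v)

    def get(self, layer):
        # Stack into (seq_len, NUM_HEADS, HEAD_DIM) arrays for attention.
        return np.stack(self.keys[layer]), np.stack(self.values[layer])

def decode_step(cache, layer, new_k, new_v, query):
    """Attention for one new token in one layer, reusing cached K/V."""
    cache.append(layer, new_k, new_v)
    keys, values = cache.get(layer)                        # (t, H, D)
    scores = np.einsum("hd,thd->th", query, keys) / np.sqrt(HEAD_DIM)
    weights = np.exp(scores - scores.max(axis=0, keepdims=True))
    weights /= weights.sum(axis=0, keepdims=True)          # softmax over time
    return np.einsum("th,thd->hd", weights, values)        # (H, D) output

# Three decode steps: the cache grows by one entry per step, nothing is recomputed.
rng = np.random.default_rng(0)
cache = KVCache()
for _ in range(3):
    k, v, q = (rng.normal(size=(NUM_HEADS, HEAD_DIM)) for _ in range(3))
    out = decode_step(cache, layer=0, new_k=k, new_v=v, query=q)
print(out.shape)  # (8, 64)
```

Continuous batching is the scheduling counterpart: instead of waiting for an entire batch to finish, the server admits new requests into the running batch as soon as earlier ones complete, keeping the GPU busy between decode steps.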

Recent posts

Predicting Performance Gains from Scaling Large Language Models

Mar 15, 2026

Data Privacy for Large Language Models: Principles and Practical Controls

Jan 28, 2026

Why Tokenization Still Matters in the Age of Large Language Models

Sep 21, 2025

Community and Ethics for Generative AI: How Transparency and Stakeholder Engagement Shape Responsible Use

Mar 22, 2026

When Vibe Coding Works Best: Project Types That Benefit from AI-Generated Code

Mar 23, 2026