Local AI in Practice - Part 3 of 3
Fast. Reliable. But are they actually good? Three models. Same hardware. Same constraints. All fast. All reliable. But when it actually matters, which one produces the best output?

The context

In Part 1, I benchmarked four Small Language Models (SLMs) on raw inference speed on CPU-only, constrained hardware. Llama 3.2:3b won every speed metric. In Part 2, I tested structured JSON output reliability across four schemas and four temperature settings. Gemma 3:4b delivered 100% success with zero retries; Qwen was eliminated on token budget grounds. ...
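For readers who want a feel for how a "structured JSON output reliability" check can work, here is a minimal sketch. It is not the harness used in Part 2; the helper names and the canned example outputs are hypothetical. The idea is simply: treat a response as a success only if it parses as JSON on the first attempt and contains every key the schema requires.

```python
import json

def is_valid(raw: str, required_keys: set) -> bool:
    # A response counts only if it parses as JSON *and* is an object
    # containing every key the schema requires.
    try:
        obj = json.loads(raw)
    except json.JSONDecodeError:
        return False
    return isinstance(obj, dict) and required_keys <= obj.keys()

def success_rate(responses: list, required_keys: set) -> float:
    # Fraction of responses that are schema-valid with zero retries.
    valid = sum(is_valid(r, required_keys) for r in responses)
    return valid / len(responses)

# Canned stand-ins for real model outputs (hypothetical examples):
outputs = [
    '{"name": "Ada", "score": 9}',          # valid JSON, all keys present
    'Sure! Here is the JSON: {"name": 1}',  # chatty preamble breaks parsing
]
print(success_rate(outputs, {"name", "score"}))  # 0.5
```

A stricter harness would also validate value types against a full JSON Schema, but even this first-parse check catches the most common failure mode of small models: wrapping the JSON in conversational text.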