
Running Large Language Models on Laptops: Practical Quantization Techniques in Python

Friday, May 15th, 2026, 2 p.m.–2:30 p.m. in Grand Ballroom A

Presented by

Aayush Kumar JVS

Description

Large Language Models are often assumed to require expensive GPUs and large cloud budgets. In practice, recent Python tooling makes it possible to run and experiment with LLMs efficiently on consumer hardware.

This talk focuses on practical quantization techniques for LLMs using Python, with an emphasis on real-world tradeoffs rather than theory. We will compare commonly used approaches such as QLoRA, bitsandbytes, GGUF, and GGML, and discuss how each impacts memory usage, latency, and output quality.

The examples and benchmarks in this talk are drawn from applied experimentation on real-world text datasets, but the focus remains on generalizable lessons for Python developers, not domain-specific claims. Attendees will see how quantization choices affect model behavior, when aggressive compression helps, and when it introduces unexpected failure modes.
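To make the core idea concrete, the numerical heart of quantization can be sketched in a few lines of plain Python. This is a minimal illustration of symmetric int8 quantization, not the implementation used by any of the libraries above; the function names and the toy weight values are invented for the example:

```python
def quantize_int8(weights):
    """Symmetric per-tensor quantization: map floats onto the int8 range [-127, 127]."""
    scale = max(abs(w) for w in weights) / 127.0
    q = [max(-127, min(127, round(w / scale))) for w in weights]
    return q, scale

def dequantize(q, scale):
    """Recover approximate float weights from the stored integers."""
    return [v * scale for v in q]

# Toy example: each weight now needs 1 byte instead of 4 (float32).
weights = [0.8, -1.2, 0.05, 2.4, -0.33]
q, scale = quantize_int8(weights)
restored = dequantize(q, scale)
errors = [abs(a - b) for a, b in zip(weights, restored)]
```

The rounding error per weight is bounded by half the scale, which is why a single large outlier weight (inflating `scale`) can degrade quality for the whole tensor; real formats mitigate this with per-block scales or outlier handling.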

By the end of the session, attendees will be able to:

  1. Understand when LLM quantization is appropriate

  2. Choose between common quantization formats and libraries

  3. Run LLMs locally on limited hardware using Python

  4. Avoid common pitfalls when moving from experimentation to deployment

This talk is aimed at Python developers who want to work with modern LLMs without relying exclusively on high-end infrastructure.
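The hardware argument comes down to simple arithmetic: weight memory is roughly parameter count times bits per weight. A back-of-envelope sketch (ignoring activations, the KV cache, and the per-block scale metadata that real formats such as GGUF also store, so actual footprints run somewhat larger):

```python
def model_memory_gb(n_params, bits_per_weight):
    """Rough weight-only memory footprint in gigabytes."""
    return n_params * bits_per_weight / 8 / 1e9

SEVEN_B = 7e9  # a typical "small" open-weight LLM

fp16_gb = model_memory_gb(SEVEN_B, 16)  # ≈ 14 GB: beyond most laptop GPUs
int4_gb = model_memory_gb(SEVEN_B, 4)   # ≈ 3.5 GB: fits in laptop RAM
```

This 4x reduction is what moves a 7B-parameter model from "needs a dedicated GPU" to "runs on consumer hardware", at the cost of the quantization error discussed above.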
