What Hardware Do You Need to Run AI Locally? A Practical Guide for IT Teams

Sasa Abe | 11 min read

When IT teams evaluate local AI deployment, the first question is almost always about hardware. The answer is simpler than most expect, but it hinges on one factor that is widely misunderstood: memory, not processing speed, determines what your local AI can do. This guide explains what mid-size firms need to know about hardware before deploying models like Google DeepMind's Gemma 4 through Ollama and the Hermes Agent — without overbuying or underbuilding.

The One Thing That Matters Most: Memory

Most IT teams instinctively focus on processor speed when evaluating hardware for AI. That is the wrong priority.

The single most important factor in local AI performance is how much memory is available to the model — specifically, how much of the model can sit in fast, GPU-accessible memory at once.

The best analogy is a commercial kitchen.

  • The GPU is the chef. A faster chef is better, but speed is not the bottleneck.
  • VRAM (or unified memory on a Mac) is the kitchen counter. This is where the chef actually works. If the counter is too small, the chef cannot lay out all the ingredients and has to keep going back and forth to storage.
  • System RAM is the storage room. It holds everything the chef might need, but pulling ingredients from storage is dramatically slower than working from the counter.

When a model fits comfortably in available memory, responses come back in seconds. When it does not fit, the system starts swapping data between fast memory and slow memory. Performance does not degrade gradually — it collapses. A model that responds in two seconds when it fits in memory might take thirty seconds or more when it does not.

For IT teams sizing hardware, the practical rule is straightforward: buy memory, not speed.

How Much Memory Do AI Models Actually Need?

AI models are measured by their parameter count. More parameters generally means more capability — but also more memory.

To make large models practical on standard hardware, they are compressed using a technique called quantisation. At 4-bit quantisation (the default in Ollama), the approximate memory requirements are:

  • 7-8 billion parameters — approximately 5 GB
  • 14 billion parameters — approximately 10 GB
  • 27-32 billion parameters — approximately 18-20 GB
  • 70 billion parameters — approximately 40 GB

These figures cover the model itself. On top of that, the system needs memory for the operating system, the agent framework, and the conversation context. Longer conversations consume more memory as the model holds more context in its working window.
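
The figures above follow from simple arithmetic: parameter count multiplied by bits per weight, plus some overhead. Here is a rough sketch in Python. The 20% overhead factor is our own illustrative assumption (to cover quantisation metadata and runtime buffers), not a number published by Ollama, so treat the output as a ballpark only.

```python
def estimate_model_memory_gb(params_billions: float,
                             bits_per_weight: float = 4.0,
                             overhead: float = 1.2) -> float:
    """Rough memory footprint of a quantised model, in GB.

    `overhead` is an assumed fudge factor (hypothetical) covering
    quantisation scales and runtime buffers.
    """
    bytes_per_weight = bits_per_weight / 8
    # One billion parameters at one byte each is roughly 1 GB.
    return params_billions * bytes_per_weight * overhead


for size in (8, 14, 27, 70):
    print(f"{size}B -> ~{estimate_model_memory_gb(size):.1f} GB")
```

The results land in the same range as the list above. Real model files vary by a few gigabytes depending on the exact quantisation scheme, so check the download size of the specific model before buying hardware.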

For the model we recommend — Gemma 4 27B running through Ollama — you need approximately 18 GB of available memory for the model, plus headroom for everything else. That is why 32 GB is our baseline recommendation.

Other strong open-source models worth knowing about include Llama 3 (Meta) and Mistral (Mistral AI), both available through Ollama. Gemma 4 remains our primary recommendation because of its benchmark performance, Apache 2.0 licensing, and native document handling capabilities, but the memory requirements across models of similar size are comparable.

The Mac Path: Unified Memory Makes It Simple

For most mid-size firms, our primary recommendation is an Apple Mac Mini or Mac Studio with 32 GB of unified memory.

Apple Silicon uses a unified memory architecture, which means the GPU and CPU share the same memory pool. In practical terms, the model can draw on most of the 32 GB, less whatever macOS and other running applications need, and there is no separate, smaller VRAM allocation to worry about.

This is a significant advantage for local AI deployment:

  • No VRAM bottleneck. The model accesses the full memory pool directly.
  • Simpler setup. Ollama runs natively on macOS with no driver configuration.
  • Quieter and more compact. No dedicated GPU means very little fan noise and a small physical footprint, which matters for firms deploying hardware in an office rather than a server room.
  • Energy efficient. Apple Silicon draws a fraction of the power of a dedicated GPU setup.

The trade-off is raw speed. A dedicated NVIDIA GPU will process tokens faster than Apple Silicon at the same memory tier. But for most professional workflows — drafting, summarisation, document review, internal knowledge search — the difference between a two-second and a four-second response is not meaningful. The Mac path prioritises simplicity, reliability, and the ability to deploy without specialist IT infrastructure.

Our recommended Mac configuration:

  • Apple Mac Mini (M-series) or Mac Studio with 32 GB unified memory
  • Runs Gemma 4 27B at Q4 quantisation (~18 GB) with comfortable headroom
  • Response times of two to four seconds for typical queries
  • Approximate cost: AUD $1,500 to $2,500 depending on configuration

For firms wanting to experiment before committing, a MacBook Pro with 16 GB can run smaller models like Gemma 4 E4B (the efficient 4.5 billion parameter variant) or Llama 3 8B for basic drafting and summarisation.

The PC Path: More Power, More Complexity

For firms with existing PC infrastructure or IT teams comfortable with GPU hardware, the PC path offers more raw performance — at the cost of additional setup complexity.

The critical component is the graphics card and its VRAM. This is where the kitchen counter analogy matters most. A PC might have 64 GB of system RAM, but if the GPU only has 8 GB of VRAM, the model will not run well.
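
To make the counter-versus-storage distinction concrete, here is a minimal sketch of the check an IT team can run on paper before buying. The function name and the 2 GB context allowance are hypothetical, chosen for illustration. The sketch also ignores the memory the operating system itself needs.

```python
def placement(model_gb: float, context_gb: float,
              vram_gb: float, system_ram_gb: float) -> str:
    """Classify where a model will end up running on a PC."""
    needed = model_gb + context_gb
    if needed <= vram_gb:
        return "fits in VRAM"
    if needed <= vram_gb + system_ram_gb:
        # Part of the model is served from system RAM: it runs, but slowly.
        return "spills into system RAM"
    return "does not fit"


# An ~18 GB model with an assumed 2 GB of context, on two different cards.
print(placement(18, 2, vram_gb=8, system_ram_gb=64))
print(placement(18, 2, vram_gb=24, system_ram_gb=64))
```

The first machine has far more total memory than the model needs, yet most of the model still ends up in the storage room rather than on the counter.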

Entry Level: 16 GB VRAM

A graphics card with 16 GB of VRAM — such as a current-generation mid-range NVIDIA card — paired with 64 GB of system RAM and a solid-state drive handles the basics well.

This setup runs 7-8 billion parameter models comfortably and can fit a 14 billion parameter model at 4-bit quantisation (approximately 10 GB), though with less room for long conversations. It is a reasonable starting point for firms that want to test local AI on existing infrastructure before investing further.

Best for: initial evaluation, basic drafting, simple summarisation, internal knowledge search with smaller models.

Mid-Range: 24 GB VRAM

A card with 24 GB of VRAM — the NVIDIA RTX 3090 remains excellent value in this tier — is the sweet spot for serious local deployment. At 24 GB, Gemma 4 27B fits comfortably with room for conversation context and agent overhead.

This is where local AI moves from experimental to genuinely productive. The firm can run the same capable models we recommend on the Mac path, with faster inference speeds on demanding tasks.

Best for: production deployment, running Gemma 4 27B or Mistral equivalents, multi-step agent workflows with the Hermes Agent.

High End: 24 GB VRAM with Fast Architecture

The current flagship consumer cards offer 24 GB with significantly faster memory bandwidth and processing. Combined with 128 GB of system RAM, this setup can experiment with 70 billion parameter models. A 70B model needs approximately 40 GB at 4-bit, so it either has to be quantised more aggressively or partly offloaded to system RAM, and responses will be slower as a result. Everything below that tier runs at impressive speeds.

Best for: firms running multiple concurrent users, experimenting with the largest open-source models, or handling high-volume document processing workflows.

Model Formats: A Detail IT Teams Should Know

Different model formats are optimised for different hardware. Using the wrong format means leaving performance on the table.

  • GGUF — the most common format for local deployment. Works well on both Mac and PC. This is the default format in Ollama.
  • MLX — optimised specifically for Apple Silicon. Can offer better performance on Macs than GGUF for some models.
  • AWQ — optimised for NVIDIA GPUs. Better inference speed on PC builds with compatible cards.

For most deployments through Ollama, the format selection is handled automatically. But IT teams doing manual model configuration or evaluating alternatives to Ollama (such as LM Studio or vLLM) should verify they are using the right format for their hardware.

Software: Two Tools Handle Most of the Setup

The software layer for local AI is surprisingly simple.

Ollama

Ollama is our recommended tool for hosting models locally. It is command-line based, runs on Mac, Windows, and Linux, and handles model downloads, memory management, and local serving. Running Gemma 4 is a single command:

ollama run gemma4:27b

Ollama exposes a local API endpoint that the Hermes Agent connects to directly. No cloud account, no API keys, no subscription.
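
For teams that want to call that endpoint from their own scripts, here is a minimal sketch using only the Python standard library. It assumes Ollama is running on the same machine on its default port (11434) and that the model tag from the command above has already been downloaded. The helper names `build_request` and `ask` are ours, not part of Ollama.

```python
import json
import urllib.request

# Ollama's default local address; change it if your install differs.
OLLAMA_URL = "http://localhost:11434/api/generate"


def build_request(prompt: str, model: str = "gemma4:27b") -> urllib.request.Request:
    """Build a non-streaming request for Ollama's generate endpoint."""
    payload = {"model": model, "prompt": prompt, "stream": False}
    return urllib.request.Request(
        OLLAMA_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def ask(prompt: str, model: str = "gemma4:27b") -> str:
    """Send the prompt to the local model and return its reply text."""
    with urllib.request.urlopen(build_request(prompt, model)) as resp:
        return json.loads(resp.read())["response"]
```

Calling `ask("Summarise this clause in one sentence: ...")` returns the model's reply as a string. Nothing leaves the machine, because the request goes to localhost.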

LM Studio

For IT teams that prefer a graphical interface, LM Studio provides a ChatGPT-style experience for local models. It handles the same model downloads and hardware optimisation as Ollama but wraps them in a visual interface. This can be useful for initial evaluation or for staff who will interact with the model directly rather than through the Hermes Agent.

Common Mistakes to Avoid

Based on what we see from firms attempting local AI deployment without guidance, these are the most frequent missteps.

Buying GPU speed instead of VRAM

A fast graphics card with 8 GB of VRAM will perform worse for AI workloads than a slower card with 24 GB. This is counterintuitive for IT teams used to evaluating hardware for rendering or gaming, but it is the most important hardware decision to get right.

Trying to run AI on a Raspberry Pi

Raspberry Pi devices are excellent for many IT tasks. Local AI is not one of them. The memory and processing constraints make them impractical for any model capable of producing useful professional output. We see this attempted regularly and it consistently disappoints.

Ignoring conversation length

A model that runs smoothly for short queries may slow down during long sessions. Each exchange adds to the context the model holds in memory. For workflows that involve lengthy documents or multi-step analysis, factor in the memory consumed by conversation context — not just the model itself.
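
The growth is easy to estimate. For a standard transformer, the context is held in a key/value cache whose size scales linearly with the number of tokens. Here is a rough sketch. The layer count, head count, and head dimension are hypothetical values for illustration and do not describe any specific model. Models that use sliding-window attention or a quantised cache will need less.

```python
def kv_cache_gb(context_tokens: int,
                n_layers: int,
                n_kv_heads: int,
                head_dim: int,
                bytes_per_value: int = 2) -> float:
    """Memory held by the attention key/value cache, in GB."""
    # Two tensors (keys and values) per layer, for every token in context.
    per_token_bytes = 2 * n_layers * n_kv_heads * head_dim * bytes_per_value
    return context_tokens * per_token_bytes / 1e9


# Hypothetical architecture numbers, for illustration only.
for tokens in (2_000, 8_000, 32_000):
    gb = kv_cache_gb(tokens, n_layers=48, n_kv_heads=8, head_dim=128)
    print(f"{tokens:>6} tokens -> ~{gb:.2f} GB of context memory")
```

Under these assumptions a short exchange costs a fraction of a gigabyte, while a long document review session can claim several gigabytes on top of the model itself.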

Assuming all 7B models are equal

Parameter count is a useful shorthand, but model architecture matters significantly. Gemma 4's 27B Mixture-of-Experts variant only activates 3.8 billion parameters per token, which is why it generates text at a speed closer to that of a much smaller model while delivering substantially better output. The full set of parameters still has to sit in memory, so the memory requirement follows the headline number. When evaluating models, check the architecture as well as the headline number.
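
The sketch below separates the two effects. It relies on a common rule of thumb: generation speed is limited by how many bytes of weights must be read per token. The 200 GB/s bandwidth figure is a hypothetical value, the memory number covers raw weights only, and the speeds are upper bounds rather than measurements.

```python
def moe_profile(total_params_b: float, active_params_b: float,
                bits_per_weight: float = 4.0,
                bandwidth_gb_s: float = 200.0) -> dict:
    """Rough memory need and speed ceiling for a (possibly sparse) model."""
    bytes_per_weight = bits_per_weight / 8
    # Every expert stays loaded, so memory follows the total parameter count.
    memory_gb = total_params_b * bytes_per_weight
    # Only the active weights are read for each generated token.
    read_per_token_gb = active_params_b * bytes_per_weight
    return {
        "memory_gb": memory_gb,
        "max_tokens_per_s": bandwidth_gb_s / read_per_token_gb,
    }


dense = moe_profile(total_params_b=27, active_params_b=27)
sparse = moe_profile(total_params_b=27, active_params_b=3.8)
print(dense)
print(sparse)
```

Both profiles report the same memory footprint, but the sparse model has a much higher speed ceiling. The practical lesson is unchanged: size the memory for the total parameter count.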

Skipping format matching

Running a GGUF model on hardware that would benefit from AWQ, or vice versa, leaves meaningful performance on the table. It takes five minutes to verify you are using the right format for your hardware. It is worth the check.

The AIRGAP LLM Perspective

"We consistently see firms overestimate the hardware they need and underestimate the importance of memory. A Mac Mini with 32 GB of unified memory running Gemma 4 through Ollama outperforms setups costing three times as much where the GPU was chosen for speed rather than memory. The right hardware decision is the one that matches your model to your memory — everything else is secondary."

— Sasa Abe, Co-Founder, AIRGAP LLM

Getting Started

If your firm is evaluating hardware for local AI deployment and wants to avoid costly missteps, contact AIRGAP LLM for a confidential consultation. We assess your use case, team size, and workflow requirements before recommending a hardware configuration — so you buy what you need and nothing you do not.

Frequently Asked Questions

Can we use hardware we already own?

Possibly. If your firm has machines with 16 GB or more of available memory (unified on Mac, or VRAM on PC), they may be suitable for smaller models. AIRGAP LLM can assess your existing infrastructure during the initial consultation and advise on whether it meets the requirements for your intended workflows.

Is 16 GB enough to get started?

For experimentation and lighter tasks, yes. A Mac with 16 GB of unified memory can run Gemma 4 E4B or Llama 3 8B for basic drafting, summarisation, and simple question-answering. For production use with Gemma 4 27B and the Hermes Agent, 32 GB is the practical minimum.

How much does a full local AI setup cost?

A Mac Mini with 32 GB of unified memory costs approximately AUD $1,500 to $2,500. The software is free — Ollama, the Hermes Agent (MIT license), and Gemma 4 (Apache 2.0 license) have no licensing fees. There are no ongoing per-query or per-user charges. For most firms, the total hardware cost is comparable to six to twelve months of enterprise cloud AI subscriptions.

Should we go Mac or PC?

For most mid-size firms without dedicated GPU infrastructure, Mac is the simpler path. Unified memory eliminates the VRAM bottleneck, setup is straightforward, and the hardware is compact and quiet enough for an office environment. PC builds offer more raw speed and flexibility but require more IT expertise to configure and maintain. Choose based on your team's existing capabilities and comfort level.

Sasa Abe

Co-Founder, AIRGAP LLM

Software engineer specialising in privacy-focused AI architecture, RAG systems, and local LLM deployment for data-sensitive organisations.

Want to See How This Works for Your Firm?

We'll walk you through a deployment that fits your setup — your documents, your infrastructure, your compliance requirements. No sales pitch.

Email us directly at hello@airgapllm.com.au