Mac 跑本地大模型用什么模型最合适

M1 M2 16GB 推荐 Qwen 2.5 7B 或 Llama 3.1 8B 跑起来流畅。M2 Pro Max 36GB 以上可以试 Qwen 2.5 32B 或 DeepSeek 32B 体验明显高一个档次。中文选 Qwen 代码任务选 Qwen Coder 或 DeepSeek Coder 英文创意选 Llama。

本地大模型能联网搜索吗

Ollama 默认不联网。如果想让模型联网要在前端层加搜索能力。Open WebUI 有官方的 Web Search 功能接入 SearXNG 或 Tavily API 等搜索后端模型可以先调搜索再生成回答。

Ollama 接 API 怎么用

Ollama 默认暴露 OpenAI 兼容的 REST API 端口 11434。把 Continue.dev Cline Cursor 的 API 端点改成 localhost 11434 v1 模型名填本地拉过的模型就能在编辑器里用本地模型跑 AI 编程。

本地大模型耗电吗

跑推理时显卡或 CPU 长时间高负载功耗确实显著高于待机。M2 MacBook Pro 跑 7B 模型整机功耗大致几十瓦量级。NVIDIA 4090 推理时单卡功耗几百瓦。Mac 笔记本跑久了底壳会明显发热但软硬件层都有保护。

我的本地模型为什么答非所问

模型太小 3B 以下能力有限。prompt 不清楚要更明确写完整。上下文不够 Ollama 默认 context 2048 长对话会被截断。模型量化太狠 Q2 会显著降低质量有条件用 Q4 以上。

Complete deployment tutorial of local large model, 2026 Use Ollama to run Llama and Qwen on your own computer

Q: 本地大模型耗电吗

跑推理时显卡或 CPU 长时间高负载 功耗确实显著高于待机。M2 MacBook Pro 跑 7B 模型整机功耗大致几十瓦量级。NVIDIA 4090 推理时单卡功耗几百瓦。Mac 笔记本跑久了底壳会明显发热但软硬件层都有保护。

📅 2026-05-21 11:18:55 👤 DouWen Editorial 💬 7 条评论 👁 3

Local large models will be an order of magnitude more playable in 2026 than they were two years ago. The parameters of open source models such as Llama series, Qwen series, DeepSeek distilled version, etc. range from billions to tens of billions. Both ordinary desktops and high-end notebooks can run one or two cost-effective models. The core benefits of local deployment are data privacy and zero quota anxiety, at the expense of graphics memory, memory and initial configuration threshold. This article uses the open source tool Ollama as the main line, from downloading, installing, running the model to connecting the front end, and explains the standard method of running large models on personal computers in 2026.

What is Ollama and what problem does it solve?

Ollama is an open source local large model running framework that has quickly become popular in the developer community since 2024. It packages the entire link of a large model from downloading, quantification, inference, and API exposure into one command, so novices can use it without understanding the model structure.

The pain points it solves are straightforward. In the past, to run a local large model, you had to install PyTorch or llama.cpp, download the original weights, write your own conversion script, and adjust the inference parameters. It took half a day to get started. Ollama hides all of this behind the scenes. To run a model, you only need the ollama run command.

Ollama supports macOS, Linux, and Windows across platforms, each with native installation packages. Particularly friendly to Apple Silicon users, the unified memory architecture of the M series chips allows Ollama to run large models significantly smoother than the Intel platform.

The first step is to check whether the hardware is qualified

When running local large models, video memory or unified memory is the first hard indicator. Rough correspondence: running a 7B model requires about 8GB of memory, 13B requires 16GB, 30B requires 32GB, and 70B requires 64GB to start. This is the minimum requirement for the 4-bit quantized version, and higher-precision versions will need to be doubled.

Please refer to several common scenarios for specific configuration. M1/M2/M3 MacBook Air 8GB can barely run 3B, and small models below 4B are recommended. The M2/M3 MacBook Pro 16GB is a sweet spot and runs the 7-13B model smoothly. M3 Max 36GB or M4 Pro 24GB can run 30B model at the beginning, and the generation speed is usable. The gaming PC is equipped with RTX 4070/4080/4090 12-24GB video memory, and runs the 13-30B model very smoothly.

Ordinary office notebooks with 8GB of memory can basically only play with 1-3B small models, and the inference speed is usable. If you want to seriously work with local large models, investing in a device with more than 16GB of memory is the basic threshold.

Step 2, download and install Ollama

Go to ollama.com to download the installation package for the corresponding system. Mac is .dmg, Windows is .exe, and Linux is the curl installation script. The installation process is very simple, just drag it directly to Applications on Mac, and double-click the installation on Windows to complete.

After the installation is complete, Ollama will run as a background service. Open the terminal and enter ollama --version. If the version number is displayed, the installation is successful.

An extra note for Mac users: Ollama listens to port 127.0.0.1:11434 by default. If you want to access from other devices on the LAN, set OLLAMA_HOST=0.0.0.0 in the system environment variable, and then restart the Ollama service.

Linux users can check the service status with systemctl status ollama. If the GPU is not recognized, you may need to install the NVIDIA Container Toolkit or ROCm driver, depending on your graphics card model.

The third step is to pull the first model

Ollama's model library covers mainstream open source models. Common entry options:

The Llama series is produced by Meta and has strong general capabilities, and its English performance is better than Chinese. Command ollama pull llama3.1:8b to pull an 8 billion parameter version, the default 4-bit quantization is about 4-5GB.

The Qwen series is produced by Alibaba. It has strong Chinese skills and good coding skills. ollama pull qwen2.5:7b pulls the 7 billion parameter version. The new generation Qwen3 is also online in the Ollama library and you can try it.

The DeepSeek series is well optimized for coding tasks. ollama pull deepseek-r1:7b pulls a reasoning optimized version, which looks small but has good logical reasoning capabilities.

The Phi series is a small model produced by Microsoft. It can run on small devices with 3-4B parameters and 4GB memory. ollama pull phi3:mini.

Gemma is Google's open source model, with various specifications of 2-9B. ollama pull gemma2:9b is a relatively easy to use medium model.

The first pull will take several GB to download, so be patient. After the download is complete, use ollama list to view existing local models.

Step 4: Run the model dialogue

The most direct experience: enter ollama run qwen2.5:7b in the terminal, and you can have a direct conversation after the model is loaded.

It takes tens of seconds to load the model for the first time, and it will be much faster to start it later. The M2 MacBook Pro 16GB runs a 7B model, and the generation speed is roughly dozens of tokens per second. The feel is close to the ChatGPT web version and smooth.

Exit the conversation with /bye or Ctrl+D. Ollama will keep the model in memory for a few minutes, during which time it will start again when it is started again. If you want to free the memory immediately, run ollama stop qwen2.5:7b.

Advanced usage is to use parameters to control the build quality. ollama run qwen2.5:7b After entering the dialogue, enter /set parameter temperature 0.3 to lower the temperature. The model answer will be more stable; above 0.8, it will be more creative.

Step 5: Connect to Open WebUI to make the interface look like ChatGPT

The terminal experience is not friendly enough for ordinary users. Open WebUI is the most popular Ollama front-end in the open source community. Its interface is close to ChatGPT and supports multi-session, Markdown, code highlighting, RAG and other functions.

The fastest way to install is Docker, a one-line command:

docker run -d -p 3000:8080 -v open-webui:/app/backend/data --name open-webui ghcr.io/open-webui/open-webui:main

After running, visit localhost:3000 with the browser and register the first account (a local account, which will not go to the cloud). In the settings, Ollama Endpoint points to host.docker.internal:11434 by default and it will automatically recognize the model you pulled.

The experience after that is almost the same as ChatGPT. You can create multiple conversations, switch between different models to compare effects, upload files for RAG Q&A, and all the data is in your local area.

If you don’t want to install Docker, Open WebUI also supports pip installation and running in Python environment, but Docker isolation and clean isolation are more recommended.

Model selection, Chinese scene measurement suggestions

The most common question when running a local large model is "which model to choose?" Here are some practical suggestions based on scenarios.

For Chinese writing and daily conversation, Qwen 2.5 7B or Qwen 3 series is the first choice. Chinese expression is natural and smooth, and the knowledge deadline is relatively new.

For coding tasks, the DeepSeek Coder series and Qwen 2.5 Coder series are both at the top level. The 7B version can complete most daily coding tasks, and the 30B version is close to the first-line closed source model level.

For English writing and creativity, Llama 3.1/3.2 series and Mistral series perform better than Chinese models, but Chinese support is slightly weaker.

If the hardware is too tight and can only run below 3B, Phi3 mini is one of the best overall among 3-4B, and Gemma 2B can also be used in emergencies.

The comprehensive capabilities of 70B-level models (such as Llama 3.3 70B, Qwen 2.5 72B) are close to the early level of GPT-4, but they require more than 64GB of memory to run. Do not try with ordinary configurations.

Several common tips for performance optimization

Stuttering is the most common problem encountered by novices. Several common optimization directions.

First, choose the right size. If the hardware is not enough, choose a small model. Don't forcefully install a large model and expect miracles. The smooth running experience of 13B is much better than the lag of 30B.

Second, give priority to using the GGUF quantified version. Ollama provides the quantized version by default, usually Q4_K_M or Q5_K_M. If the quality requirements are high and the video memory is sufficient, you can pull the Q8 version (the command has the suffix: 8b-q8_0). The answer quality will be significantly improved at the cost of doubling the video memory.

Third, turn off unnecessary background programs. When inferring local large models, the graphics memory and memory usage are very high. Opening dozens of browser tabs, IDEs, and Docker containers running at the same time will significantly slow down the inference.

Fourth, control the context length. Ollama's default context is 2048 tokens, and long contexts consume more video memory. If you only do short questions and answers, this default value is just right; if you do long document summaries, you need to set a large context, at the cost of slowness.

Real usage scenarios of local large models

Many people install large local models and then leave them idle for a few days because they have not found a suitable scene. Three really useful directions.

The first is privacy-sensitive conversations and document processing. When it comes to business contracts, internal documents, and personal privacy data, local operation can completely avoid the compliance risks of going to the cloud.

The second is stable auxiliary workflow. For example, batch translation, batch summarization, and batch generation of structured data. The local model has no speed limit, no limit, and can run without an Internet connection. It is suitable for unattended tasks.

The third is exploration and learning. Learn the concepts of RAG, Function Call, and Agent, and experiment with local models for free. The cost of failure is 0, and it is much faster to understand than just reading the documentation.

If you just chat on a daily basis and occasionally write documents, cloud ChatGPT or domestic large models are enough. The real value of local large models lies in the three dimensions of batch size, privacy, and controllability.

FAQ

What model is most suitable for running local large models on Mac?

M1/M2 16GB recommends Qwen 2.5 7B or Llama 3.1 8B. It runs smoothly and can be used in both Chinese and English. For M2 Pro/Max 36GB or above, you can try Qwen 2.5 32B or DeepSeek 32B, and the experience will be significantly higher. If you want to do Chinese, choose Qwen. If you want to do coding tasks, choose Qwen Coder or DeepSeek Coder. If you want to do English creativity, choose Llama. Mac Pro or Mac Studio 64GB and above can challenge the 70B model.

Can local large models be searched online?

Ollama is not connected to the Internet by default and only runs local inference. If you want the model to be networked, you need to add search capabilities to the front-end layer. Open WebUI has an official Web Search function, which is connected to search backends such as SearXNG or Tavily API. The model can first adjust the search and then generate answers. You can also use LangChain, LlamaIndex and other frameworks to build the search + RAG process yourself. The effect of this combination is close to the browsing experience of ChatGPT, but the configuration threshold is higher than that of pure conversation.

How to use Ollama to connect to API

Ollama exposes the OpenAI-compatible REST API by default, port 11434, and most tools that support the OpenAI protocol can connect directly. For example, if you change the API endpoints of Continue.dev, Cline, and Cursor to http://localhost:11434/v1, fill in the model name with the locally pulled model, you can use the local model to run AI programming in the editor. Note that local model code capabilities are weaker than cloud Claude/GPT, and are suitable for simple tasks or scenarios where you don’t want to spend money.

Does the local large model consume power?

When running inference, the graphics card or CPU is under high load for a long time, and the power consumption is indeed significantly higher than in standby. The M2 MacBook Pro runs the 7B model, and the power consumption of the entire machine during continuous generation is on the order of tens of watts. NVIDIA 4090 desktop graphics card consumes several hundred watts per card during inference. If you do long-term batch tasks, it is recommended to pay attention to heat dissipation and electricity costs. The bottom case of a Mac notebook will obviously heat up after running for a long time, but the software and hardware layers are protected from damaging the device.

Why is my local model not answering the question?

Several common reasons. First, the model is too small. Models below 3B have limited capabilities. It is normal to get wrong answers. Switching to a 7B or larger model will immediately improve it. The second is that the prompt is unclear. The local model is not like ChatGPT that can "guess" your intention. The background and requirements must be written more clearly and completely. Third, the context is not enough. Ollama defaults to context 2048. Long conversations will be truncated and the previous text will be forgotten. You need to set OLLAMA_NUM_CTX in the configuration or increase max tokens in Open WebUI. Fourth, the model quantization is too harsh. The 2-bit quantization of Q2 will significantly reduce the quality. If possible, use Q4 or above.

📝 本文来自抖文 www.douwen.me ，转载请保留出处。

原文链接：https://douwen.me/archives/1116/

💬 评论 (7)

GrowthHacker 2026-05-21 10:35 回复

Step-by-step is gold.

AIWatcher 2026-05-20 18:08 回复

Practical tips not fluff.

DataNerd 2026-05-21 09:58 回复

Easy to follow.

DataNerd 2026-05-21 02:00 回复

Best summary I've read on this.

GrowthHacker 2026-05-21 10:55 回复

Sharing this with my team.

DigitalNomad 2026-05-20 16:22 回复

Loved the FAQ section.

AIWatcher 2026-05-21 10:04 回复

Solid breakdown, very useful.