Dr.W posted May 06, 2026 04:48 PM
Lenovo Legion Pro 7i: 16" 2.5K 240Hz OLED, Intel Ultra 9 275HX, RTX 5090, 32GB DDR5, 1TB SSD $2999
$2,999 ($3,999 list, 25% off) at B&H Photo Video
I'm leaning toward a desktop setup but this seems to be a good deal and portability is nice to have.
too lazy to re-read again and fix some of the formatting.
1. The Core Problem: Resource Contention in my use-case
Running local LLMs, image generation, and video generation simultaneously on a single machine creates severe bottlenecks, even with 24GB or 32GB of VRAM.
The Conflict: When running ComfyUI (media) and LM Studio (LLMs) together, they fight for VRAM and GPU cycles. Neither app respects the other's resource limits. Tuning GPU slices is largely ineffective in this scenario.
The "Land Grab": One process will seize VRAM, starving the other. If VRAM fills up, you hit an Out of Memory (OOM) crash.
LM Studio: Smartly stops loading models if VRAM is insufficient. However, if LM Studio is already loaded, it rarely releases VRAM when ComfyUI starts, causing ComfyUI to starve.
ComfyUI: If it crashes due to OOM, it often hangs or pauses (especially if launched via default batch files), killing the workflow until manual intervention.
The Queue Killer: Both apps have internal queues. Run separately, this works well. Run together, they don't share queues, so each one fires off its own generation at the same time, slowing the machine to a crawl or triggering OOM errors.
Performance Note: Even with a Desktop RTX 5090, heavy generation (30s–600s+) blocks the GPU entirely. Nothing else can run on the GPU during this time.
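Not part of the original write-up, but if you want to watch the VRAM land grab in numbers, here's a minimal sketch using NVIDIA's NVML bindings (pip install nvidia-ml-py). The 4 GB threshold is just an example value, not anything from the post:

```python
# Minimal VRAM headroom check before queuing a generation.
# Assumption: pynvml installed via `pip install nvidia-ml-py`; threshold is arbitrary.
import pynvml

MIN_FREE_BYTES = 4 * 1024**3  # example: require 4 GiB free before starting

pynvml.nvmlInit()
handle = pynvml.nvmlDeviceGetHandleByIndex(0)      # GPU 0
mem = pynvml.nvmlDeviceGetMemoryInfo(handle)       # total/free/used in bytes

print(f"VRAM used: {mem.used / 1024**3:.1f} GiB / {mem.total / 1024**3:.1f} GiB")
if mem.free < MIN_FREE_BYTES:
    print("Not enough headroom: another process has already grabbed the VRAM.")
else:
    print("Enough headroom to queue a generation.")

pynvml.nvmlShutdown()
```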
2. Hardware Constraints & Model Limits
A. Laptop RTX 5090 (24GB VRAM)
LLMs: Limited to ~18–20B parameter models.
Qwen 3.5 (9B): Runs well.
Gemma 4 (20B): Usable but unsatisfying for complex tasks.
Qwen 3.6 (27B/35B): Will not fit in VRAM. If forced into system RAM, it becomes painfully slow, so I removed these models to prevent accidental loading.
Images/Video: Works, but you must use the --lowvram flag in ComfyUI to prevent OOM. I used this for LTX video models. While newer low-VRAM workflows exist, this was a necessary workaround at the time.
Concurrency: Local LLMs are significantly slower than online models (~100 tokens/sec for 35B). You cannot run multiple simultaneous coding sessions or sub-agents. For real development, I rely on online APIs for concurrency.
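For anyone sanity-checking these size limits: weights alone scale roughly with parameter count times bits per parameter, and the KV cache plus whatever ComfyUI has loaded must fit on top. A purely illustrative calc (the bit-widths are my assumptions, not measurements from this post):

```python
# Rough VRAM needed for LLM weights alone; KV cache and other apps (ComfyUI)
# need headroom on top of this. Bits-per-parameter values are illustrative:
# ~8 for a Q8-style quant, ~4.5 for a typical Q4_K-style quant.
def weights_gb(params_billion, bits_per_param):
    return params_billion * 1e9 * bits_per_param / 8 / 1024**3

for size in (9, 20, 27, 35):
    print(f"{size:>2}B: ~{weights_gb(size, 8):.0f} GB at 8-bit, "
          f"~{weights_gb(size, 4.5):.0f} GB at ~4.5-bit")
```

On a 24 GB card the bigger models leave little or no room for long context plus ComfyUI's own models, which lines up with the behavior described above.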
B. Desktop RTX 5090 (32GB VRAM)
Better Capacity: Can handle larger models than the laptop, but still faces OOM risks during long video generations.
System RAM is Critical: I observed my desktop using all 32GB of system RAM. Upgrade to 64GB+ if you can (this will drive the price to ~$4000). If local LLMs overflow VRAM, they will bleed into system RAM. More system RAM prevents crashes during these overflow events.
Performance: I hit ~200 tokens/sec on a 35B Qwen model. While fast for a single stream, it lacks the concurrency of online APIs.
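If you want to reproduce tokens/sec numbers like this, LM Studio exposes an OpenAI-compatible server (on localhost:1234 by default). A minimal timing sketch, assuming a model is already loaded; the model name and prompt are placeholders:

```python
# Rough tokens/sec measurement against LM Studio's OpenAI-compatible server.
# Assumptions: server running on its default port (1234) with a model loaded;
# "local-model" is a placeholder name.
import time
import requests

payload = {
    "model": "local-model",
    "messages": [{"role": "user", "content": "Explain KV cache in one paragraph."}],
    "max_tokens": 256,
}

start = time.time()
resp = requests.post("http://localhost:1234/v1/chat/completions", json=payload, timeout=600)
elapsed = time.time() - start

completion_tokens = resp.json()["usage"]["completion_tokens"]
print(f"{completion_tokens} tokens in {elapsed:.1f}s -> {completion_tokens / elapsed:.1f} tok/s")
```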
How I overcome OOM hangs or other problems...
3. My Solution: The "Watchdog" Strategy
To run unmanned tasks overnight without babysitting the machine, I implemented a custom Watchdog Wrapper.
Priority Hierarchy: I prioritize Image/Video generation over LLM requests. Local media generation is significantly faster on NVIDIA chips than on my Mac.
The Watchdog Script Workflow:
Monitor: Watches memory and process status.
Kill Switch: If a ComfyUI request starts, the script instantly kills LM Studio to free up VRAM. I don't care about pending LLM requests; they're negligible these days (mostly used for meta-prompting now).
Auto-Recovery: If ComfyUI crashes due to OOM (which happens occasionally with long video clips >30s), the script detects it and automatically restarts ComfyUI. This allows for smoother, longer generations without manual stitching.
Restore: Once ComfyUI finishes, it restarts LM Studio to handle any pending LLM requests.
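This isn't the author's actual script, just a minimal sketch of the same watchdog idea. Assumptions: ComfyUI is launched as a child process with --lowvram, LM Studio is killed/relaunched by process name and executable path, and ComfyUI's HTTP API exposes a queue_remaining counter on GET /prompt. Paths, ports, and the poll interval are placeholders.

```python
# Watchdog sketch: media generation wins, LLM server gets killed and restored.
# Requires: psutil, requests. Not the author's script; values are placeholders.
import subprocess, time
import psutil, requests

COMFY_CMD = ["python", "main.py", "--lowvram"]   # run from the ComfyUI folder
COMFY_URL = "http://127.0.0.1:8188/prompt"       # ComfyUI's default port
LMSTUDIO_EXE = r"C:\path\to\LM Studio.exe"       # placeholder path
POLL_SECONDS = 5

def comfy_queue_remaining():
    """Return pending ComfyUI jobs, or None if the API is unreachable."""
    try:
        return requests.get(COMFY_URL, timeout=2).json()["exec_info"]["queue_remaining"]
    except Exception:
        return None

def lmstudio_procs():
    return [p for p in psutil.process_iter(["name"])
            if p.info["name"] and "lm studio" in p.info["name"].lower()]

comfy = subprocess.Popen(COMFY_CMD)
while True:
    remaining = comfy_queue_remaining()
    if remaining is None and comfy.poll() is not None:
        # ComfyUI died (usually OOM on a long video clip): auto-recover.
        comfy = subprocess.Popen(COMFY_CMD)
    elif remaining and remaining > 0:
        # Media work pending: kill LM Studio immediately to hand VRAM to ComfyUI.
        for p in lmstudio_procs():
            p.kill()
    elif remaining == 0 and not lmstudio_procs():
        # ComfyUI idle again: bring LM Studio back for pending LLM work.
        subprocess.Popen([LMSTUDIO_EXE])
    time.sleep(POLL_SECONDS)
```

A hang-detection timeout (kill and relaunch ComfyUI if the API stays unreachable while the process is still alive) would cover the "hangs or pauses" case too.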
4. Why I Don't Use My Mac for LLMs (lol, I mean I do use it, but not every single time and definitely not for normal coding sessions!)
Speed: Even with 128GB of RAM, running large models like Qwen Coder Next (80B) or Qwen 27B is painfully slow (~20 tokens/sec). QCN was faster, I think ~60 tok/s?
Frustration: Watching a local LLM process prompts and generate tokens one by one is too slow for active coding.
Strategy: I use the M5 Max for general tasks and lighter LLM inference, but I rely on online models for:
Real-time coding assistance.
Running sub-agents (Hermes/OpenClaw).
High-concurrency tasks (5+ simultaneous requests).
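For context on the concurrency point, this is the kind of fan-out that's trivial against an online endpoint (or LM Studio's local server, which speaks the same API) but that a single local GPU can't keep up with while it's also generating media. Sketch only; the base_url, key, and model name are placeholders:

```python
# Firing 5 requests concurrently at an OpenAI-compatible endpoint (sketch only;
# base_url, api_key, and model are placeholders).
import asyncio
from openai import AsyncOpenAI

client = AsyncOpenAI(base_url="https://api.example.com/v1", api_key="sk-placeholder")

async def ask(prompt: str) -> str:
    resp = await client.chat.completions.create(
        model="placeholder-model",
        messages=[{"role": "user", "content": prompt}],
    )
    return resp.choices[0].message.content

async def main():
    prompts = [f"Sub-agent task #{i}" for i in range(5)]
    answers = await asyncio.gather(*(ask(p) for p in prompts))  # 5 requests in flight at once
    for a in answers:
        print(a[:80])

asyncio.run(main())
```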
someone mentioned something about VMs. lol
VMs: I run VMs on my Mac (not the RTX machines) because the Mac has more CPU cycles and RAM available. E.g., I host Hermes in a VM on the Mac, which calls local LM Studio on the Mac for a semi-local setup. However, true "local" agents are useless without internet access for external tools. My other true 24/7 Hermes agent will use an online model; I already have a general sub and I can use it until it's done.
5. Is Local AI GEAR Worth It?
For Learning: Yes. It allows for rapid iteration and experimentation.
For Cost: I generated ~3,000 video clips locally. Doing this via online APIs would have cost $1,000–$1,500 (rough math after this list).
As online subsidies decrease and prices rise, local hardware is becoming cost-effective.
For Privacy/Uncensored Use: Not necessary for me, as I don't fear data theft from online APIs and there's no porn coming out of my RTX GPUs.
For Coding: No. Online APIs still win on speed and concurrency.
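Rough math on the cost point above; the per-clip price is just implied by the $1,000–$1,500 range in the post, not a quoted rate:

```python
# Back-of-envelope ROI on ~3,000 local clips vs. online APIs.
clips = 3000
implied_per_clip = (1000 / clips, 1500 / clips)   # ~$0.33-$0.50 per clip
hardware_cost = 2999                              # this deal's laptop price

print(f"Implied online price per clip: ${implied_per_clip[0]:.2f}-${implied_per_clip[1]:.2f}")
print(f"Clips needed to cover the hardware at $0.50/clip: {hardware_cost / 0.5:,.0f}")
```

So the hardware doesn't pay for itself on video generation alone, but it chips away at it while also covering the LLM and coding use cases.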
6. Final Recommendation
If you need portability: The Laptop 5090 is great for coding on the go, but the battery dies instantly under GPU load.
If you need power: The Desktop 5090 is a workhorse, but ensure you have 64GB+ of system RAM.
Don't splurge blindly: If you plan to do overnight processing or heavy experimentation, the Watchdog strategy and proper hardware configuration are essential. Without them, you will spend more time fixing crashes than generating content. Unless someone else has a better strategy, I just decided to create something that works for me.
I also have some algo to load-distribute image and video requests across both machines, but that's something you can build (a rough sketch is below). All of my machines are always connected to my MacBook no matter where I am. I can generate a lot of nice graphics, for example, and I've used them for "real work" at times.
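The post doesn't share the actual load-distribution algo, so here's a hypothetical "send it to the least busy box" version, assuming each machine runs ComfyUI on its default port; the hostnames are placeholders:

```python
# Hypothetical least-busy dispatcher across two ComfyUI machines (sketch only;
# hostnames are placeholders, queue depth comes from ComfyUI's GET /prompt).
import requests

MACHINES = ["http://desktop-5090:8188", "http://laptop-5090:8188"]  # placeholder hostnames

def queue_depth(base_url: str) -> int:
    try:
        return requests.get(f"{base_url}/prompt", timeout=2).json()["exec_info"]["queue_remaining"]
    except Exception:
        return 10**9   # unreachable machine -> treat as infinitely busy

def pick_machine() -> str:
    return min(MACHINES, key=queue_depth)

print("Next job goes to:", pick_machine())
```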
Of course you can use nano banana and it would be cheaper, but sometimes I just auto-generate 10–100 concept prompt variations and go back and pick the one I liked best. That costs money and hits rate limits with Gemini or OpenAI's image models. There are also super cheap online image services, like 2 cents a picture, so the ROI is very, very slow. I also built app interfaces so I can create images/videos without going into ComfyUI. Again, existing online services are available; I just felt like replicating it as part of the learning, checking off a box on something I know how to do.
yes, I typed out the key points then AI-edited it but still needed some manual editing.
Oh, the only thing the Mac is good for LLM-wise is very, very large models, but I really have no need for that. The 128GB of RAM is overkill; I could have gotten away with 64GB since I doubt I will go past 35B model use on the Mac.
Any image/video generation will suck on the Mac compared to RTX GPUs. Don't buy a Mac at all for content generation. I don't even bother installing ComfyUI on the MacBook, even though it's an M5 Max with 40 GPU cores, because I can just route image/video requests to my RTX machines; they're always connected as long as I have internet access on my Mac.
Much appreciated.
One example here where this guy got 40 tokens/s on Qwen 3.6 35B with some value tweaks on a 3060 card with only 8GB of VRAM.
Also, someone gave me PC parts to build a 5080 desktop for them (and I get to use it first), but because it only has 16GB of VRAM, I've left it in the corner for a month already cuz I don't feel like building it and dealing with just 16GB of VRAM. lol.
https://youtu.be/8F_5pdcD3HY?si=
I just haven't bothered because I will still run into the concurrency issue and competing with ComfyUI. There are also turbo quants that reduce RAM usage of something called the KV cache, which I didn't try because I don't need to use the local LLM heavily yet and I have enough VRAM for now.
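For anyone curious what quantizing the KV cache actually saves: KV memory grows with layers × kv-heads × head-dim × context length × bytes per element. Illustrative numbers only; the model config below is a placeholder, not a real Qwen config:

```python
# Rough KV-cache size estimate, showing why dropping it from fp16 to q8/q4 helps.
# The layer/head/dim values are placeholders, not any real model's config.
def kv_cache_gb(layers, kv_heads, head_dim, context_len, bytes_per_elem):
    # 2x for keys + values, one entry per layer per kv-head per position
    return 2 * layers * kv_heads * head_dim * context_len * bytes_per_elem / 1024**3

for label, bytes_per_elem in (("fp16", 2), ("q8", 1), ("q4", 0.5)):
    gb = kv_cache_gb(layers=60, kv_heads=8, head_dim=128,
                     context_len=32768, bytes_per_elem=bytes_per_elem)
    print(f"{label}: ~{gb:.1f} GB at 32k context")
```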
The 395 AI Max 128GB and the Mac both use an SoC.