I’m doing a lot of coding, and what I would ideally like is a long-context model (128k tokens) that I can throw my whole codebase into.

I’ve been experimenting with Claude, for example, and what usually works well is attaching the whole architecture of a CRUD app along with the most recent docs of the framework I’m using; it’s okay for menial tasks. But I am very uncomfortable sending any kind of data to these providers.

Unfortunately I don’t have a lot of space, so I can’t build a proper desktop. My options are either renting a VPS or going for something small like a Mac Studio. I know speeds aren’t great, but I was wondering whether using something like RAG for documentation could get me decent speeds.
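
To make that concrete, here’s a minimal sketch of what I mean by RAG over the docs (the embedding library, model name, file name, and chunk size are all placeholder assumptions, not a recommendation): instead of pasting the whole documentation into the prompt, only the few most relevant chunks get retrieved, so the prefill stays short.

```python
# Rough RAG sketch over framework docs (sentence-transformers assumed,
# naive fixed-size chunking, file name is a placeholder).
from sentence_transformers import SentenceTransformer, util

embedder = SentenceTransformer("all-MiniLM-L6-v2")  # small local embedding model

def chunk(text: str, size: int = 800) -> list[str]:
    return [text[i:i + size] for i in range(0, len(text), size)]

docs = open("framework_docs.md").read()             # hypothetical docs dump
chunks = chunk(docs)
chunk_emb = embedder.encode(chunks, convert_to_tensor=True)

query = "How do I define a many-to-many relation?"
query_emb = embedder.encode(query, convert_to_tensor=True)
hits = util.semantic_search(query_emb, chunk_emb, top_k=4)[0]

# Only the top few chunks go into the LLM prompt instead of the whole docs,
# which keeps the context (and the prefill time) small.
context = "\n---\n".join(chunks[h["corpus_id"]] for h in hits)
prompt = f"{context}\n\nQuestion: {query}"
```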

I’ve read that Macs become very slow, especially at larger contexts. I’m not entirely convinced, but I could probably get a new one at 50% off as a business expense, so the Apple tax is less of a concern than the speed.

Any ideas? Are there other mini PCs available that might have a better architecture for this? I tried researching but couldn’t find much.

Edit: I found some stats on GitHub for different models: https://github.com/ggerganov/llama.cpp/issues/10444

Based on that I also conclude that you’re gonna wait forever if you work with a large codebase.

  • shaserlark@sh.itjust.works (OP) · 7 hours ago

    Thanks for the reply, still reading here. Yeah, thanks to the comments and some benchmarks I read, I abandoned the idea of getting an Apple; it’s just too slow.

    I was hoping to test Qwen 32B or Llama 70B with longer contexts, which is why the Apple seemed appealing.

    • brucethemoose@lemmy.world · edited · 4 hours ago

      Honestly, most LLMs suck at the full 128K. Look up benchmarks like RULER.

      In my personal tests over API, Llama 70B is bad out there. Qwen (and any fine-tune based on Qwen Instruct, with maybe an exception or two) not only sucks but is impractical past 32K once its internal RoPE scaling kicks in. Even GPT-4 is bad out there, with Gemini and some other very large models being the only usable ones I found.

      So, ask yourself… Do you really need 128K? Because 32K-64K is a boatload of code with modern tokenizers, and that is perfectly doable on a single 24G GPU like a 3090 or 7900 XTX, and that’s where models actually perform well.
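
      If you want to sanity-check that against your own repo, a rough token count is quick to do. This is just a sketch; the tokenizer name and the source glob are placeholder assumptions, any modern tokenizer gives a similar ballpark.

      ```python
      # Quick token count of a codebase against a 32K-64K context budget.
      # (Tokenizer choice is just an example; point the glob at your own code.)
      from pathlib import Path
      from transformers import AutoTokenizer

      tok = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-Coder-32B-Instruct")

      total = 0
      for path in Path("src").rglob("*.py"):   # hypothetical source tree
          total += len(tok.encode(path.read_text(errors="ignore")))

      print(f"{total} tokens vs. a 32K-64K budget")
      ```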