• uis@lemm.ee · 21 days ago

    Aren’t LLMs external memory algorithms at this point? As in, all the data will not fit in RAM.

    • brucethemoose@lemmy.world · 21 days ago

      No, all the weights, essentially all the “data,” have to be in RAM. If you “talk to” an LLM on your GPU, it is not making any calls to the internet; it is making a pass through all the weights every time a word is generated.

      There are systems that augment the prompt with external data (RAG is one term for this), but fundamentally the model itself is closed.
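
      Rough sketch of what that loop looks like (a minimal example with Hugging Face transformers and greedy decoding; the model name and the “retrieved” snippet are placeholders, and the KV cache is skipped for clarity):

          # Every generated token requires a full forward pass through *all* the
          # weights, which is why they have to sit in (V)RAM. RAG only changes
          # the text that goes into the prompt; the model itself stays closed.
          import torch
          from transformers import AutoModelForCausalLM, AutoTokenizer

          name = "Qwen/Qwen2.5-7B-Instruct"  # placeholder model
          tok = AutoTokenizer.from_pretrained(name)
          model = AutoModelForCausalLM.from_pretrained(name, torch_dtype=torch.float16).to("cuda")

          retrieved = "Paris is the capital of France."  # came from some external lookup (RAG)
          prompt = f"Context: {retrieved}\nQ: What is the capital of France?\nA:"

          ids = tok(prompt, return_tensors="pt").input_ids.to("cuda")
          for _ in range(20):                      # one forward pass per new word/token
              logits = model(ids).logits           # reads every weight tensor
              next_id = logits[0, -1].argmax()     # greedy: pick the most likely token
              ids = torch.cat([ids, next_id.view(1, 1)], dim=-1)
          print(tok.decode(ids[0]))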

      • Hackworth@lemmy.world · 21 days ago

        Yeah, I’ve had decent results running the 7B/8B models, particularly the fine-tuned ones for specific use cases. But as ya mentioned, they’re only really good in their scope for a single prompt or maybe a few follow-ups. I’ve seen little improvement with the 13B/14B models and find them mostly not worth the performance hit.

        • brucethemoose@lemmy.world · 21 days ago

          Depends which 14B. Arcee’s 14B SuperNova Medius model (which is a Qwen 2.5 with some training distilled from larger models) is really incredible, but old Llama 2-based 13B models are awful.

          • Hackworth@lemmy.world · 21 days ago

            I’ll try it out! It’s been a hot minute, and it seems like there are new options all the time.

            • brucethemoose@lemmy.world · 21 days ago

              Try a new quantization as well! Like an IQ4-M depending on the size of your GPU, or even better, a 4.5bpw exl2 with Q6 cache if you can manage to set up TabbyAPI.
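
              Rough sizing math, if it helps pick one (back-of-envelope only: dense model, ignoring KV cache and activation overhead):

                  # Weight memory is roughly params * bits_per_weight / 8 bytes.
                  def weight_gb(params_billion: float, bits_per_weight: float) -> float:
                      return params_billion * bits_per_weight / 8  # 1e9 params * bits / 8 bytes = GB

                  print(weight_gb(14, 4.5))  # ~7.9 GB: a 14B model at 4.5 bpw (exl2)
                  print(weight_gb(14, 16))   # ~28 GB: the same model unquantized in fp16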

        • brucethemoose@lemmy.world · 21 days ago

          https://en.m.wikipedia.org/wiki/External_memory_algorithm

          Unfortunately that’s not really relevant to LLMs beyond inserting things into the text you feed them. For every single word they predict, they make a pass through the multi-gigabyte weights. It’s largely memory-bandwidth bound, and not integrated with any kind of sane external memory algorithm.
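
          Back-of-envelope: at batch size 1, a dense model can’t generate faster than memory bandwidth divided by weight size, because every token streams all the weights (illustrative numbers, not benchmarks):

              bandwidth_gb_s = 1008   # e.g. ~1 TB/s of VRAM bandwidth on a 4090
              weights_gb = 7.9        # a 14B model quantized to ~4.5 bits per weight
              print(bandwidth_gb_s / weights_gb, "tokens/s upper bound")  # ~128 tok/s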

          There are some techniques that muddy this a bit, like MoE and dynamic LoRA loading, but the principle is the same.