Alts (mostly for modding)

@sga013@lemmy.world

(Earlier also had @sga@lemmy.world for a year before I switched to lemmings)

  • 42 Posts
  • 686 Comments
Joined 7 months ago
Cake day: January 16, 2025

  • sga to Privacy@programming.dev · What is post Aurora Store? · 3 days ago

    I would probably have to get the Play Store back (which would mean making a Google account now). I need WhatsApp and banking apps. Maybe WhatsApp can work with a direct download from their website (it used to work in Obtainium too, but stopped), but banking apps are notorious: they want to be installed by the Play Store, so I have to keep the Play Store installed but disabled, and that can “sometimes” trick the apps into thinking they were installed by the Play Store (I have to keep enabling/disabling Play Services and the Play Store until it eventually works).






  • The heuristic is correct, but there is more to it. These models are MoE (mixture of experts): essentially a bundle of tinier, more specialised models. At a given time, n out of m experts may be active (usually n is 2; I have heard there is not much gain from increasing n beyond 2). You can consider one expert to be in the ballpark of 2B params, and let's say there are 8-9 of them. Then there are some extra “mandatory layers” which always have to be there (you can assume they orchestrate stuff) (at least that is how I have understood this). The benefit of these “sparse” models is that at any given time only n experts are being used, so compute is faster.

    Most big models are MoEs (the DeepSeek 670B+, or Kimi 1000B+), with the size of individual experts practically never exceeding the 30s (B). The largest open dense model afaik was Llama 405B, and below that some 100B+ Cohere and Mistral stuff, but that was 1+ year ago (decades in AI terms). The industry found that MoEs were in general better bang for the buck. MoEs have the disadvantage that you have to hold all of the model in memory (not necessarily VRAM, RAM is fine too, but ideally you should not be reading from disk: the n active experts out of m keep changing, so if an expert has to be read from storage each time, it is that much slower), but the actual compute only involves n experts, so it is fast. As a general rule of thumb, a MoE will always be stupider than a dense model of equal weight (one single large model), but much faster. The catch is that training one large model gets very hard, so you train many small models and package them together (you do not really train them separately, but that is a separate discussion). A toy sketch of the routing idea is below.
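    Not how any real framework spells it, just a toy sketch of the top-n routing with made-up names and sizes: a small gating layer scores all m experts for a token, only the top-n scores survive, and only those experts actually run.

    ```python
    import numpy as np

    # Toy mixture-of-experts routing: m small experts, only the top-n run per token.
    # All sizes here are illustrative, not taken from any particular model.
    m_experts, top_n, d_model = 8, 2, 16
    rng = np.random.default_rng(0)

    experts = [rng.normal(size=(d_model, d_model)) for _ in range(m_experts)]  # expert weights
    gate_w = rng.normal(size=(d_model, m_experts))                             # router / gating layer

    def moe_forward(x):
        scores = x @ gate_w                                   # one score per expert
        top = np.argsort(scores)[-top_n:]                     # pick the n best-scoring experts
        w = np.exp(scores[top]) / np.exp(scores[top]).sum()   # softmax over the chosen few
        # only the selected experts do any compute; the other m - n stay idle (the “sparse” part)
        return sum(wi * (x @ experts[i]) for wi, i in zip(w, top))

    print(moe_forward(rng.normal(size=d_model)).shape)  # (16,)
    ```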

    Now back to VRAM requirements: if you use a non-quantised model, each param takes 1 or 2 bytes of memory (fp8 or fp16). 1 billion bytes = 1 GB (not GiB), so a 20B fp16 model would be ~40 GB. Here fp16 is a 16-bit floating point number, i.e. a representation of a real number with finite precision. Imagine π extending to infinite digits, but you usually do not need all the digits (famously, NASA uses something like 15 digits after the decimal). In this case the 16 bits hold the sign, exponent and mantissa (1 sign bit, 5 exponent bits, 10 mantissa bits, but you can read about this online, it is not that important for the current conversation).

    You usually do not need all that precision just to run the model, so we quantise these models (imagine rounding to lower precision). A common quant is q4 (4 bits as opposed to 16), but some tensors are not rounded that much and are kept at something like 8 bits (that is what gives us the extra letters in Q4_K_M). Quantisation is not a free operation: what you gain in reduced memory usage and slightly faster computation, you lose in smartness, but the trade-off is usually worth it. You can measure the amount of smarts you lose with something called “perplexity”, i.e. how far the quant drifts from the unquantised model; the lower the perplexity, the better the quant. Usually you do not want to go below q4, but for huge models (670/1000B+) you have to use something like q1 or q2, because you got to do what you got to do to at least run it. So you can figure that Q4_K_M takes roughly 0.6 bytes per param, which in this case is something like 12-13 GB. You have to hold all of this in memory (quick back-of-the-envelope version below).
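    The arithmetic above as a throwaway helper. The bytes-per-weight numbers are rough rules of thumb I am assuming, not exact figures for any particular quant format.

    ```python
    # Rough model-footprint estimate: params × bytes-per-weight.
    # Bytes/weight values are approximations; Q4_K_M lands near ~0.6 B/param once
    # the mixed 4/8-bit tensors and metadata are averaged in.
    BYTES_PER_WEIGHT = {"fp16": 2.0, "fp8": 1.0, "q8": 1.06, "q4_k_m": 0.6}

    def model_size_gb(params_billion, quant="q4_k_m"):
        # 1e9 params × bytes/param ≈ gigabytes (GB, not GiB)
        return params_billion * BYTES_PER_WEIGHT[quant]

    print(model_size_gb(20, "fp16"))    # ~40 GB — the 20B fp16 example above
    print(model_size_gb(20, "q4_k_m"))  # ~12 GB — what actually has to sit in memory
    ```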

    Ideally, both your context and the model weights should be held in VRAM. The memory cost of context varies by model, but I usually do not think much about it (32k of context requires close to 10 GB of VRAM iirc; my stuff usually never goes beyond 4-6k).
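    For the curious, that context cost is the KV cache, and a rough formula is 2 (K and V) × layers × kv-heads × head-dim × context length × bytes per value. The model shape below is a hypothetical 70B-class dense model, picked only to show how a 32k context can land near 10 GB at fp16.

    ```python
    # Rough KV-cache size: 2 (K and V) × layers × kv_heads × head_dim × context × bytes.
    # The layer/head counts below are a hypothetical 70B-class dense model, for illustration only.
    def kv_cache_gb(n_layers, n_kv_heads, head_dim, context_len, bytes_per_val=2):
        return 2 * n_layers * n_kv_heads * head_dim * context_len * bytes_per_val / 1e9

    print(kv_cache_gb(n_layers=80, n_kv_heads=8, head_dim=128, context_len=32_768))
    # ≈ 10.7 GB at fp16; models with fewer layers/kv-heads, or a quantised cache, need much less
    ```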

    Many people cannot afford much VRAM, so they run CPU + RAM combos.

    In this case, I have heard of someone with 8 GB of VRAM and 64 GB of system memory running gpt-oss 120B at roughly 10-11 tps, which is quite fast imo, because only ~6B params are active (rough napkin math below).
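    Napkin math for why that is plausible: token generation is mostly memory-bandwidth bound, so tokens per second is roughly bandwidth divided by the bytes touched per token (about the active params at the chosen quant). Both numbers below are guesses for a typical dual-channel desktop, not measurements.

    ```python
    # Very rough decode-speed ceiling for a bandwidth-bound MoE:
    # tps ≈ effective memory bandwidth / bytes read per token (≈ active params × bytes/weight).
    def rough_tps(active_params_b, bytes_per_weight, bandwidth_gb_s):
        return bandwidth_gb_s / (active_params_b * bytes_per_weight)

    print(rough_tps(active_params_b=6, bytes_per_weight=0.6, bandwidth_gb_s=60))
    # ≈ 16-17 tps as an upper bound; real-world overheads pull it down toward the reported 10-11
    ```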

    But the user here has a total of 40 GB, so the best they can do is roughly a 32B dense model. Qwen3 32B is good, but I do not think it would be fast on that hardware, so I recommended something much smaller.


  • You have plenty of VRAM to run some MoEs. Community opinion on it is not great, but maybe try gpt-oss 20B: afaik it has ~3.6B active params, and the rest can easily fit in your plentiful RAM. If you use llama.cpp, there is a new --cpu-moe flag for mixture-of-experts models (it keeps the expert weights in system RAM). I think you can get in the ballpark of 20 tps, and that is very fast imo.


  • Do you mind running stuff locally? If not, you can try the new Qwen models (2507), whichever is largest and fits in your RAM + VRAM (never go above Q4_K_M unless the model is tiny (4B or lower), then try Q5_K_M; higher-precision quants are not that useful, but the better performance from running faster is).

    For API stuff, maybe try HuggingChat? If you make an account, you can use multiple models, with (arguably) better privacy. There are more generic inference providers, but I do not know enough about them to trust them.