No.2334
There's been a lot of chatter lately about Deepseek. In the online circles I'm in, people have a politics-colored understanding, more or less saying "American tech companies couldn't do this, but an open-source Chinese company could and American tech companies are in 'damage control'". Which... I really don't understand. If it's an open-source model, like Llama was, for example, I don't see how this doesn't just cause a proliferation of much more efficient and performant models -- the same way that, after Llama became available, there was suddenly Phi from Microsoft, Gemma from Google, Mistral, and others.
What does /maho/ think?
No.2335
I think the average person has no idea what the fuck open source means.
No.2336
Also... I find these benchmarks dubious to be honest. Practically every benchmark says that they're the best. If you look at a Phi benchmark, it'll say they're the best. You look at a Gemma benchmark, it's the best. A Qwen benchmark, they're the best. And so on...
Does anyone actually have any first-hand experience to say whether Deepseek is actually any good? To relate back to how all these benchmarks are essentially cherry-picked BS, Phi says it's great. Well, it's alright, but it's heavily censored and shit to actually use, even if it is fast. Gemma is much the same. Qwen is fast, and it's not completely censored like the aforementioned, which makes it much more ideal to use, even if its complexity isn't as high. Llama, however, stands out in my experience as generally providing the best responses, at the cost of being just a bit slower than the other three.
I'm very curious whether Deepseek is "great in benchmarks!" only, with a neutered ability to perform actual conversational tasks but decent at more specific knowledge-related tasks, or whether it's more conversationally optimized the way Claude and OpenAI's models are.
No.2337
From what I've gleaned, the whole issue surrounding Deepseek is that you can make an LLM on par with what OpenAI has been putting out without massive capex expenditures for hardware.
No.2339
you can test the results yourself and I figure a bunch of people have anyways.
It's that China has been sanctioned from getting high-tech equipment, but Chinese scientists are like 'lmao capitalism is so inefficient'.
No.2340
Is this only for chatbot AI? Are they planning to do this for image/video gen or speech AI? Chatbots can't make anime so no matter how smart they get they're still boring and useless.
No.2342
>>2340
They have an image thing out now
>>2333
No.2343
>>2337
>without massive capex expenditures for hardware
I wouldn't say that. If you look at their benchmark, they suggest that OpenAI o1 is 1217B parameters, whereas DeepSeek R1 is 607B parameters according to their GitHub. Half the number of parameters is certainly significant, but I would hardly say that's any less of a massive capital expenditure. 1B parameters is roughly 1GB, so DeepSeek R1 would still require approximately 607GB to run. 8x H100s is certainly more affordable than 8x H200s, but that's not really saying much...
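For what it's worth, here's the back-of-the-envelope version of that memory math as a quick Python sketch. It just takes the parameter counts quoted above at face value and varies the bytes per parameter; the 8-bit row is what matches the "1B parameters is roughly 1GB" rule of thumb, and it ignores KV cache and runtime overhead entirely:
```python
# Rough weight-memory estimate; ignores KV cache, activations, and runtime overhead.
def weight_memory_gb(params_billions: float, bytes_per_param: float) -> float:
    # params_billions * 1e9 params * bytes_per_param bytes, divided by 1e9 bytes per GB
    return params_billions * bytes_per_param

for name, params_b in [("DeepSeek R1 (~607B)", 607), ("OpenAI o1 (alleged 1217B)", 1217)]:
    for precision, bpp in [("FP16", 2.0), ("INT8", 1.0), ("INT4", 0.5)]:
        print(f"{name:26s} @ {precision}: ~{weight_memory_gb(params_b, bpp):,.0f} GB just for weights")
```
Even the 4-bit case is still a multi-GPU rig, which is the point.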
This is all very confusing because it seems like a lot of people who know absolutely nothing about LLMs have been making statements that they don't realize have actual meaning. I was hoping someone on /maho/ would know more, but that doesn't really seem to be the case so far...
For instance, people have also been talking about efficiency a lot, presumably in reference to tokens/s relative to billions of parameters?... But I've yet to see anything suggesting that DeepSeek R1 is any faster in terms of tokens/s per billion parameters. Now, maybe that's complicated by DeepSeek R1 allegedly performing at the level of OpenAI o1, because that in and of itself would be a marked improvement in efficiency. That was why I posed the initial question in the OP of whether it's closer to a Mixture of Experts-type model (which tend to score highly on benchmarks because they have tuned datasets to perform well at knowledge tasks, but suffer in conversation), or whether it's a broader, more generalized and conversational model (such as OpenAI GPT4 or o1, or Anthropic Claude), which -- despite their size -- are far more capable and excel in both conversation and knowledge tasks, at the expense of requiring much more memory.
It wouldn't really be that impressive if DeepSeek R1 is good for knowledge tasks, but useless for conversation; and by "useless" I mean the ability to mold the style and type of response. To give an example, MoE models tend to be designed to respond in the following way: "What is the circumference of the Earth?" ... "[Circumference of Earth]", whereas generalized models can do much more complex things like "Format a socratic dialogue on the nature of kinematics in pirate speak" ... "[Characters discussing kinematics in pirate speak]".
No.2352
I don't work in tech and I already am well aware how degraded and puffed-up America is, so none of these developments surprised me.
>>2339
All this petty behavior has made America's "brightest" just look pathetic, and despite subtle nudges from China, it really is mainly all self-inflicted.
I wish I could get in on this but even before that I need all new PC hardware to use it effectively. Those budget offline models don't seem impressive to me.
No.2359
>>2336
It's on par with o1 on several non-cheatable third-party benchmarks, and the RP community generally thinks it's comparable to Claude (the RP SOTA).
There are pros and cons but R1 sticks to the prompt much better than everything else which makes it less censored.
No.2360
China dropping DeepSeek R1, an open-source AI model rivaling ChatGPT o1 (or whatever their best one is called) for 2% of the monthly cost, right as the US established its plan to invest $500 billion in AI, is as funny as things can get. They even, apparently, did it with NVIDIA restricted from selling its best GPUs in China. Is DeepSeek lying about their expenditure of a measly $6 million? Or are the American companies lying about their high expenses just so they can pocket the rest...
No.2361
>>2358
There are already several (gimped) models that you can run locally on regular consumer hardware, but they obviously perform way worse than the big boy stuff.
That being said, the required hardware demand chart looks like a stairway constructed by a drunk person. Every once in a while, the required specs drop like a rock all at once. There's no telling when today's server-grade AI could run on a run-of-the-mill GPU, but it's probably going to be surprisingly soon. Hell, everything about AI has been surprisingly soon.
No.2364
>>2334
>people have a politics-colored understanding, more or less saying "American tech companies couldn't do this, but an open-source Chinese company could and American tech companies are in 'damage control'". Which... I really don't understand
People are literally paid to do this.
No.2365
>>2364
We are also at the point where AI companies use AI to defend themselves in online discussions. People then read this and parrot their talking points. I hate this decade so much now.
No.2370
What do people use these for? Seriously?
No.2371
>>2368
The joke is that people are making criticisms no one cares about. You can locally host it without any of these issues anyway. I think.
The blocking is in China.
No.2372
A much better joke would be "ask ChatGPT what percentage of american billionaires are jewish"
No.2373
AI is bad and Kissu only likes it because of contrarianism, and people only like this one because they think the enemy of their enemy is their friend, even if that friend is a homophobic pedophile
No.2374
ai is good at translations
No.2376
>>2373
but AI is good at making anti-contrarianism art and homoerotic anti-pedo art
No.2377
>>2366
I also did a very cursory bit of reading. As I was expecting, Deepseek is indeed a Mixture of Experts model. I now understand what all the hubbub was about. As an MoE LLM, Deepseek only requires a subset of the "experts" (essentially smaller sub-networks specialized for particular kinds of content) at any given time when generating a response. This is in contrast to the more typical dense approach, where all parameters of the model are required at once to generate a response.
This has significant advantages because it means not only significant savings in compute (and memory traffic) per response, but also that extraneous information doesn't need to be considered when generating a response. For example, if the prompt is in English, you can activate the "English Expert" only, and not have to activate the parameters necessary to respond in Chinese, or Hindi, or German, etc. This same division of experts can be done across any number of topics: history, mathematics, philosophy, literature, media information, slang, etc.
From a purely hardware-constrained perspective, we would obviously expect the MoE model to perform better compared to a traditional LLM. From a more design philosophy-oriented perspective, the disadvantage is that because not all of the parameters are being activated at once, you may lose some of the cross-pollination and latent association that a traditional dense LLM handles more readily.
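If it helps make the "only a subset of experts at a time" part concrete, here's a toy sketch of a top-k routed MoE layer in PyTorch. To be clear, this is not DeepSeek's actual architecture (their expert count, shared experts, and routing details differ, and in practice the router learns its own messy specializations rather than clean human topics like "English" or "history"); it's just the general mechanism:
```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ToyMoELayer(nn.Module):
    """Toy mixture-of-experts feed-forward layer with top-k routing.
    Illustrative only: real implementations batch tokens per expert,
    add load-balancing losses, shared experts, and so on."""

    def __init__(self, d_model=512, d_hidden=2048, n_experts=8, top_k=2):
        super().__init__()
        self.top_k = top_k
        self.router = nn.Linear(d_model, n_experts)  # the gating network
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, d_hidden), nn.GELU(), nn.Linear(d_hidden, d_model))
            for _ in range(n_experts)
        )

    def forward(self, x):                                         # x: (n_tokens, d_model)
        scores = F.softmax(self.router(x), dim=-1)                # routing probabilities
        weights, chosen = torch.topk(scores, self.top_k, dim=-1)  # keep only the top-k experts per token
        weights = weights / weights.sum(dim=-1, keepdim=True)     # renormalize over the chosen experts
        out = torch.zeros_like(x)
        for t in range(x.size(0)):                                # per token, run only its chosen experts
            for w, e in zip(weights[t], chosen[t]):
                out[t] += w * self.experts[int(e)](x[t])
        return out

# Example: 16 tokens flow through, each touching only 2 of the 8 experts.
layer = ToyMoELayer()
print(layer(torch.randn(16, 512)).shape)   # torch.Size([16, 512])
```
Only top_k of the n_experts run for each token, so per-token compute scales with the active experts rather than with the total parameter count. The full set of expert weights still has to sit in memory, though, which is why the ~607GB figure from earlier in the thread doesn't shrink just because it's MoE.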
From just a little bit of testing, Deepseek R1 does seem impressive, but I feel like it's probably more comparable to GPT4o mini, rather than GPT4o, or Claude Sonnet. It's hard to explain, but at times it feels a bit "local model"-y, probably because of the limited number of parameters being activated at any given time. The ability to respond well is certainly there, but the depth and style of response feels slightly lacking.
No.2378
>>2374
It's not, but it's better than I am, and sometimes that's enough.
No.2379
>>2378
how many japanese erogames do you play per month
No.2380
>>2379
Approximately zero because dekinai, but that may change if human translators don't step up their game.
No.2381
>>2380
I play somewhere in the range of 10-30. Human translation produces higher-quality results that are really only desirable to the translator themselves. For the general audience the vibe is all that's required and the artwork fills out the rest. Deepseek's translations in a recent game I played were in no way noticeable as an AI translation.
No.2382
And this was in V3. People have said on Twitter that Deepseek's ability to work in Sri Lankan languages is very effective. This project is going to be a very cheap way for China to effectively communicate with countries the US has abandoned.
No.2383
>>2381
>For the general audience the vibe is all that's required and the artwork fills out the rest.
Absolutely horrible attitude towards translation but what should I expect from AItards
No.2384
>>2377
>This same division of experts can be done across any number of topics: history, mathematics, philosophy, literature, media information, slang, etc.
As a bystander who has only kept up with new developments from the sidelines from a high level perspective, this sounds like it could be a major breakthrough.
My biggest frustration with the direction of recent AI development has been the split between human-readable models based on formal logic with limited domain "knowledge" (depending on your definition of knowledge) and the almost unauditable probabilistic machine learning models with broad focus but low reproducibility that have been dominating the hype cycle for the past few years. It's almost like we're now approaching something that resembles the human brain's ability to coordinate between specialized subsystems.
>It's hard to explain, but at times it feels a bit "local model"-y, probably because of the limited number of parameters being activated at any given time. The ability to respond well is certainly there, but the depth and style of response feels slightly lacking.
That's to be expected from a naive combination of domain-specific models. I could see that improving once someone develops a higher-level pattern recognition model to determine which domain-specific models need to be invoked when, which tokens serve as the bridges between domains, and how strong the relative weight of each domain model should be based on the strengths of their internal correlations.
I'm not sure if this makes any sense but it's the best I can do at 2 AM while boozy.
No.2392
>>2384
>from a high level perspective, this sounds like it could be a major breakthrough.
It certainly could be. For one thing, unlike a traditional LLM, because only certain experts are invoked at a time, you don't need to train the aggregate size of the LLM. With a model like OpenAI's, which has some 1200B parameters, you need to have the memory to fit the entire model to train it. With an MoE, you only need to train the experts, and then the gating model to coordinate the experts.
>I could see that improving once someone develops a higher-level pattern recognition model to determine which domain-specific models need to be invoked when, which tokens serve as the bridges between domains, and how strong the relative weight of each domain model should be based on the strengths of their internal correlations.
You've actually got it exactly correct! This is how an MoE LLM works. A gating model determines what experts to invoke, and then applies weights to the experts based on how strongly related they are to the prompt.
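To put rough numbers on why the per-token cost ends up so much smaller than the headline parameter count, here's a quick sketch with a made-up MoE config. The layer sizes, expert count, and top-k below are illustrative assumptions, not DeepSeek's published numbers:
```python
# Hypothetical MoE transformer config; every number below is made up for illustration.
d_model   = 4096
d_hidden  = 14336        # hidden size of each expert's feed-forward block
n_layers  = 60
n_experts = 64           # experts per MoE layer
top_k     = 6            # experts the router activates per token

params_per_expert    = 2 * d_model * d_hidden               # two weight matrices per expert FFN
total_expert_params  = n_layers * n_experts * params_per_expert
active_expert_params = n_layers * top_k * params_per_expert

print(f"total expert params : {total_expert_params / 1e9:6.1f}B")   # ~451B sitting in memory
print(f"active per token    : {active_expert_params / 1e9:6.1f}B")  # ~42B doing work per token
# Attention, embeddings, and the router itself are always active and not counted here.
```
Per-token compute (and gradient compute during training) scales with the active slice, though the full set of expert weights still has to live somewhere during both training and inference, so the savings are mostly in FLOPs rather than raw capacity.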
It would be really great if this paradigm became standard and enabled extensible LLMs, the same way stable diffusion has LoRAs (as >>2358 mentioned): you could hypothetically imagine being able to drop in a "roleplay" expert. That might be just as explosive for LLMs as Llama was for local models, and as stable diffusion LoRAs were for fine-tuning local image generation.
No.2400
actually I'll toss it in /secret/
No.2401
>>2381
>play somewhere in the range of 10-30.
are eroge really that short? i thought a guy wouldn't exhaust the average eroge in 1-3 days
No.2402
>>2401
rpgmaker titles and such
No.2403
>>2401
dl-site popcorn porn is like 10 minutes to 2 hours, while other titles that get physical disk releases can go upwards of 60 hours and beyond, and there's a variety of just about everything in between
No.2404
I guess saying AI is bad at something is considered politics now.
No.2405
>>2360
>Is DeepSeek lying about their expenditure of a measly $6 million? Or are the American companies lying about their high expenses just so they can pocket the rest...
The $6 million was the cost for the final viable model training run, so it's a bit misleading. It's like saying something took $6 million in raw materials. You won't be able to do anything with that unless you have the infrastructure, knowledge, personnel and other stuff that took a lot of money to get you into that position.
If it really did use a lot of data from ChatGPT and Claude then I imagine they saved a lot of money there, but I really don't know how that works. Basically if they didn't have those two to build upon then it wouldn't have been so cheap.
Going forward that means they will need better models from other companies to borrow from or they'll stagnate.
>>2398
I think Nvidia was just paired with the others as a group, although Nvidia was also knocked down a peg by this not using CUDA, which is the exclusive Nvidia tech that had until now been VERY closely tied to AI. Stock market stuff is full of idiots, obviously, so anything could make value go up or down. I think the feeling of invincibility and predictability was shattered, though, and that's why it went down.