ipsum2 a day ago

Sahil Chaudhary of GlaiveAI perpetrated fraud: he replaced the model he "trained" with other backend ML providers. He still has not given a reason why the string "Claude" would be missing from the API's output, as if it just magically happened, despite the base model (Llama 3.1-70B) having no trouble producing the text "Claude", and the dataset not missing the string "Claude" either!

Note that there was additional evidence beyond the missing string "Claude": matching the maximum number of tokens the model was able to produce. This is more technical, but ChatGPT, Claude, and Llama all have different tokenizers, so the same words are broken up into different pieces. The API consistently did NOT match the base model's tokenizer (Llama); instead, it produced the same number of tokens as Claude.
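
For context, here is a minimal sketch of how that kind of tokenizer fingerprinting works, using OpenAI's cl100k_base tokenizer (via tiktoken) and the Llama 3.1 tokenizer (via transformers) as stand-ins; Claude's tokenizer isn't public, so the original test inferred its behavior from API token limits rather than running it locally. The example sentence below is made up.

    # Sketch: the same sentence is split into a different number of tokens
    # by different tokenizers, which is why token counts can act as a rough
    # fingerprint of the model actually sitting behind an API.
    import tiktoken
    from transformers import AutoTokenizer

    text = "Reflection 70B claims to outperform frontier models on GPQA."

    # OpenAI-style tokenizer (cl100k_base, used by GPT-4/ChatGPT-era models)
    gpt_tok = tiktoken.get_encoding("cl100k_base")
    print("cl100k_base tokens:", len(gpt_tok.encode(text)))

    # Llama 3.1 tokenizer (the claimed base model; the repo is gated, so this
    # assumes you have been granted access on Hugging Face)
    llama_tok = AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3.1-70B")
    print("Llama 3.1 tokens:", len(llama_tok.encode(text, add_special_tokens=False)))

If an API's responses consistently cut off at lengths that line up with one tokenizer's counts and not the base model's, that is the kind of mismatch being described above.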

Companies and individuals should probably avoid GlaiveAI and Matt Shumer lest they get scammed too.

  • lolinder a day ago

    Sorry, I'm having trouble finding more information about this—what is the significance of the model being unable to produce the string "Claude"? Was this some sort of half-hearted censorship to prevent it from disclosing its name? Where can I read more?

  • nisten a day ago

    They're 140GB folders for each checkpoint; yes, file corruption happens.

    And as for the fraud part... it was an open-source model release that did not meet the claimed benchmarks when people tried to replicate it.

    • bastawhiz a day ago

      The fraud part was multiple independent sources producing fairly indisputable evidence that their "hosted" version of the model was just running GPT and Claude. That alone is enough to completely discredit absolutely everything about this work.

      As for corruption, I don't believe the excuse "yes file corruption happens". They're model weights. If this was trained (in real life) it was done on some serious hardware with disks with error correction. They weren't storing the checkpoints on microSD cards. It's certainly possible that there was really unfortunate luck and there was corruption, but I don't find that excuse to be plausible. Especially when this is your business (and launch!)

    • ipsum2 a day ago

      Definition of fraud, from Google:

      * wrongful or criminal deception intended to result in financial or personal gain.

      * a person or thing intended to deceive others, typically by unjustifiably claiming or being credited with accomplishments or qualities.

      Since they were advertising GlaiveAI as this magical source of data where they trained a model that performed better than Claude and ChatGPT, I think this firmly falls into that camp! Your definitions may be different from mine.

      • nisten a day ago

        It was a free, open-source model release; the API was not for sale. There are literally over a million FREE models on Hugging Face.

        • alsodumb a day ago

          Who cares if the model was free? No one said they were trying to commit fraud by releasing that model; they were trying to commit fraud by subtly advertising that their companies/products had the secret sauce to make state-of-the-art models, which they obviously didn't.

    • alsodumb a day ago

      Are you telling me someone trained a huge model, and served it for hours to tons of users, and had only one instance of the checkpoint? I call BS.

      The model being open-source doesn't mean what they could have gotten away with, or tried to, isn't fraud.

      • bhouston a day ago

        He served tons of people from his personal laptop? How is that possible? A 70B LLM is pretty taxing even to serve a single user, let alone the crush of users that tried out this new hyped model, no? What am I missing?

coolspot a day ago

On one hand, I want to believe Sahil, on the other hand most of his explanations don’t make much sense:

Can’t upload the exact weights he had on his computer. The guy runs an AI hosting/inference/training company - can’t upload weights he has!

The original benchmark harness wasn’t shared, but had a bug that conveniently boosted the model’s results.

The API somehow mysteriously censors the model name, and the tokenizer is an exact match to Claude’s.

  • ameliaquining a day ago

    He seems to be claiming that anyone can now reproduce the weird Claude censorship locally with the uploaded weights. Has anyone checked whether that's true or not, or is he mischaracterizing the allegations?

    • xena a day ago

      I'm going to be downloading the weights and doing local verification
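
      A minimal sketch of what that local check could look like, assuming the uploaded weights load as a standard Hugging Face causal LM (the repo name below is the one published at the time and is an assumption here; it may have been replaced since):

          # Sketch: load the uploaded weights locally and check whether the model
          # itself refuses to emit the string "Claude", or whether that filtering
          # only ever happened at the hosted-API layer.
          import torch
          from transformers import AutoModelForCausalLM, AutoTokenizer

          repo = "mattshumer/Reflection-Llama-3.1-70B"  # assumed repo name
          tok = AutoTokenizer.from_pretrained(repo)
          model = AutoModelForCausalLM.from_pretrained(
              repo, torch_dtype=torch.bfloat16, device_map="auto"
          )

          prompt = "Please write the single word: Claude"
          inputs = tok(prompt, return_tensors="pt").to(model.device)
          out = model.generate(**inputs, max_new_tokens=20, do_sample=False)
          text = tok.decode(out[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True)
          print(text)
          print("contains 'Claude':", "Claude" in text)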

      • BoorishBears a day ago

        I think the most damning thing about this whole saga for all of AI is how much energy and attention people are giving it.

        In most established verticals, such a cartoonish scam would be dead on arrival. But apparently generative AI is still not mature enough to just move past this kind of garbage in a clean break.

        • xena a day ago

          To be fair, the AI industry is used to people manifesting out of nowhere doing something stupid and then ending up with revolutionary results. It's no surprise that there's a default optimism (especially if it pans out because then that makes running high quality AI stuff so much cheaper).

        • bubaumba a day ago

          > I think the most damning thing about this whole saga for all of AI is how much energy and attention people are giving it.

          That's because there is nothing better today, and nothing like it in history.

        • lostmsu 12 hours ago

          I think it is damning of the people who aren't paying attention, because this stuff at this trajectory is gonna be world changing pretty soon.

        • refulgentis a day ago

          It's not a cartoonish scam, and if it was, it took 48 hours to fall apart. Not worth getting the Jump to Conclusions™ mat out for.

          This isn't said aggressively or to label, but rather to provide some context that it's probably not nearly as simple as you are suggesting: this thread looks like a bunch of confused engineers linking drama threads from laymen on Twitter/Reddit to each other, seeing pitchforks, and getting out their own. Meanwhile, the harsh conclusions they jump to are belied by A) having engineering knowledge _and_ looking into their claims B) reading TFA

  • all2 a day ago

    I've seen stuff like this hacked together. If he isn't very organized or was hasty, there's a good bet he deleted the working weights or doesn't know which of the 5 or 10 sets of weights is the right one.

    Nothing would stop him from uploading all the weights, I suppose...

    • ipsum2 a day ago

      No. He served the "weights" (actually Claude) for over 24 hours. It's practically impossible to have served the "correct weights" and just have lost them.

      • Havoc a day ago

        >It's practically impossible to have served the "correct weights" and just have lost them.

        Deleting files is very much a thing

        • minimaxir a day ago

          The AI dog ate his homework?

thorum a day ago

> This along with a few tokenizer related tests people ran, made people suspect that we are just serving Claude with post-processing where we filter out words like Claude.

Didn't these "few tokenizer related tests" prove the API was using Claude's tokenizer instead of Llama's, based on how words were being divided into tokens?

That's a hard one to explain (it doesn't appear they're even trying to).

  • refulgentis a day ago

    People keep asserting that, but really, it was just people pointing to setting max tokens to a certain value and getting a certain # of words out. They didn't actually have the tokens. Perfectly possible to have collisions; I'd wager even likely in the scenarios they tested: a simple question, < 10 tokens, in English.
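
    To illustrate the collision point, here's a rough sketch: truncate a short English answer to 10 tokens under two different tokenizers and compare how many words survive. The answer string is made up, and since Claude's tokenizer isn't public, cl100k_base stands in for "some non-Llama tokenizer".

        # Sketch: for short English text, 10 tokens under two different
        # tokenizers often covers a similar number of words, so "same word
        # count at max_tokens=10" is weak evidence about which tokenizer
        # an API is actually using.
        import tiktoken
        from transformers import AutoTokenizer

        answer = "The capital of France is Paris, which sits on the Seine river."

        gpt_tok = tiktoken.get_encoding("cl100k_base")
        llama_tok = AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3.1-70B")

        gpt_cut = gpt_tok.decode(gpt_tok.encode(answer)[:10])
        llama_cut = llama_tok.decode(
            llama_tok.encode(answer, add_special_tokens=False)[:10]
        )

        print("cl100k_base, 10 tokens:", repr(gpt_cut), len(gpt_cut.split()), "words")
        print("Llama 3.1,   10 tokens:", repr(llama_cut), len(llama_cut.split()), "words")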

Havoc a day ago

An expensive lesson in how fragile reputations can be

bhouston a day ago

I am confused. He was hosting the 70B LLM everyone was demoing from his laptop? How can that serve the load? When I’ve run LLMs locally it is really taxing for just one concurrent session.

nisten a day ago

Has anyone here actually run the code on their own hardware yet?

I did a standard non-middleware lm_eval_harness run and got 0.3214 on gpqa_main_zeroshot WITH the system prompt and 0.3616 without the system prompt.

Haven't run it yet with the middleware that's supposed to do the subtraction. Now, if that adds 20% to the score, that would be a huge deal, but it would also roughly match the jump from GPT-4o to o1-preview that they got on gpqa_diamond.
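
For anyone who wants to try the same thing, here is roughly what a non-middleware run looks like with EleutherAI's lm-evaluation-harness Python API. The repo name is an assumption (use whatever checkpoint you actually downloaded), and argument names can vary between harness versions:

    # Sketch: score the released weights on gpqa_main_zeroshot with the
    # stock lm-evaluation-harness and no custom middleware.
    from lm_eval import simple_evaluate

    results = simple_evaluate(
        model="hf",
        model_args="pretrained=mattshumer/Reflection-Llama-3.1-70B,dtype=bfloat16",
        tasks=["gpqa_main_zeroshot"],
        # Recent harness versions also accept a system_instruction= argument,
        # which is how you'd compare the with/without-system-prompt numbers.
        batch_size="auto",
    )
    print(results["results"]["gpqa_main_zeroshot"])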

kristianp a day ago

If this is for real, in some ways it shows how small OpenAI's moat is. Once someone knows something is possible and has the rough idea, the community can replicate it in 4 weeks.

  • jsheard a day ago

    Isn't Reflection supposed to be based on CoT like o1? It was originally released a week before o1 was, so if it was the real deal all along then OpenAI were outright beaten to the punch rather than replicated after the fact.

    • thorum a day ago

      CoT moderately improves model performance, but all non-o1 models suck at actually thinking step by step effectively. If the task is not straightforward, they make obvious mistakes or default to guessing.

      OpenAI trained o1 to pick better steps in its chain of thought. (The moat is the dataset they used to do that.)

    • bastawhiz a day ago

      Maybe, if it wasn't an outright fraud. Arguably they didn't beat anyone to anything.

      • refulgentis a day ago

        > Maybe, if it wasn't an outright fraud.

        I mean, it obviously wasn't, did you read the thing we're commenting on? n.b. At this point, you have all you need to replicate it. Far shy of "outright fraud", though, I'm sure there's a bailey for that motte.

        • bastawhiz 12 hours ago

          It's indisputably fraud. Multiple people after the original launch showed strong evidence that the hosted model they produced was simply proxying Claude and prompted it to censor its own name. It genuinely doesn't matter what they say, they committed blatant fraud, went out of their way to hide it, and now they're pretending like that didn't happen.

          The results might be perfectly reproducible, but their reputation is completely burned. This is not how you launch your company.

          Even if you don't care about that, they didn't release anything of substance before the o1 launch. They didn't release usable weights, and they didn't ship a working hosted model of their own. So no, they didn't beat OpenAI to anything.

          • refulgentis 10 hours ago

            > showed strong evidence that the hosted model they produced was simply proxying Claude and prompted it to censor its own name.

            s/strong/extremely weak from my perspective, also, see article

            > They didn't release usable weights,

            Yes they did; it just didn't benchmark the same as the initial claim.

            > they didn't ship a working hosted model of their own.

            Yes they did; it just didn't benchmark the same as the initial claim.

            > So no, they didn't beat OpenAI to anything.

            Not sure where the idea they "beat OpenAI" is coming from, certainly not from me. I agree they did not.

            > It's indisputably fraud.

            This is indisputably incorrect, as I am disputing it.

            Happy to talk it out; don't take my shortness as being disagreeable. In general, people handwave about the tokenizer[1] or "Claude" missing from a response[2]. I honestly expected the HN thread here to be far more insightful; instead, I'm seeing claims that it's indisputably fraud, based on repeating a couple of observations gooner teens made last week and the vast conclusions they drew from them, which were obviously wrong if you looked at it as an engineer.

            [1] No one can get the actual tokens out of an API. Gooner local LLM stans were setting max tokens to some number <= 10, asking the same question of both, and seeing answers of similar length. This is mundane and expected, especially in English, at such a short length. I expected technical observers, even if they don't grok tokenization, to at least note that they weren't able to get the same responses with temperature = 0.0.

            [2] covered in article

            • bastawhiz 7 hours ago

              I'm really not going to argue with you, because when faced with lots of little bits of compelling evidence from a whole bunch of sources (showing their work) versus the word of some guy on the Internet, I'll believe the evidence. Sahil didn't actually refute any of the concerns around the hosted model, he just acknowledged that it was weird and said he didn't know why. Great. That's useless. So much for "looking at it as an engineer".

              But what's great about the passage of time is that people can actually take what's presented in the article and try to replicate the benchmarks. And now that it's Friday, October 4th, and we've got this gem:

              https://x.com/mattshumer_/status/1842313328166907995

              So frankly it's still a fraud, because even now, after the postmortem, the results are still not reproducible. That's the whole point, right? That it does what it says on the tin. And it doesn't. This whole process could be a shell script that downloads and runs. They've had more time than they should need. Now it's gone from a shell game to plain old academic dishonesty. If this were a published paper, it would be ripe for retraction.

    • nisten a day ago

      Yes, and the massive increase in GPQA scores from o1 was attributed to this technique, so there is something there, despite the hard feelings of unproductive Reddit users.

alsodumb a day ago

I don't trust Sahil and Matt. They tried to commit fraud and hype things up, but it got way more attention than they expected, so they tried to get away with just serving Claude/ChatGPT in the background and got caught. They are nothing but grifters who got caught and are now trying to fix that image.

m3kw9 a day ago

It's either fraud or incompetence; I say it's just incompetence, like he said. Maybe they got too excited over some false test results, perhaps with validation data mixed in.

  • bastawhiz a day ago

    So when they put the hosted model online (which was actually just proxying Claude), they explicitly prompted Claude to censor its own name. That's not explainable with incompetence. It's very intentional deception.

    • jazzyjackson a day ago

      I just can't facepalm enough seeing so-called AI companies relying on Python's .replace() when they need to hide what service they're building on.
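
      For what it's worth, the kind of naive post-processing being alleged really is about this small, which is also why it's so easy to catch: the word simply never shows up, no matter how you ask. A made-up sketch, not anyone's actual code:

          # Made-up sketch of the alleged setup: forward requests to another
          # provider, then strip the telltale model name from the reply.
          def scrub(reply: str) -> str:
              # This is exactly the .replace() trick: the word "Claude"
              # silently disappears from every response...
              return reply.replace("Claude", "").replace("Anthropic", "")

          # ...which is detectable, because a genuine Llama finetune has no
          # trouble writing the word when asked, while the filtered proxy
          # leaves conspicuous gaps:
          print(scrub("I am Claude, a model made by Anthropic."))  # "I am , a model made by ."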

blackeyeblitzar a day ago

Past discussions about this model and its controversies:

Reflection 70B, the top open-source model https://news.ycombinator.com/item?id=41459781

Confirmed: Reflection 70B's official API is a wrapper for Sonnet 3.5 https://news.ycombinator.com/item?id=41484981

  • nisten a day ago

    There is no official API; you're confirming a temporary one that was taken down weeks ago. That was done via OpenRouter, a routing API service/site, which routes to different models under load.

    Yes they could've switched it themselves too.

ilaksh a day ago

The models are very powerful. This can help anyone, including scammers. The number of scams will be enormous.

I just hope people are also able to see how useful it is for non-scumbags and don't let the scammers ruin it for everyone else. I am not going to mention another similar category of technology in this regard, just to stay "politically correct" for this site.

  • talldayo a day ago

    > I just hope people are also able to see how useful it is for non-scumbags and don't let the scammers ruin it for everyone else.

    I hope so too - it's been four years since GPT-3 came out and I haven't found a single serious time-saving application for the technology.

    If someone doesn't start making money with LLMs soon, then it will only be the scammers who benefit!