ProgramBench, a new benchmark from Facebook/Meta (by the SWE-Bench creators), tests whether LLMs can recreate real executable programs (ffmpeg, SQLite) from scratch with no internet access. They all score 0%.

beep@piefed.world · 4 days ago

DraconicSun@piefed.social · 4 days ago

But, but, but, AI coding is the future and devs who don’t use AI are gonna get left behind!!! You’re just a stupid Luddite whose job will be replaced anyways!!!

FaceDeer@fedia.io · 1 day ago

This benchmark is presenting AI with a challenge that’s greater than what human devs normally face. It’s supposed to be really hard, so it’s not surprising that current models get 0%.

The point is that over time models will continue to improve and this benchmark will measure that improvement. A lot of current benchmarks have been saturated; once models are getting near 100% scores, there’s no point to them any more.

kibiz0r@midwest.social · 4 days ago

It’s incredible how we went from everyone laughing at the YNGMI crypto bros to the entire economy being built on top of YNGMI AI bros.

u_tamtam@programming.dev · 4 days ago

No, it’s not, and part of that is the current legislative laissez-faire in the US that has put its regulatory bodies on hiatus. Under normal circumstances, this stuff would have been under much more scrutiny and regulation. I’m not saying that the state should control what LLMs do or who has access to them, but it could very much tackle the deceptive marketing, environmental and societal impact, unsound financing, and abnormal market consolidation, and mitigate the overall financial risk.

./ProgramBench

Can language models rebuild programs from scratch?