

Agreed! I’m just not sure TOPS is the right metric for a CPU, given how different the CPU data pipeline is from a GPU’s. Whether the instruction stream is bubbly or clear is one thing, but the dominant type of instruction in a calculation also significantly affects how many instructions can run per clock cycle, whereas on matrix-optimized silicon it’s a lot fairer to generalize over a bulk workload.
More broadly, I think it’s fundamentally hard to come up with a single number that represents CPU performance across different workloads.
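To make the instruction-mix point concrete, here is a rough sketch. The per-class throughput figures are invented for an imaginary core (not measurements of any real CPU), and the model assumes each instruction class is purely throughput-bound, which real pipelines are not.

```python
# Hypothetical peak throughput per instruction class, in instructions per cycle.
# These values are made up for illustration only.
PER_CYCLE = {"int_add": 4.0, "fp_mul": 2.0, "load": 2.0, "div": 0.25}

def effective_ipc(mix):
    """Estimate instructions per cycle for a workload.

    `mix` maps instruction class -> fraction of the instruction stream.
    Average cycles per instruction is the mix-weighted sum of 1/throughput.
    """
    assert abs(sum(mix.values()) - 1.0) < 1e-9, "fractions must sum to 1"
    cycles_per_instr = sum(frac / PER_CYCLE[kind] for kind, frac in mix.items())
    return 1.0 / cycles_per_instr

# Two workloads on the same imaginary core, differing only in instruction mix.
alu_heavy = {"int_add": 0.7, "fp_mul": 0.1, "load": 0.2, "div": 0.0}
div_heavy = {"int_add": 0.3, "fp_mul": 0.1, "load": 0.2, "div": 0.4}

print(f"ALU-heavy mix: {effective_ipc(alu_heavy):.2f} instr/cycle")
print(f"Div-heavy mix: {effective_ipc(div_heavy):.2f} instr/cycle")
```

Same core, same clock, and the "ops per second" figure differs several times over just from the mix, which is why a single headline number is shaky for CPUs.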
Not somebody who knows a lot about this stuff, as I’m a bit of an AI Luddite, but I know just enough to answer this!
“Tokens” are essentially just a unit of work. Instead of working directly on the user’s raw input, the system first “tokenizes” it, breaking the text into small chunks (tokens) that the actual ML model can process more efficiently. The model then spits out a token or series of tokens as a response, which are converted back into text or whatever the model’s output is.
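A toy sketch of that round trip, assuming a whitespace tokenizer with a tiny made-up vocabulary. Real models typically use subword schemes (byte-pair encoding and similar), so treat the function names and vocabulary here as purely illustrative.

```python
def build_vocab(corpus):
    """Assign an integer ID to every distinct word; ID 0 is reserved for unknowns."""
    vocab = {"<unk>": 0}
    for word in corpus.lower().split():
        vocab.setdefault(word, len(vocab))
    return vocab

def encode(text, vocab):
    """Text -> list of token IDs (what the model actually consumes)."""
    return [vocab.get(word, vocab["<unk>"]) for word in text.lower().split()]

def decode(ids, vocab):
    """List of token IDs -> text (what the user sees)."""
    inverse = {token_id: word for word, token_id in vocab.items()}
    return " ".join(inverse[token_id] for token_id in ids)

vocab = build_vocab("the cat sat on the mat")
ids = encode("the cat sat", vocab)
print(ids)                 # [1, 2, 3]
print(decode(ids, vocab))  # the cat sat
```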
I think tokens are used because most models use them, and use them in a similar way, so they’re the lowest-level common unit of work that lets you compare across devices and models.
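For what it’s worth, the comparison usually comes down to tokens per second. A minimal sketch, with invented timings and a hypothetical helper name:

```python
def tokens_per_second(num_tokens, elapsed_seconds):
    """Throughput: tokens generated divided by wall-clock time."""
    if elapsed_seconds <= 0:
        raise ValueError("elapsed_seconds must be positive")
    return num_tokens / elapsed_seconds

# Made-up numbers: the same 512-token response on two imaginary devices.
device_a = tokens_per_second(512, 16.0)
device_b = tokens_per_second(512, 4.0)
print(device_a, device_b)  # 32.0 128.0
```

The caveat is that tokenizers differ between models, so the number is cleanest when comparing the same model across devices.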