April 12, 2026

I benchmarked Cali against 200 real meals

How accurate is an AI calorie tracker, really? I ran a public benchmark on 200 photographed meals. Here are the honest numbers.

I built Cali because I have zero discipline with calorie tracking. Every app I tried wanted me to scan barcodes, search databases, weigh portions — I'd last about three days and quit. So I built the laziest calorie tracker I could: just talk to it, snap a photo, or type whatever. Log your whole day in bulk if you forgot. No databases, no barcode scanning, no friction. AI figures out the calories and macros.

The tagline is literally "the laziest calorie tracker."

I wanted to be honest about how accurate it actually is, so I built a proper benchmark. Here's what I found.

The benchmark

I used Nutrition5k, Google's public dataset of ~5,000 real cafeteria meals photographed overhead with every ingredient weighed on a scale. I sampled 200 dishes and ran each photo through Cali exactly as a real user would. (This tests the photo path specifically. Nutrition5k doesn't have voice/text equivalents, but the same AI estimation step runs behind every input method.)

The results

Metric	Value
Calorie MAPE (clean)	38.1%
Calorie median error	29%
Ingredient recognition	92.7%
Latency (typical)	~3 seconds

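Translation: on a typical meal, Cali is off by about 29%. Half the time it's better than that, half the time worse. MAPE (mean absolute percentage error) comes out higher, at 38.1%, because a few large misses pull the average up. "Clean" means the 21 dishes with ground-truth problems, covered further down, are excluded. It correctly identifies what's on the plate ~93% of the time.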
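For reference, here is a minimal sketch of how metrics like these can be computed. The function names and the sample values are illustrative assumptions, not code or data from the Cali eval harness.

```python
from statistics import mean, median

def absolute_percentage_errors(truth, predicted):
    """Per-dish |predicted - truth| / truth, expressed as a percentage."""
    return [abs(p - t) / t * 100 for t, p in zip(truth, predicted)]

def summarize(truth, predicted, threshold=30.0):
    """Return MAPE, median error, and the share of dishes within `threshold` %."""
    errors = absolute_percentage_errors(truth, predicted)
    within = sum(e <= threshold for e in errors) / len(errors) * 100
    return {
        "mape": mean(errors),
        "median_ape": median(errors),
        "within_threshold": within,
    }

# Made-up calorie values, chosen so that one dish is a large miss.
truth = [200.0, 400.0, 500.0, 100.0]
predicted = [180.0, 500.0, 450.0, 250.0]
summary = summarize(truth, predicted)
print(summary)
```

In this toy sample three dishes are within 25% and one is off by 150%, so the mean lands far above the median. That is the same pattern as a MAPE that sits above the median error.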

How good are LLMs at calorie estimation when they get the food right?

This was the most interesting finding. When the model correctly identifies the food and the dish is visually clear, accuracy can be surprisingly good:

Scrambled eggs: ground truth 183 kcal, predicted 180 kcal (1.9% off)
Half bagel with cream cheese: GT 191 kcal, predicted 195 kcal (1.7% off)
2 slices pizza + steak + veggies: GT 733 kcal, predicted 695 kcal (5.2% off)
Salmon + Brussels sprouts + grain: GT 264 kcal, predicted 280 kcal (6% off)

Across the clean benchmark, 50% of dishes are within 30% of ground truth and 29% are within 20%. Macros are off by ~9 g protein, ~7 g carbs, and ~7 g fat on average.

The takeaway: LLMs actually have solid nutritional knowledge — they know how many calories are in scrambled eggs or a bagel with cream cheese. The bottleneck is portion estimation — converting what they see in a photo to actual grams. When portions are visually obvious (a single bagel, two eggs, distinct items on a plate), accuracy is excellent. When portions are ambiguous (a pile of mixed stuff, dense casseroles), errors grow.

Where I started vs ended

I ran 6 different model + prompt configurations. Started at 64% MAPE (basically useless), ended at 38%: a 26-point drop, or about 40% relative ((64 − 38) / 64), from prompt engineering and model selection alone. No fine-tuning, no training data.

The biggest wins:

Swapping models — landed on GPT-4.1 for this specific task after trying several. 14-point improvement for a one-line change. The best general model isn't always the best at a specific task.
Forcing the model to show its work — added a hidden portion_reasoning field to the schema so the model has to estimate grams before calories. 9-point improvement. Telling the model to "think carefully" did nothing; requiring it to write out reasoning in a structured field actually changed behavior. Without it, the model defaulted to generic restaurant portions and overestimated small meals by 2–3×.
Concrete anchors — "a dinner plate is ~26 cm, a fork is ~20 cm, vegetables covering less than 1/4 of a plate = 30–80 g." Way more effective than vague instructions.
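A rough sketch of what the structured-reasoning and anchor ideas could look like in code. Only the portion_reasoning field name and the anchor sentence come from this post. Every other field name, the prompt wording, and the helper functions are hypothetical, and the schema is plain JSON Schema rather than any particular provider's API.

```python
import json

# Anchor sentence quoted from the post; the prompt around it is assumed.
PORTION_ANCHORS = (
    "A dinner plate is ~26 cm, a fork is ~20 cm, "
    "vegetables covering less than 1/4 of a plate = 30-80 g."
)

# Hypothetical response schema. portion_reasoning is listed first so the
# model writes out gram estimates before it commits to calorie numbers.
MEAL_SCHEMA = {
    "type": "object",
    "properties": {
        "portion_reasoning": {
            "type": "string",
            "description": "Estimate grams per item using the size anchors.",
        },
        "items": {
            "type": "array",
            "items": {
                "type": "object",
                "properties": {
                    "name": {"type": "string"},
                    "grams": {"type": "number"},
                    "calories": {"type": "number"},
                },
                "required": ["name", "grams", "calories"],
            },
        },
        "total_calories": {"type": "number"},
    },
    "required": ["portion_reasoning", "items", "total_calories"],
}

def build_system_prompt():
    """Combine the task instruction with the concrete size anchors."""
    return (
        "Estimate the calories in the photographed meal. "
        "Fill in portion_reasoning before any calorie value. "
        + PORTION_ANCHORS
    )

def parse_estimate(raw_json):
    """Parse a model reply and reject replies that skipped the reasoning."""
    data = json.loads(raw_json)
    for key in MEAL_SCHEMA["required"]:
        if key not in data:
            raise ValueError(f"missing field: {key}")
    if not data["portion_reasoning"].strip():
        raise ValueError("portion_reasoning is empty")
    return data

# Illustrative reply, not real model output.
example_reply = json.dumps({
    "portion_reasoning": "Two eggs, roughly 100 g cooked, on a 26 cm plate.",
    "items": [{"name": "scrambled eggs", "grams": 100, "calories": 180}],
    "total_calories": 180,
})
estimate = parse_estimate(example_reply)
```

The reasoning field never has to be shown to the user. Its job is to make the gram estimate explicit before the calorie number is produced.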

Interesting failure: GPT-5.4-mini (a reasoning model) scored 65% MAPE, worse than where I started. Reasoning models aren't better at everything. For this vision task, the bottleneck looks like perception, not symbolic thinking.

The honest limits

Invisible cooking oil/dressing — if your salad has 15 g of olive oil coating the leaves, no vision model can see that. Cali will undercount by a lot. ~6% of meals in the benchmark had this problem.
Hidden proteins — chicken under a pile of lettuce gets missed. Perception ceiling, not a fixable prompt issue.
Portion estimation has an error floor: a 50 g and a 100 g piece of chicken can look the same from one angle. The ~29% median error is partly this irreducible uncertainty.
This is a rough tracker, not a precision instrument. If you need ±50 kcal accuracy, use a food scale. If you want "am I eating roughly the right amount" awareness without any effort, that's what Cali is for.

10.5% of the benchmark was unfair

During the audit I found that 21 of 200 dishes had ground-truth issues — ingredients weighed during prep but not visible in the photo (chicken in a separate container, oil absorbed into food). I flagged these and report metrics both ways. Any model claiming <10% error on Nutrition5k is either overfit or lying.
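Reporting "both ways" can be as simple as computing the same metric with and without the flagged dishes. A minimal sketch, assuming each result row carries a flagged marker; the row layout and the values are made up for illustration and are not the benchmark data.

```python
from statistics import mean

def mape(dishes):
    """Mean absolute percentage error over a list of result rows."""
    return mean(
        abs(d["predicted"] - d["truth"]) / d["truth"] * 100 for d in dishes
    )

def report_both_ways(dishes):
    """Compute MAPE on all dishes and on the unflagged (clean) subset."""
    clean = [d for d in dishes if not d["flagged"]]
    return {
        "all": mape(dishes),
        "clean": mape(clean),
        "flagged_share": (len(dishes) - len(clean)) / len(dishes) * 100,
    }

# Hypothetical rows: dish "c" has an ingredient weighed but not in the photo.
dishes = [
    {"id": "a", "truth": 200.0, "predicted": 220.0, "flagged": False},
    {"id": "b", "truth": 400.0, "predicted": 300.0, "flagged": False},
    {"id": "c", "truth": 500.0, "predicted": 200.0, "flagged": True},
]
report = report_both_ways(dishes)
print(report)
```

Keeping the flag on the row, instead of deleting flagged dishes from the dataset, means both numbers stay reproducible from the same results file.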

What's next

Prompt engineering has hit a ceiling around 35–38% MAPE. The next real improvement probably comes from asking the user one clarifying question ("lightly or heavily dressed?") — a UX change, not an ML one.
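To make the clarifying-question idea concrete, here is a hypothetical sketch of how one answer could adjust an estimate. Olive oil is about 884 kcal per 100 g, and the 15 g "heavily" figure reuses the salad example from the limits section. The other gram amounts and all the names are uncalibrated placeholders, not anything Cali does today.

```python
# Olive oil is roughly 884 kcal per 100 g.
OLIVE_OIL_KCAL_PER_GRAM = 8.84

# Hypothetical grams of oil per answer; placeholder values, not calibrated.
DRESSING_GRAMS = {"none": 0.0, "lightly": 5.0, "heavily": 15.0}

def apply_dressing_answer(base_kcal, answer):
    """Add the calories of the oil the photo could not show."""
    grams = DRESSING_GRAMS[answer]
    return base_kcal + grams * OLIVE_OIL_KCAL_PER_GRAM

# Illustrative: a salad the vision model estimated at 300 kcal.
adjusted = apply_dressing_answer(300.0, "heavily")
print(round(adjusted, 1))
```

On a 300 kcal salad, 15 g of unseen oil adds roughly 133 kcal, which is why a single question can matter more than another round of prompt tuning.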

The eval harness and all results are in the repo. Every future change gets benchmarked before shipping.

Get Cali on the App Store — free to try. Built for people who want awareness without effort. Email support@thecali.app with meals where it's way off — those are the most useful data points.