The test: 30 photos, four systems
I ran the same 30 photos through four tools in April 2026: ChatGPT (GPT-5.2), Gemini 2.5 Pro, Claude Opus 4.7, and Pl@ntNet (the community-driven plant-ID app). Photos ranged from "phone snap of a Pothos in decent light" to "blurry cutting of a rare aroid from a plant swap" to deliberate tricks — a Philodendron hederaceum that 90% of sellers mislabel as Pothos.
Each system got the photo and the prompt "What plant is this? Include the scientific name and cultivar if identifiable." Nothing else. The photos were split across five categories: common houseplants, tricky look-alikes, cultivars/variegated forms, rare species, and edge cases (damaged leaves, propagation cuttings, young plants).
Category 1: Common houseplants (10 photos)
The staples, including Monstera deliciosa, Pothos, snake plant, ZZ plant, spider plant, peace lily, rubber plant, fiddle leaf fig, and jade plant. Under reasonable lighting with a clear foliage shot, all three chatbots nailed these; Pl@ntNet missed one.
- ChatGPT: 10/10, named genus and species correctly on all.
- Gemini: 10/10, and the only one to point out that the unfenestrated Monstera deliciosa in the photo was a juvenile plant, not "Monstera borsigiana."
- Claude: 10/10, with the most consistent uncertainty calibration: it explicitly noted when a cultivar call was speculative.
- Pl@ntNet: 9/10. It misidentified a Monstera deliciosa as Monstera adansonii (the community suggestion leaned wrong on a poorly lit photo).
Category 2: Tricky look-alikes (8 photos)
Where things got interesting. Pothos vs Philodendron hederaceum, Mini Monstera (Rhaphidophora tetrasperma) vs Philodendron vs real Monstera, Pilea peperomioides vs Peperomia polybotrya, Hoya carnosa vs kerrii.
All four tools drop here, the three chatbots most sharply. The general-purpose LLMs default to the more common species, so ambiguous Pothos-or-Philodendron photos skewed Pothos even when the plant was Philodendron. In fairness, the diagnostic features (smooth vs grooved petiole, extrafloral nectaries, leaf texture) are exactly what's hard to see in a phone photo.
- ChatGPT: 5/8. Of its 3 misses, 2 were confidently wrong and 1 was appropriately hedged.
- Gemini: 6/8. When it pulls in Google Lens results, it catches more nuance: "image matches Philodendron hederaceum images more closely than Epipremnum aureum" was a correct disambiguation.
- Claude: 6/8. Most likely to say "I can't distinguish these from this angle; here are the features to check." Slightly lower headline accuracy than Pl@ntNet, better calibration than the other chatbots.
- Pl@ntNet: 7/8. Community-verified image matches trump LLM reasoning here: people who have correctly labelled thousands of Pothos photos are a stronger signal than leaf-shape inference from a language model.
Category 3: Cultivars and variegated forms (6 photos)
Philodendron 'Birkin' vs 'Rojo Congo', Monstera 'Thai Constellation' vs 'Albo Borsigiana', Pothos 'Marble Queen' vs 'N'Joy', Alocasia 'Polly' vs 'Sumo'. This is where the wheels come off the general chatbots.
ChatGPT and Claude both regularly produced plausible-sounding but wrong cultivar names (for example, "this is Philodendron 'White Knight'") when the actual plant was a different cultivar entirely. ChatGPT did so with high confidence; Claude flagged its misses as uncertain. The underlying issue: cultivar-level identification depends on trade names, recent nursery releases, and lineage details that aren't well represented in training data.
- ChatGPT: 2/6. Confident hallucinations on 3.
- Gemini: 3/6. When it falls back to image search, its hit rate goes up.
- Claude: 2/6, but it flagged uncertainty on all 4 it got wrong. Calibration matters: a flagged "I'm not sure" is actionable information.
- Pl@ntNet: 4/6. Cultivars are better represented in a community dataset tagged by people who actually grow them.
Category 4: Rare species (4 photos)
Anthurium warocqueanum (velvet queen), Philodendron gloriosum, Alocasia jacklyn, Scindapsus treubii 'Moonlight'. Not rare in collector circles but rare in general training data.
- ChatGPT: 1/4. Frequently defaults to a common species with a similar leaf shape (it called a gloriosum "Philodendron hederaceum").
- Gemini: 2/4. Lens lookup helps.
- Claude: 2/4, with explicit uncertainty flagged on 3 of the 4.
- Pl@ntNet: 3/4. Collector-submitted images dominate this tier.
Category 5: Edge cases — damaged, young, or cuttings (2 photos)
Two photos: a brown-spotted, mostly yellow Monstera leaf and a freshly rooted Pothos cutting with no roots visible. All four systems identified both correctly. The brown-spotted leaf also prompted ChatGPT and Claude to suggest a likely cause (overwatering) alongside the ID, which was helpful. Pl@ntNet identifies the species but won't comment on health.
Headline scores and the calibration gap
Across the 30 photos (summing the five category scores above): Pl@ntNet 25/30 (83%), Gemini 23/30 (77%), Claude 22/30 (73%), ChatGPT 20/30 (67%).
But raw accuracy is the wrong metric. What matters for someone identifying a plant is the chance of being confidently told the wrong answer. On that axis (the rate of confident misidentification with no uncertainty signal) ChatGPT was the worst performer with 8 confident wrong answers, Gemini had 4, and Pl@ntNet and Claude tied at 3. An AI that says "I don't know" when uncertain is more useful for real decisions than an AI that's 10% more accurate and 30% more likely to confidently invent an answer.
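As a rough illustration of that metric, the confident-wrong counts can be turned into per-tool rates. This is a minimal sketch, not the scoring script used for the test: the dictionary and function names are made up for the example, and the counts are the ones quoted in the paragraph above.

```python
# Confident-wrong counts as quoted in the text, out of 30 photos per tool.
TOTAL_PHOTOS = 30
confident_wrong = {"ChatGPT": 8, "Gemini": 4, "Claude": 3, "Pl@ntNet": 3}


def confident_wrong_rate(count: int, total: int = TOTAL_PHOTOS) -> float:
    """Share of all photos answered wrongly with no uncertainty signal."""
    return count / total


rates = {tool: confident_wrong_rate(n) for tool, n in confident_wrong.items()}

# Print the worst offender first.
for tool, rate in sorted(rates.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{tool}: {rate:.0%}")
# ChatGPT: 27%
# Gemini: 13%
# Claude: 10%
# Pl@ntNet: 10%
```

On this sample ChatGPT gave a confident wrong answer on roughly one photo in four, against one in ten for Claude and Pl@ntNet. With only 30 photos, treat these rates as indicative.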
Practical usage: which tool for which job
The test suggests a clear division of labour.
- For initial species ID on common houseplants: any of them works. Use whichever is already on your phone.
- For tricky look-alikes or cultivar-level ID: Pl@ntNet first, then cross-check with Gemini (which leverages Lens image search). Don't rely on ChatGPT alone here.
- For rare or collector plants: Pl@ntNet or another dedicated app such as PictureThis (not part of this test); follow up in a specialist community (r/houseplants, r/aroids, Facebook groups).
- For care advice after ID: ChatGPT or Claude. LLMs are excellent at synthesising the care needs of a correctly identified plant, which is what you actually want next.
- For pet-safety decisions: never trust a chatbot alone. Cross-reference the ASPCA toxic plants database. See pet-safe houseplants.
- For picking up a plant from a plant swap with a handwritten label: photograph it, run it through Pl@ntNet and Claude, ask Claude explicitly "what confidence level?", and trust the answer only if both agree.
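The plant-swap rule can be written down as a small check. The sketch below is one possible reading of "trust the answer only if both agree": the two tools must return the same scientific name, and neither may have flagged uncertainty. The `Identification` record, its fields, and `trust_identification` are hypothetical names, and the results are typed in by hand rather than fetched from either tool.

```python
from dataclasses import dataclass


@dataclass
class Identification:
    """One tool's answer for one photo (hypothetical record, entered by hand)."""
    tool: str
    scientific_name: str
    flagged_uncertain: bool = False


def normalise(name: str) -> str:
    """Lower-case and collapse whitespace so trivial differences don't block a match."""
    return " ".join(name.lower().split())


def trust_identification(first: Identification, second: Identification) -> bool:
    """Trust only when both tools agree on the name and neither flagged uncertainty."""
    same_name = normalise(first.scientific_name) == normalise(second.scientific_name)
    both_confident = not (first.flagged_uncertain or second.flagged_uncertain)
    return same_name and both_confident


plantnet = Identification("Pl@ntNet", "Philodendron hederaceum")
claude_agrees = Identification("Claude", "philodendron  hederaceum")
claude_unsure = Identification("Claude", "Epipremnum aureum", flagged_uncertain=True)

print(trust_identification(plantnet, claude_agrees))  # True
print(trust_identification(plantnet, claude_unsure))  # False
```

Treating a flagged "I'm not sure" as a veto is an assumption layered on top of the rule; a looser version could accept agreement on the name alone.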
The hallucination problem, concretely
The most important thing to understand about using chatbots for plant ID: when they are wrong, they are usually confident. I asked ChatGPT to identify a Scindapsus treubii 'Moonlight' from a clear photo. It responded: "This is Monstera adansonii, commonly known as Swiss Cheese Plant." Not correct, not close. A user who trusts this answer then goes to buy "more Monstera adansonii" and ends up with a completely different genus.
This pattern (wrong answer, full confidence, zero uncertainty signal) is the default failure mode of LLMs without tool-calling. A model trained to produce fluent text tends to give the most plausible-sounding answer rather than abstain. In this test Claude did slightly better and flagged uncertainty more often, which may reflect explicit training on calibration; ChatGPT and Gemini more often defaulted to the most plausible-sounding answer.
The practical consequence: never treat a chatbot answer as final for anything that matters, whether that is a plant you're paying for, a plant your cat might chew on, or a cutting you're about to root. Cross-check. The guide to identifying a houseplant from a photo walks through the two-tool rule in detail.
When chatbots beat dedicated apps
Specialist ID apps are better at "what species is this." But there are questions where a chatbot is meaningfully better.
- "I have this plant, it has this symptom. What's wrong?": chatbots synthesise diagnosis and species-specific care in one step.
- "I have a north-facing window and a toddler, what plants should I buy?": conversational requirements-gathering is the chatbot sweet spot.
- "Translate the care label on this Japanese packaging": OCR, translation, and context in one.
- "My Calathea's leaves are curling. Is this humidity or watering?": ambiguous symptom triage works well when the ID is already known.


