Arioon
10/8/2025 · 3 min read · Arioon Research

The skin tone bias problem in dermatology AI — and how ITA fixes it

A deep dive into why most skin analysis models underperform on Fitzpatrick IV–VI, and the colour-science technique that closes the gap.

In 2018, the Gender Shades audit from the MIT Media Lab found that commercial face-analysis classifiers misclassified darker-skinned women up to 34.7% of the time, against at most 0.8% for lighter-skinned men. Similar disparities have since been reported for dermatology AI products. The pattern is consistent enough to have a name: the fairness gap.

The root cause is colour, not data

Most analysis pipelines work in RGB or HSV. They threshold "redness" off the red channel, "darkness" off luminance, "pigmentation" off saturation. Each of those channels mixes three things into one number: pigmentation, vascular activity, and lighting. On lighter skin, those signals are roughly independent. On darker skin, they're entangled: the baseline pigmentation level dominates everything else, and naive thresholds fire on perfectly healthy skin.

You can throw arbitrary amounts of data at this and it won't fully fix the math.

Enter ITA

The Individual Typology Angle (ITA) is a formula introduced in the dermatology literature in 1991 (Chardon et al.) and widely used for skin colour classification since:

ITA = arctan( (L* − 50) / b* ) × (180 / π)

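where L* (lightness) and b* (the blue-yellow axis) come from CIELAB, a colour space designed to be approximately perceptually uniform, so that equal numeric steps correspond to roughly equal perceived colour differences.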
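As a minimal sketch of the formula above, assuming L* and b* have already been measured: the function name `ita_degrees` is ours, and using `atan2` instead of a plain arctangent of the ratio is an assumption. The two agree whenever b* is positive, which is the usual case for skin.

```python
import math


def ita_degrees(l_star: float, b_star: float) -> float:
    """Individual Typology Angle in degrees from CIELAB L* and b*."""
    # atan2(L* - 50, b*) equals arctan((L* - 50) / b*) for b* > 0
    # and avoids a division by zero when b* is exactly 0.
    return math.degrees(math.atan2(l_star - 50.0, b_star))


if __name__ == "__main__":
    # L* = 70, b* = 20 gives arctan(20 / 20) = 45 degrees.
    print(round(ita_degrees(70.0, 20.0), 1))
```

Lighter skin (L* well above 50) yields a large positive angle; darker skin (L* below 50) yields a negative one.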

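ITA collapses skin tone into a single scalar. It is not strictly lighting-invariant, since L* and b* shift with illumination and white balance, but under colour-calibrated capture it is stable and maps approximately to the Fitzpatrick scale: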

| ITA range | Fitzpatrick | Tone group |
| ---------- | ----------- | ------------ |
| > 55° | I–II | Very Light |
| 41–55° | II–III | Light |
| 28–41° | III–IV | Intermediate |
| 10–28° | IV–V | Tan |
| -30 to 10° | V–VI | Brown |
| < -30° | VI | Dark |
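The table translates directly into a lookup. The sketch below is illustrative: the name `tone_group` is ours, and because the table does not say which group owns an exact boundary value, assigning 55°, 41°, 28°, 10° and -30° to the darker group is an assumption.

```python
# (lower bound in degrees, tone group), checked from lightest to darkest.
ITA_BANDS = [
    (55.0, "Very Light"),
    (41.0, "Light"),
    (28.0, "Intermediate"),
    (10.0, "Tan"),
    (-30.0, "Brown"),
]


def tone_group(ita: float) -> str:
    """Map an ITA value in degrees to the tone group from the table."""
    for lower_bound, name in ITA_BANDS:
        # Strict comparison: a value exactly on a boundary falls into
        # the next (darker) group. This tie-break is an assumption.
        if ita > lower_bound:
            return name
    return "Dark"


if __name__ == "__main__":
    for angle in (60, 45, 30, 15, 0, -40):
        print(angle, tone_group(angle))
```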

Why it works

Once you've measured a subject's ITA, you can compare every other metric against the expected distribution for that tone group, not a one-size-fits-all baseline. Pigmentation analysis becomes "deviation from this person's tone-group norm," not "darker than light skin." That single change closes most of the fairness gap in CV-based detectors — and gives ML-based detectors a clean signal to debias against.
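One way to express "deviation from this person's tone-group norm" is a z-score against reference samples for that group. This is a sketch under stated assumptions, not our production scoring: `tone_relative_score` is a hypothetical name, the z-score is one choice among several, and the reference numbers are invented placeholders rather than measured data.

```python
from statistics import mean, stdev

# Placeholder readings of some pigmentation metric per tone group.
# These numbers are invented purely for illustration.
REFERENCE = {
    "Light": [0.10, 0.12, 0.14],
    "Brown": [0.40, 0.42, 0.44],
}


def tone_relative_score(value, group, reference=REFERENCE):
    """Standard deviations between `value` and the group's own mean."""
    samples = reference[group]
    spread = stdev(samples)
    if spread == 0:
        raise ValueError("reference samples have no spread")
    return (value - mean(samples)) / spread


if __name__ == "__main__":
    reading = 0.42
    print(tone_relative_score(reading, "Brown"))
    print(tone_relative_score(reading, "Light"))
```

With these placeholder values, the same raw reading of 0.42 sits at the centre of the Brown reference group but about 15 standard deviations above the Light one, which is exactly the false alarm a one-size-fits-all baseline produces.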

We measure ITA on the nasal bridge specifically. It's the most stable region across expressions and lighting, and it's the least affected by makeup or transient redness.
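To make the measurement step concrete, here is a rough sketch that turns a patch of 8-bit sRGB pixels into one ITA value. Several things here are assumptions: the patch is taken as already cropped to the nasal bridge (landmark detection is out of scope), the image is treated as standard sRGB with a D65 white point, the median is used to aggregate pixels, and the names `srgb_to_lab` and `patch_ita` are hypothetical.

```python
import math
from statistics import median

# D65 reference white, the white point assumed for sRGB input.
XN, YN, ZN = 0.95047, 1.0, 1.08883


def _linear(c8):
    """Undo the sRGB transfer curve for one 8-bit channel."""
    c = c8 / 255.0
    return c / 12.92 if c <= 0.04045 else ((c + 0.055) / 1.055) ** 2.4


def _f(t):
    """CIELAB nonlinearity: cube root with a linear toe near zero."""
    delta = 6 / 29
    return t ** (1 / 3) if t > delta ** 3 else t / (3 * delta ** 2) + 4 / 29


def srgb_to_lab(r, g, b):
    """Convert one 8-bit sRGB pixel to CIELAB (L*, a*, b*) under D65."""
    rl, gl, bl = _linear(r), _linear(g), _linear(b)
    x = 0.4124564 * rl + 0.3575761 * gl + 0.1804375 * bl
    y = 0.2126729 * rl + 0.7151522 * gl + 0.0721750 * bl
    z = 0.0193339 * rl + 0.1191920 * gl + 0.9503041 * bl
    fx, fy, fz = _f(x / XN), _f(y / YN), _f(z / ZN)
    return 116 * fy - 16, 500 * (fx - fy), 200 * (fy - fz)


def patch_ita(pixels):
    """ITA in degrees from the median L* and b* of (r, g, b) pixels."""
    labs = [srgb_to_lab(*p) for p in pixels]
    l_med = median(lab[0] for lab in labs)
    b_med = median(lab[2] for lab in labs)
    return math.degrees(math.atan2(l_med - 50.0, b_med))


if __name__ == "__main__":
    light_patch = [(240, 210, 190)] * 4  # made-up pixel values
    dark_patch = [(90, 60, 45)] * 4      # made-up pixel values
    print(patch_ita(light_patch), patch_ita(dark_patch))
```

The median is used here because it tolerates a few specular highlights or stray hairs in the patch better than a mean would. A real pipeline would also need colour calibration before this step, since uncorrected white balance shifts b* directly.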

What it isn't

ITA isn't a silver bullet. It captures one dimension of skin variation (lightness against yellowness). It doesn't capture undertone variation, hemoglobin levels, or the impact of post-inflammatory hyperpigmentation. Those still need detector-specific care. But ITA gives you the foundation: a measurable, defensible answer to the question "whose norm should we compare this person to?"

That answer is the first thing any production dermatology AI needs to ship.