Survivorship bias is a type of sampling bias in which only the data points that survived some process or event are examined or used to train a model. Conclusions drawn from such data can be inaccurate, because the cases that did not survive are missing from the sample and a model built on the survivors reflects only outcomes like theirs.
For example, consider the problem of where to reinforce aircraft to protect them from fatal bullet damage. If we look only at the planes that made it back to base, we might conclude that the areas with the most bullet holes are the most important ones to reinforce. This would be a mistake. The returning planes show where an aircraft can be hit and still fly home. Planes hit in critical areas were more likely to be shot down, so their damage never appears in the data, and it is the areas with little damage on the survivors that most need protection.
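The effect described above can be reproduced with a small simulation. The sketch below is only an illustration: the section names, hit shares and per-hit lethality values are invented, and the model makes the simplifying assumption that every plane receives all of its hits before its fate is decided. It compares where hits land across all planes with where they appear on the planes that return.

```python
import random

# Hypothetical sections: (share of hits received, chance that one hit downs the plane).
# These numbers are invented for illustration, not historical data.
SECTIONS = {
    "engine":   (0.25, 0.60),
    "cockpit":  (0.10, 0.40),
    "fuselage": (0.40, 0.05),
    "wings":    (0.25, 0.05),
}


def simulate(n_planes=20_000, max_hits=5, seed=0):
    """Return (hits on all planes, hits on returning planes), counted per section."""
    rng = random.Random(seed)
    names = list(SECTIONS)
    weights = [SECTIONS[name][0] for name in names]
    all_hits = dict.fromkeys(names, 0)
    survivor_hits = dict.fromkeys(names, 0)

    for _ in range(n_planes):
        # Each plane takes between 0 and max_hits hits, spread by section weight.
        hits = rng.choices(names, weights=weights, k=rng.randint(0, max_hits))
        # The plane returns only if every hit turns out to be non-lethal.
        returned = all(rng.random() >= SECTIONS[s][1] for s in hits)
        for s in hits:
            all_hits[s] += 1
            if returned:
                survivor_hits[s] += 1
    return all_hits, survivor_hits


def shares(counts):
    """Convert raw hit counts into each section's share of the total."""
    total = sum(counts.values())
    return {name: count / total for name, count in counts.items()}


all_hits, survivor_hits = simulate()
all_share, survivor_share = shares(all_hits), shares(survivor_hits)
for name in SECTIONS:
    print(f"{name:9s} all planes: {all_share[name]:.2f}   returned only: {survivor_share[name]:.2f}")
```

Under these assumptions the engine accounts for a smaller share of the damage seen on returning planes than of the damage dealt overall, while the fuselage and wings look worse than they are. An analyst who sees only the second column would be tempted to armour the wrong places.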
The Hungarian mathematician Abraham Wald faced this problem during World War II. He realised that the risk of damage to an aircraft could be assessed accurately only by accounting for the planes that did not make it back as well as those that did.
Wald developed a mathematical method for estimating the probability of an aircraft surviving a given number of hits. He also took into account how vulnerable different areas of the aircraft were, such as the engines, fuselage and fuel system. This information was then used to guide the placement of armour plating.
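To give a feel for how survival probabilities can be estimated from returning planes alone, here is a deliberately simplified sketch. It is not a reconstruction of Wald's actual calculations. It assumes that each hit is survived independently with the same unknown probability q, and that planes with no hits always return. If a_j is the fraction of all dispatched planes that came back with exactly j hits, then a_j = c_j * q^j, where c_j is the fraction of planes that drew j hits. Because the c_j sum to 1 and c_0 = a_0, q must satisfy a_1/q + a_2/q^2 + ... = 1 - a_0. The function name `survival_prob_per_hit` and the example figures are hypothetical.

```python
def survival_prob_per_hit(returned_fractions, tol=1e-10):
    """Solve sum_{j>=1} a_j / q**j = 1 - a_0 for q by bisection.

    returned_fractions[j] is a_j: the fraction of all dispatched planes
    that returned with exactly j hits. Assumes at least some returning
    planes were hit, so that a root exists in (0, 1].
    """
    a0, rest = returned_fractions[0], returned_fractions[1:]

    def excess(q):
        # Positive when q is too small, non-positive when q is large enough.
        return sum(a / q ** j for j, a in enumerate(rest, start=1)) - (1 - a0)

    lo, hi = 1e-6, 1.0  # excess decreases in q, so the root is bracketed
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if excess(mid) > 0:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2


# Invented example: fractions of dispatched planes returning with 0, 1, 2, 3 hits.
returned = [0.40, 0.24, 0.128, 0.0512]
q = survival_prob_per_hit(returned)
print(f"estimated chance of surviving a single hit: {q:.3f}")
```

The key point is that the missing planes enter the calculation through the gap between 1 and the observed fractions: the estimate uses how many planes failed to return, even though their damage was never seen.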
Wald's work on survivorship bias is still relevant today. It is a reminder to ask how the data we use for inference were selected, what is missing from them, and how that absence could bias our conclusions.