Last time I discussed some topics in the ethics of communication. Unlike many of these columns, this one actually got responses! Thanks to all (and to the readers who participate, albeit silently). Matthew suggests that focusing too much on how remarks could be taken impedes communication. I agree, and note that it cuts both ways. Alex suggests making context explicit, with disclaimers. Maybe so. But what if the disclaimers themselves are taken badly? David raises the question of non-verbal communication, which is an interesting topic for sure, but one I have no inclination to address at present.
Names as Proxy Indicators
Proxy indicators are used in scientific research when the property of interest cannot be measured directly. This is routine, found in sciences from physics to sociology. Ideally, however, such proxies are backed by well-established laws. One interesting non-ideal situation is when mere associations are used: the proverbial statistical correlation, for example. A large industry now centers on this notion: Google has, in a way, made it its entire business.
Suppose we wanted to guess your ethnicity. There is the objective sense: the actual country of origin of you or an ancestor. There is also the subjective sense: the ethnic identity to which you personally feel you belong. One fallible proxy indicator for your ethnicity is your name. For example, someone with either of my names is likely to have a Scot in their ancestry. Yet we all know of Frederick Douglass, who (second S aside) shares my family name and who was not Scottish.
Let’s think about what could go wrong with using names as proxy indicators for ethnicity. Does it matter that the tools built to “guess” might be completely opaque, in the sense that we may never understand why they make the determinations they do? Which failure mode do you prefer: “Can’t tell,” “The person with name X is likely to be Y,” “The person with name X is N% likely to be Y,” or something else? Does your answer depend on whether the name is compared against uses of that name with known origins, or against structural features of the name? For example, people with my ethnicity are unlikely to have “ng” as a syllable-initial sound in their names. Compare that with the fact that most people named “François” are French.
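The “known origins” approach amounts to a frequency lookup. Here is a toy sketch of it, with the “Can’t tell” and “N% likely” failure modes side by side. The table, its labels, and its counts are all fabricated for illustration; no real name data is being claimed.

```python
# Toy name-origin guesser: look a surname up in a table of counts by
# (entirely invented) origin labels and report either a percentage or
# an honest "Can't tell". All numbers below are fabricated.
OBSERVED = {
    "Douglas": {"Scottish": 120, "Other": 40},
    "Lopez":   {"Spanish": 300, "Other": 20},
}

def guess_origin(surname: str) -> str:
    counts = OBSERVED.get(surname)
    if not counts:
        # The "Can't tell" failure mode: the name was never observed.
        return "Can't tell"
    total = sum(counts.values())
    label, n = max(counts.items(), key=lambda kv: kv[1])
    # The "N% likely" failure mode: a confident-sounding percentage
    # that is only as good as the observations behind it.
    return f"{surname} is {100 * n // total}% likely to be {label}"

print(guess_origin("Douglas"))  # -> "Douglas is 75% likely to be Scottish"
print(guess_origin("Bozas"))    # -> "Can't tell"
```

Note that the percentage reflects only the sample the system happened to see; a Frederick Douglass is simply absorbed into the denominator.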
Imperfect data can have very strange results here. For example, if the only examples of the letter Z the system is fed occur in “Chimendez” and “Lopez,” it might conclude that “Bozas” is Spanish. And the latter name is supposed to contain a character that does not exist in ISO-8859-1, a common character encoding. What character encoding should one use? Why? What preprocessing? Should the system allow someone to write a Persian given name in Sinographs following a Thai family name in Biblical Hebrew?
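The encoding pitfall is easy to demonstrate. Perfectly ordinary name characters fall outside ISO-8859-1; the example below uses “Bożena,” a real Polish given name chosen only as an illustration (it is not the name from the anecdote above), whose “ż” (U+017C) Latin-1 cannot represent at all.

```python
# A name containing a character outside ISO-8859-1 (Latin-1).
name = "Bo\u017cena"  # "Bożena"

encoded_utf8 = name.encode("utf-8")  # works: b'Bo\xc5\xbcena'

try:
    name.encode("iso-8859-1")
except UnicodeEncodeError as exc:
    # Latin-1 has no code point for U+017C, so encoding fails outright.
    print(f"Latin-1 cannot represent {name!r}: {exc.reason}")
```

A pipeline that silently normalizes or drops such characters would feed the guesser a name that no one actually has, which is one answer to “what preprocessing?”: whatever you choose, it changes the data.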
Do some kinds of “getting it wrong” matter more than others, depending on who the individual is? For example, Iranians often have Arabic given names for religious reasons, yet may find it terribly insulting, or at least stupid, to be called Arabs. How does one handle this sensitivity without already knowing the answer to the very question one is asking?
How does this topic intersect with the “right to be forgotten”? Do your answers vary depending on the purposes for which the data is collected? I find it plausible that people do not expect their names to be a matter of statistical investigation; do my readers share that expectation? Does the expectation make any difference?
For more, check out this fun (but also serious) article called “Falsehoods Programmers Believe About Names.”