A digital copy of a human: Can a human be copied?

The Scientific American article is about creating a digital copy of a human from 1,000+ questions and answers. Will it work? Well… read the article, and you will see that the writer was a bit disappointed.

Then why doesn’t it feel like another piece of AI magic was at work in this project?

Let’s go into a bit more depth.

A person can be decomposed into a million factors. Or it could be only 1,000 factors (or fewer).

Is that true? Shouldn’t the list of factors be more or less infinite?

Anyway, for starters, deep learning is essentially a multi-layered version of factor analysis.

Each layer is a combination of N factors, containing (theoretically) exactly the same amount of information as any other layer. So, in theory, there can be K layers of possible representations, each with a different set of N variables. In stage k, for example, there will be N_{k1}, N_{k2}, N_{k3}, …, N_{kN}.
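As a minimal sketch of the idea that two layers can hold the same information in different variables, the Python snippet below rotates a made-up layer of N = 4 variables with an orthogonal matrix and then recovers the original values. The data, the sizes, and the names (layer_k, rotation, layer_next) are assumptions for illustration only. Real deep learning layers are non-linear and usually lose some information, so this only shows the idealized, theoretical case.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical layer k: 5 people described by N = 4 variables.
layer_k = rng.normal(size=(5, 4))

# A random orthogonal matrix (from a QR decomposition) stands in for an
# information-preserving transformation between two layers.
rotation, _ = np.linalg.qr(rng.normal(size=(4, 4)))

# Layer k+1: the same people, expressed in 4 different variables.
layer_next = layer_k @ rotation

# Because the rotation is invertible, layer k can be recovered from layer k+1.
recovered = layer_next @ rotation.T
same_information = bool(np.allclose(recovered, layer_k))
```

The numbers in layer_next look nothing like those in layer_k, yet nothing is lost in going from one to the other.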

(If you’re not into the theory, feel free to skip the next three paragraphs.)

The N_{kn} variable can be anything: the income level at age 30, 40, or 50, or the wealth level at age 30, 40, or 50. You can go into a bit more detail, such as the income level at age 30.52, that is, 6 months and a few days after the 30th birthday. For some people, the number will be nearly flat across ages, and others may show very little growth over time. For a few people, there will be a series of spikes. There is certainly a group of people with a flat 0 all along.

There will simply be infinitely many possible sets. And that layer is a representation of a person by income level at different ages, with some extra information. That might be good enough to build a person’s digital copy. It’s just one layer of a long deep learning network, and as long as your representation in that particular layer is right, you can transform that data into a different format in the next layer.

This is how deep learning works when it extracts features from an image of a cat. After a long series of layers (or data transformations), at a certain point the transformation is good enough to discern the animal in the image, be it a dog or a cat.

(Resume here if you skipped ahead.)

For a human, at a certain depth of transformation, the layer can be wealth and income level, education level, and/or how childhood trauma affects the next 50 years of life. Designing such a network is an almost impossible task, but this is what deep learning is all about.

Digital copy of a human

If that’s applied to copying a human digitally, and if all possible data can be scooped into the model, then in theory we should be able to copy a human.

Depending on where and how you look at it, the digitalized human can have different aspects. From a DNA perspective, even with identical DNA, we still see divergence between biological twins. Therefore, a complete digitalized copy must have both the DNA and all environmental effects. Otherwise the layer will end up with incomplete information, and the next-stage transformation will also suffer from bias caused by the missing data.

I’m just not sure how we can model the environmental effects, but if it is possible, they can stand as one layer in the deep learning model.

A similar idea was portrayed in a movie called Another Earth, starring Brit Marling.

(She seems to have a taste for similar multiverse-style stories, so if you would like to explore more of this stream of possibilities, check ‘The OA’ Seasons 1 and 2 on Netflix. It’s known that she originally planned 6 seasons, but sadly Netflix cancelled additional funding after Season 2. So, not that popular an idea, I guess?)

The DNA is exactly the same, but on the other earth, each individual’s lifestyle is distinctly different.

If that is truly so, copying DNA does not help us copy a person. We can only make a ‘biological copy’. How can we fill in the missing environmental variables? I teach students that missing variables are one of the key causes of endogeneity. Without an additional factor that can clean out the endogeneity, the model is riddled with bias.

(If you don’t know what endogeneity is, I personally don’t think you are qualified to run any sort of data-based project. It’s like claiming a law degree without a high school education. Google the term or go to the free courses provided by SIAI Square.)
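To make the missing-variable problem concrete, here is a minimal simulated sketch in Python. Everything in it is an assumption for illustration: the variable names (observed_trait, environment), the coefficients, and the sample size are made up and have nothing to do with the article’s data. It regresses an outcome on an observed variable with and without the correlated ‘environment’ variable and compares the estimated coefficient.

```python
import numpy as np

def simulate_omitted_variable_bias(n=20_000, seed=0):
    """Return the coefficient on the observed trait with and without the
    environment variable included in the regression (simulated data)."""
    rng = np.random.default_rng(seed)
    environment = rng.normal(size=n)
    # Assumption: the observed trait is correlated with the environment.
    observed_trait = 0.8 * environment + rng.normal(size=n)
    # Assumed true model: both variables matter, each with coefficient 1.0.
    outcome = 1.0 * observed_trait + 1.0 * environment + rng.normal(size=n)

    ones = np.ones(n)
    x_full = np.column_stack([ones, observed_trait, environment])
    x_short = np.column_stack([ones, observed_trait])  # environment is missing
    beta_full, *_ = np.linalg.lstsq(x_full, outcome, rcond=None)
    beta_short, *_ = np.linalg.lstsq(x_short, outcome, rcond=None)
    return beta_full[1], beta_short[1]

coef_full, coef_short = simulate_omitted_variable_bias()
```

With the environment included, the estimate sits close to the assumed true value of 1.0. With it left out, the estimate absorbs part of the environment’s effect and lands well above 1.0 (about 1.49 in expectation under these assumed coefficients). That gap is the bias.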

Statistical perspective on the digital copy of a human

Then what if we completely drop the dependence on DNA and just focus on personality, and create a ‘digital copy’ of that? At the end of the day, we don’t need to copy a human. We only need a digital version.

As long as we can build one layer of factors in a deep learning model, we can just use autoencoding to transform the data into different sets of variables. So, excluding the biological part, the challenge may become simpler?

From a long list of observational data, such as questions and answers, we may be able to build a proxy for the personality. If the data is sufficient, in theory, the model may mitigate the ‘missing data’ problem.

That’s the statistical perspective on the Scientific American article’s test.

It’s bound to fail. Reading through the article, I could already tell that the writer must have been dissatisfied, and indeed that was the case.

Why did it fail to create the digital copy?

To construct a realistic ‘digital copy’, you need a lot more information than just 1,000 questions and answers. If this kind of ‘childish’ or ‘engineeristic’ exercise could succeed, the neuroscience field would have to be far more developed than it is now. I think they would already have created an actual copy of a human, like we see in SF movies. (And I think humankind would not have evolved to this level had our true selves been that simple.)

Reality, unfortunately, is far more diverse, which is well described by the aforementioned movie, Another Earth. Just one tiny difference in environment, despite the same DNA, can make a distinct difference.

The reason I don’t like to associate with ‘AI Engineers’ who lack proper scientific knowledge can also be discussed at this point. The Korean researcher in the Scientific American article must have thought that 1,000 answers might be enough to cover a large portion of a human. Since the writer’s response is not positive, he will probably just extend the list of questions. But such an engineering approach is not going to give him anything.

Why not?

Because whatever he does with a long list of questions, from the perspective of factor analysis, he won’t be able to add more factors. More questions will most likely fall into the same factors, like redundancy. You need to pull out a new factor, but that’s not easy to do just by asking questions.

For example, in the first year of my PhD, I tried to explain financial returns with 250+ variables from Wharton Research Data Services (WRDS). After a simple run of Principal Component Analysis (PCA), I could see that only 3 factors covered over 75% of the variance. I extended that exercise to covariances (Factor Analysis) and on to third moments and beyond (non-linear PCA), but the results were similar. (I could go on with more theoretical explanation, but let’s save that for MSc/PhD courses.)
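The pattern can be reproduced on synthetic data. The sketch below is not the WRDS exercise: it assumes 250 made-up variables that are all noisy mixtures of 3 hidden factors, and the function names, noise level, and sample size are hypothetical choices. It then runs a plain PCA through NumPy’s SVD and reads off how much variance each component explains.

```python
import numpy as np

def explained_variance_ratio(data):
    """Share of total variance captured by each principal component."""
    centered = data - data.mean(axis=0)
    singular_values = np.linalg.svd(centered, compute_uv=False)
    variances = singular_values ** 2
    return variances / variances.sum()

def simulate_observed_variables(n_obs=2_000, n_vars=250, n_factors=3,
                                noise=0.5, seed=0):
    """Assumed data-generating process: every variable is a noisy linear
    mix of the same few latent factors."""
    rng = np.random.default_rng(seed)
    factors = rng.normal(size=(n_obs, n_factors))
    loadings = rng.normal(size=(n_factors, n_vars))
    return factors @ loadings + noise * rng.normal(size=(n_obs, n_vars))

ratios = explained_variance_ratio(simulate_observed_variables())
top3_share = ratios[:3].sum()  # variance share of the first 3 components
```

Under these assumptions the first 3 components carry most of the variance and the 4th adds almost nothing, even though the data set has 250 columns.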

Even with 10,000 questions, the project team will be able to add barely a few more factors. In other words, you will end up with incomplete data.
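A rough way to see this is to simulate questionnaires of different lengths. The sketch below assumes, purely for illustration, that every answer is driven by the same 5 hypothetical latent traits plus noise. The names (simulate_answers, components_needed), the 90% threshold, and the 300 simulated respondents are all made up. It counts how many principal components are needed to reach the threshold with 1,000 and with 10,000 questions.

```python
import numpy as np

def components_needed(data, threshold=0.90):
    """Number of principal components needed to reach a cumulative
    share of total variance."""
    centered = data - data.mean(axis=0)
    singular_values = np.linalg.svd(centered, compute_uv=False)
    shares = singular_values ** 2 / np.sum(singular_values ** 2)
    return int(np.searchsorted(np.cumsum(shares), threshold) + 1)

def simulate_answers(n_questions, n_people=300, n_factors=5,
                     noise=0.3, seed=0):
    """Assumption: all questions load on the same few latent traits."""
    rng = np.random.default_rng(seed)
    traits = rng.normal(size=(n_people, n_factors))
    loadings = rng.normal(size=(n_factors, n_questions))
    return traits @ loadings + noise * rng.normal(size=(n_people, n_questions))

needed_1k = components_needed(simulate_answers(1_000))
needed_10k = components_needed(simulate_answers(10_000))
```

In this setup, ten times more questions leave the number of components unchanged, because the extra questions only re-measure the traits that were already there. Whether real questionnaires behave this way depends on whether new questions tap genuinely new factors, which is exactly the hard part.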

With an incomplete data set, either your dream of creating a ‘digital copy’ of a human has to be downgraded to a digital chatbot, or the ‘digital copy’ will just be riddled with endogeneity from missing data.

To AI team managers, AI Engineers, and AI Dreamers,

Learn some statistics. Plz. Not just mean and variance. Learn the concept of endogeneity caused by missing data, measurement error, and simultaneity, for example. These are topics that typical British 2nd-year undergrad programs teach. I have seen that LSE’s 2nd-year students in Economics and Data Science are fully equipped with such knowledge.

With that undergrad junior-level understanding, you wouldn’t waste your money and time on such a shoddy project. I don’t know who is funding the project in that Scientific American article, but if I were the top management of the venture fund, I would have turned down the project, and I would have fired the investment team if they had already committed.

In fact, this is why all VCs need to learn some statistics, at least up to the British 2nd year undergrad level, so that they can screen out such a fruitless project early on. And more importantly, this is why AI Engineers should have some serious statistics training.

I come across a lot of stories like the one in the Scientific American article. AI Dreamers may build fantasies on them, but people with decent factor analysis training will just walk away.

The aforementioned course is free on SIAI Square, the open platform supported by SIAI. It is the very first part of the official degree program at SIAI’s Gordon School of Business and Artificial Intelligence. In fact, we made it the admission exam.

Why? We only provide free training where it will be respected — because rigor is wasted on those chasing hype.

Wow, I just found out that Brit Marling not only starred in Another Earth and The OA, but also wrote them. A Hollywood actress with stunning beauty + intellectual grounding to bring multiverse theory onto the screen — and even led the projects herself?