Why we should continue believing that every company needs to leverage data analytics

As someone who has had a glimpse of both the peril and the promise of data analytics, I still hope that companies in Malaysia’s public and private sectors will believe in the benefits of using data analytics to solve their business problems, and make it a priority. I returned from the US thinking that Malaysia was still far behind, but I was not entirely correct. There is certainly awareness. More and more companies are jumping on the bandwagon to try it. The words “big data”, “analytics”, “machine learning” and “AI” pop up on multiple pages of public listed companies’ annual reports.

However, I have to say that my faith in the future of this sometimes gets tested. Firstly, the quality of the data and the state of the data infrastructure that companies have require a lot of massaging and improvement. Secondly, there is a lack of talent in this area. Thirdly and most importantly, there is a lack of belief or buy-in from CEOs and senior management to enforce execution and provide the right support and environment.

Hence, we need more advocacy from leaders in Malaysia, especially those who believe in the promise of data analytics, as rightly captured by the article from the recent Future of Work conference. You can read it here. I’m mad at myself for not attending that conference.

Heart for Data Analytics

There are many reasons why I am passionate about data analytics.

But one of the reasons that has kept my interest going is explained by Bill Gates in his recent post, “Here’s one great way to use your tech skills”. While I’m not a coder like the William guy he mentioned, I certainly know the endless possibilities that data analytics offers for solving social problems. And if there’s one problem that I could apply it to, it is education.

Imagine if we were able to collect data about students’ activities and behaviour in school: the time they arrive, the number of times they are on MC (medical leave) or miss class, their grades, their responsiveness in class and their areas of interest. We could identify trends and the root causes of why some students perform (or not), and understand the drivers behind them, to help us come up with more customized solutions that help the students excel.

Of course, it’s easier said than done. Coming up with the analysis is the easy part. The difficult part is getting good data. We need to encourage teachers to start collecting data from the students; they probably have some already, but only in pockets and not in a structured manner. If you are interested in this problem, do let me know!

What project to choose for machine learning

It’s interesting how simply Andrew Ng explains what machine learning can and cannot do in his “AI for Everyone” course, which I’m currently taking. At the very least, it helps me think of projects that could be tested for machine learning.

Rather than listing all the problems we have (in an organization), think of the activities or tasks that take you no more than 1 second of thought to do or decide. These are ‘simple concept’ tasks which do not need a lot of mental effort, especially the ones that you are currently doing manually. These can be handed to computers: machine learning can help cut down your manual work, provided you have enough data to supply in terms of volume, richness and completeness (i.e. there are both inputs and outputs).

Try it and start cracking your head to list down all the relevant tasks.

By doing so, we can start doing pilot projects and assess whether it’s feasible to continue on a bigger scale. The idea is to execute multiple projects in a year. According to Andrew, taking 1 year to implement 1 AI project is extremely long. We need to do more than that to speed up our learning process.

AI for Everyone on Coursera

Early last month I mentioned the “AI for Everyone” course on Coursera, taught by Andrew Ng. I advocated taking it, but I hadn’t signed up myself until today. So I just did.

After I signed up, this message appeared (see bottom right of the picture below):

[Image: aiforeveryone]

What a clever way to get me started immediately. So I watched the first video, i.e. the introduction. To my surprise, the Coursera platform has improved tremendously. The last time I took a course on Coursera was a few years ago (I didn’t finish it and I can’t even remember the name of the course). Here’s what I found extremely useful so far:

  1. A transcript of each video is readily available below the video. So if you miss some of the words the lecturer said, you can just read the transcript instead of replaying the video multiple times.
  2. You can also easily save any part of the video as a note. You can easily replay the saved parts, and the transcript is automatically downloaded as well, so you can choose to replay or read the transcript. All you need to do is press the “SAVE NOTE” button. See below.

[Image: savenote]

The notes will appear as below:

[Image: Notes]

See how Coursera has made our lives easy just to encourage us to learn (and use their platform)?

MSC Malaysia Revenue by Focus Area and Cluster

I was curious to see what other data is available on my country’s open data portal, as I have done before. I was drawn to the data on MSC Malaysia, which shares the revenue breakdown by focus area and cluster; I believe the former is a subset of the latter. I’ve used Power BI to visualize the data as below. As usual, before I could import the data into Power BI, I had to transpose it into a database format. I wish those who share data on the open data portal would present it in a database format instead of a table/summary format. The point of making the data accessible to everyone is also to make it easier for them to analyse. It’s database 101: the first row is the header for all columns, and the first column is the key reference to the data in the other columns.
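A minimal sketch of that reshaping step in Python, assuming the portal’s summary table has one row per focus area and one column per year. The table contents, the column names and the `to_database_format` helper are all hypothetical; the numbers are made-up placeholders, not the actual MSC Malaysia figures. Strictly speaking this is an unpivot rather than a transpose, which is probably what a tool like Power BI needs here.

```python
# Hypothetical summary-style table: one row per focus area, one column per year.
# The numbers are made-up placeholders, not MSC Malaysia's actual figures.
summary_table = [
    ["Focus area", "2011", "2012", "2013"],
    ["eCommerce", 100, 110, 125],
    ["Cloud and Data Centre", 30, 35, 45],
]


def to_database_format(table, key_name, variable_name, value_name):
    """Unpivot a wide summary table into one record per (key, variable) pair."""
    header, *rows = table
    variables = header[1:]  # every column after the key column
    records = []
    for row in rows:
        key, *values = row
        for variable, value in zip(variables, values):
            records.append({key_name: key, variable_name: variable, value_name: value})
    return records


records = to_database_format(summary_table, "focus_area", "year", "revenue")
for record in records:
    print(record)
```

Each output record has the same three fields, so the result can be loaded as a plain table with a header row and filtered or charted by year or by focus area.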

Anyway, see below the revenue breakdown of MSC Malaysia by Focus Area and Cluster from 2011 to 2017.

[Charts: Cluster 1, Cluster 2, Focus area 1, Focus area 2]

One good thing to note is that revenue has been on an upward trajectory since 2011, both by cluster and by focus area. By focus area, i.e. the last chart, it’s not surprising that eCommerce is the largest segment. But what’s surprising to me is the significant revenue growth in Cloud and Data Centre (“CDC”) in 2016 and 2017. CDC revenue was only about a third of eCommerce revenue in 2011, but in 2017 it was almost half. BDA, which stands for Big Data Analytics, is still relatively small compared to the other focus areas and we have not seen explosive growth in Malaysia yet, but I suspect it will be the next key growth area, at least in the next few years. We’ll see.

Trends in Investment Management

Ronald N. Kahn recently reiterated 7 trends in investment management at the CFA Institute Annual Conference, trends which he highlighted in his book “The Future of Investment Management”. If you are in the investment management industry like me, you should take note of these trends.

You can read the summary here.

If you have more time, you can read the full book here. I have always preferred reading books by practitioners, or learning from them, because they can relate theory to the real world: they are more practical, yet they still draw on theories that seem very complex to the rest of us. Ronald N. Kahn is one of them. He works at BlackRock, the world’s largest asset manager, as MD and Global Head of Scientific Equity Research. So we ought to learn from him.

Among the 7 trends he highlighted, trend number 4 on Big Data is something that every investment professional should pay more attention to.

Do you care that Google knows what you buy?

Do you know that Google knows what you buy? Or rather, Google tracks what you buy. Not everything, because some purchases are made in cash, some by online banking and some by credit card; as long as the receipts don’t get sent to your email, Google won’t be able to capture that spending. So only receipts, invoices or other purchase acknowledgements that are emailed to you are tracked by Google.

Click here to see your purchase history and what Google knows about what you buy:

http://myaccount.google.com/purchases

Here’s a screenshot of my purchase history:

Google has this because I bought the ebooks via amazon.com and all the receipts were sent to my Gmail.

According to this article by CNBC, Google told them that users can turn off the tracking entirely, but it’s not as straightforward as it seems: when CNBC tried to do it, there was no such option.

Do I care if Google knows what I buy? For the time being, no, as I’m not getting any negative side effects. If anything, I feel that the Purchases page on my Gmail account is a useful summary for me.

Facebook opening data up will pave the way for other corporations to follow suit in a more legal and ethical manner

“Facebook will open its data up to academics to see how it impacts elections”

The headline above, seen in the MIT Technology Review Twitter feed, definitely caught my attention as it was timely and related to my post yesterday.

So last week Facebook announced the first researchers who will have access to Facebook’s privacy-protected data, as part of its effort to promote independent research on social media’s role in elections. You can read the announcement here. Basically, Facebook wants to correct the world’s perception of it: to show that its existence makes the world a better place, and that it does not misuse, or unknowingly allow third parties to misuse, its biggest asset, which is its users’ data.

I applaud this initiative, ignoring any political agenda behind it, if there is one. This will actually set the foundation/framework for data sharing because Facebook aims to do it by “ensuring that privacy is preserved and information kept secure” and says that it “acts in accordance with its legal and ethical obligations to the people who use their service”. Whatever they intend to do, they would not compromise people’s privacy. According to the announcement, Facebook has “consulted some of the country’s leading external privacy advisors and the Social Science One privacy committee for recommendations on how best to ensure the privacy of the data sets shared and have rigorously tested their infrastructure to make sure it is secure”.

What’s interesting to me is that they are building a process to remove personally identifiable information (“PII”) from the data set and specifically testing the application of differential privacy, an increasingly used method of protecting individuals in a data set by adding carefully calibrated random noise to query results. In ODI’s report on Anonymisation and Open Data, differential privacy is defined as follows:

Differential privacy is a property of data systems that allows collection of aggregated statistics about a dataset but obfuscates individual records. When queried, a small amount of noise is added to the data such that if any one record were removed, the query result would stay the same. This means those using the data can never be entirely certain about any single person’s data.
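To make the idea of “a small amount of noise” concrete, here is a minimal sketch of one textbook approach, the Laplace mechanism, applied to a simple counting query. This is an illustration only and not a description of what Facebook actually built; the data, the `noisy_count` and `laplace_noise` helpers and the value of `epsilon` are all assumptions. Note that in this sketch the noisy answers with and without one person are only statistically close, rather than literally identical.

```python
import random


def laplace_noise(scale, rng):
    """Draw Laplace(0, scale) noise as the difference of two exponential draws."""
    return rng.expovariate(1 / scale) - rng.expovariate(1 / scale)


def noisy_count(records, predicate, epsilon, rng):
    """Answer 'how many records match?' with noise calibrated to epsilon.

    Adding or removing one record changes a count by at most 1 (sensitivity 1),
    so the Laplace mechanism uses noise with scale 1 / epsilon.
    """
    true_count = sum(1 for record in records if predicate(record))
    return true_count + laplace_noise(1 / epsilon, rng)


# Hypothetical data: hours per day on a social app (made-up values).
hours = [3.2, 5.6, 2.8, 4.1, 0.5, 6.3]
rng = random.Random(42)  # seeded only so the sketch is repeatable
answer = noisy_count(hours, lambda h: h > 3, epsilon=0.5, rng=rng)
print(f"Noisy count of people above 3 hours: {answer:.2f}")
```

A smaller `epsilon` means more noise and stronger privacy; a larger one means answers closer to the truth. Averaged over many queries the noise cancels out, which is why real systems also limit how many questions can be asked.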

If this is deemed successful, it will pave the way for other corporations, specifically the traditional ones who are sitting on customer data, to have the comfort of sharing privacy-protected data with external parties to harness the power of big data. The biggest challenge is to get traditional lawyers, CEOs and senior management to understand that anonymised data is NOT personal data.

Is hashed/anonymised data personal data (part 2)

Ever since I stumbled upon the Open Data Institute, as I mentioned last week, I’ve been following what they do and reading their blog, research papers, opinions etc. I’m thinking about whether I should subscribe as an individual member so that I get to be part of their solution for building an open, trustworthy data ecosystem. Data is a subject close to my heart, as you can read in my other related posts here and here (if you have not, and are new to my blog!).

Anyway, ODI recently talked on their blog about anonymisation and synthetic data, which are techniques for removing anything that identifies individuals from personal data so that it can be shared, either openly or in a closed manner with third parties, for a good use that needs to be defined by data stewards. But the question that I still have today is: from a legal perspective, is anonymised/hashed data still considered personal data? If you ask me, the answer is no, because you can’t simply unearth the person’s name or information from the anonymised data.

Here’s a sample of how raw data is transformed into anonymised data. Say, for example, I have access to the number of hours logged on Instagram by a set of people.

Raw data:

Name ID | Date of Birth | Number of hours logged on Instagram per day
Melissa D | 1/1/1988 | 3.2
Ali Muthu | 5/24/1993 | 5.6
Abigail | 8/15/1976 | 2.8

Anonymised data:

Name ID | Age-Range | Number of hours logged on Instagram per day
507581 | 30-35 | 3.2
699393 | 25-30 | 5.6
769250 | 40-45 | 2.8

Now tell me, from the anonymised table above, how can I tell that Melissa D spent 3.2 hours a day on Instagram? I can’t, hence it shouldn’t be classified as personal data and hence it can be shared. But I can’t overrule the law, can I? Based on what I’ve read so far, the answer is still very vague, but it’s not as rigid as we thought.

Below is the definition in the most stringent law on personal data, which I extracted from ODI’s report on Anonymisation and Open Data:
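The transformation from the raw table to the anonymised one can be sketched in a few lines of Python. This is only one possible way to do it, under stated assumptions: the `anonymise` and `age_range` helpers are hypothetical, the IDs are random 6-digit numbers rather than true hashes of the names, and the reference date used to compute ages is a guess chosen so that the bands match the table.

```python
import secrets
from datetime import date


def age_range(date_of_birth, on_date, width=5):
    """Return a coarse band such as '30-35' instead of an exact date of birth."""
    had_birthday = (on_date.month, on_date.day) >= (date_of_birth.month, date_of_birth.day)
    age = on_date.year - date_of_birth.year - (0 if had_birthday else 1)
    lower = age // width * width
    return f"{lower}-{lower + width}"


def anonymise(rows, on_date):
    """Swap each name for a random 6-digit ID and each birth date for an age band."""
    used_ids = set()
    anonymised = []
    for row in rows:
        # Random (not derived from the name), so the ID cannot be reversed.
        new_id = secrets.randbelow(900_000) + 100_000
        while new_id in used_ids:
            new_id = secrets.randbelow(900_000) + 100_000
        used_ids.add(new_id)
        anonymised.append({
            "id": new_id,
            "age_range": age_range(row["date_of_birth"], on_date),
            "hours_per_day": row["hours_per_day"],
        })
    return anonymised


raw = [
    {"name": "Melissa D", "date_of_birth": date(1988, 1, 1), "hours_per_day": 3.2},
    {"name": "Ali Muthu", "date_of_birth": date(1993, 5, 24), "hours_per_day": 5.6},
    {"name": "Abigail", "date_of_birth": date(1976, 8, 15), "hours_per_day": 2.8},
]

# The reference date is an assumption, chosen so the bands match the table above.
for row in anonymise(raw, on_date=date(2019, 5, 1)):
    print(row)
```

The design choice worth noting is the random ID. A plain hash of a name can be reversed by hashing a list of likely names and comparing, whereas a random ID carries no information about the person unless someone keeps the lookup table.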

The EU General Data Protection Regulation (GDPR) defines personal data as:

any information relating to an identified or identifiable natural person (‘data subject’); an identifiable natural person is one who can be identified, directly or indirectly, in particular by reference to an identifier such as a name, an identification number, location data, an online identifier or to one or more factors specific to the physical, physiological, genetic, mental, economic, cultural or social identity of that natural person

The UK’s Office for National Statistics defines private information as: information that

  • relates to an identifiable legal or natural person, and
  • is not in the public domain or common knowledge, and

The key word here is identifiable. If the data is not identifiable, then it should no longer be deemed personal data; at least that’s how I would interpret it.

Why does this question matter, and why does the answer need to be transparent so that corporations can share more data for value creation? So that we don’t need to spend hours arguing with our traditional lawyers haha.

Jokes aside, it matters so that we can focus on the real work rather than on the risk we carry by processing, sharing and using the data. It matters because you can adopt all kinds of techniques or tools to reduce the risk of personal data re-identification, but if the law remains vague, it’s difficult for traditional-mindset companies to innovate and harness the power of big data in this so-called data-hungry world.

ODI’s report on Anonymisation and Open Data also highlighted that synthetic data, which is created by an automated process such that it holds similar statistical patterns to an original dataset, can contain no personal data even though it is based on a dataset that holds personal data. The automated process can be done by a machine learning method called deep learning, a method which has gained fame in recent years and is used by some of the big players in the US and China (and, as you may know, the method which can make or break self-driving cars). If this holds true, then I would have the answer to my question on whether anonymised data is personal data.

Data Trust

Stumbled upon an article on FT on New Institutions are Needed for Digital Age which mentioned that “Open Data Institute and others are exploring data trusts — where control over data-sharing is transferred to an independent third party, legally bound to ensure its use for a defined purpose”. This then led me to a report published recently by Open Data Institute on data trusts which summarizes the framework and findings from their first in-depth study on role of data trusts.

While the concept/term may not be new to us, the interest/appetite is definitely growing among governments and large corporations that want to create more value through data sharing without worrying about privacy issues. Below is a screenshot of the data trust framework for easy reference.

[Image: data trust.PNG]

Their definition of a data trust is independent of technology architecture: centralized or decentralized platform, cloud hosting, blockchain or not, it doesn’t matter, as long as it is technically flexible enough for our changing needs.

If you are interested, you can read more about it here and the report here.

A Must Watch to Understand the Big Picture of AI

A TED Talk by Kai-Fu Lee, the AI expert and my newfound “love” among geeks/experts, who caused me to switch from my current read to his autobiography “My Journey into AI”. Read my previous post.

After you watch this video, you will better understand the importance of learning about and embracing AI, because AI will be embedded in our lives much more in the next 10-15 years, or even 5 years, depending on the level of sophistication. I feel that it needs to be part of the students’ curriculum.

How Today’s Readings Led Me to AI Superpowers

I love to reflect on the journey of how I discover things (which include ideas and theories, although most of the time people) that enlighten me, inspire me and engage me, and which then lead or connect me to another dot. This is especially so when I land on someone successful who makes me really curious about him/her and makes me believe in him/her; essentially, I follow him/her online just to get a glimpse of his/her mind, work and everyday life. The latter sounds like I’m a stalker, but you get my point, right? For example, how I ended up admiring Fred Wilson, which then led me to start this blog, how I ended up admiring Melinda Gates for her work on data and women’s empowerment, and many, many more. I actually have a list.

Today, I discovered another person who will be on my “stalking” list. He is Kai-Fu Lee, a Taiwanese VC and, most importantly, an AI expert. In his capacity as an AI expert, he wrote a book called AI Superpowers, where he focuses on how AI will save humanity and how humans can take advantage of AI instead of losing our jobs to it.

So how did I discover him? It was through The Algorithm, a newsletter I subscribe to because I enjoy reading the views of Karen Hao (an MIT Technology Review journalist). In this week’s newsletter she wrote about how TikTok, the up-and-coming social media platform (from China), is replacing our free will with algorithms. I got really curious about this famous app after learning that Andreessen Horowitz thinks it is unique as the first AI-based consumer app. Even people in the US are crazy about it and it has 500m users already! I’m trying to understand what this app offers that differs from other social media platforms, which leverage AI to curate what customers want or need. Apparently for TikTok, the product itself is based on AI. Anyway, this is a story for another day.

From the newsletter, I clicked through to Andreessen Horowitz’s blog to read their take on TikTok and the rise of AI-based consumer apps. You can read it here. At the end of his post, the Andreessen Horowitz general partner mentioned the book AI Superpowers, written by Kai-Fu Lee. I googled it, found the author’s website and got immediately hooked.

Apart from his in-depth knowledge of AI, he also shares a quiz we can take to see if our job is at risk of being replaced by AI and to learn about our own human superpowers so that we can thrive in the future. I thought it was a great quiz and would highly recommend that everyone take it. Here are my results:

PHEW!

The assessment of me is actually quite accurate, I would say. And apparently it says I have a spontaneous personality, with characteristics that AI can’t imitate. See below.

100% agreed, except for the communication part, which I’m still working on.

Anyway, that’s the story of how I came across AI Superpowers and its author. It’s definitely going on my to-read list.