* Arthur Conan Doyle, The Adventures of Sherlock Holmes #4: The Boscombe Valley Mystery
The purpose of this series of big data analytics posts is to answer a hypothetical question of mine:
“Is it possible for a person (that’s me) with near-zero knowledge and experience of an industry (the jewelry industry) to build a ‘virtual expertise’ by applying data science techniques to related data (Twitter data)?”
On Twitter, users are the content creators, and the content is text of up to 280 characters with hashtags, URLs and pictures. To get data related to our ‘jewelry’ concept, we collect content containing the jewelry keyword by searching over a particular time period.
We collected 326,923 tweets containing the jewelry keyword over a period of 15 days, from 10 August 2018 to 25 August 2018. That content was created by 113,344 users.
When we look at the tweets structurally, there are two main types of content:
1) Passively created content: Retweets (RT@username)
2) Actively created content: Mentioned tweets (@username) and tweets which do not mention anyone.
Here, another type of pattern emerges:
1) Content with direct connections between users: Retweets and mentioned tweets.
2) Content with no connections between users: Tweets which do not mention anyone.
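The two groupings above can be sketched in a few lines of Python. This is only an illustration under assumed conventions: a retweet is taken to be any text starting with "RT @username" (or "RT@username"), a mention is any other "@username", and the names `classify_tweet` and `has_direct_connection` are hypothetical, not part of the original workflow.

```python
import re

# Assumed conventions: retweets start with "RT @user"; any other "@user" is a mention.
RETWEET_RE = re.compile(r"^\s*RT\s*@\w+", re.IGNORECASE)
MENTION_RE = re.compile(r"(?<!\w)@\w+")  # the lookbehind skips e-mail addresses


def classify_tweet(text: str) -> str:
    """Return 'retweet', 'mention', or 'plain' for a raw tweet text."""
    if RETWEET_RE.search(text):
        return "retweet"
    if MENTION_RE.search(text):
        return "mention"
    return "plain"


def has_direct_connection(text: str) -> bool:
    """Retweets and mentions connect users directly; plain tweets do not."""
    return classify_tweet(text) != "plain"
```

Retweets map to passively created content, while mentions and plain tweets are actively created; only plain tweets carry no direct connection between users.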
At this point, we’ll move on with the content with no connections, i.e. tweets which do not mention anyone. Why? Because if we mix the two groups before jumping into data analysis, the valuable insights coming from both sides will end up as a fuzzy and incomprehensible mixture of results. Retweets and mentions should be analysed separately. Although analysing retweets and mentions with different kinds of techniques would give us valuable insight, that is the subject of another data analysis project.
The main reason why we move on with tweets with no connections is to make the analysis more closely resemble “market research” on the jewelry industry. This way we collect people’s opinions on a specific topic even more naturally than any market research does. In market research, there are interviews with predefined questionnaires. Interviewees are forced to answer specific questions with restricted choices like: “What do you think about this topic? Choose A, B, or C.”, “Agree or disagree?”, “Choose a number from 1 to 10.”, etc. Questionnaires must be designed this way; otherwise it would be very difficult to run a statistical analysis on the interview results. In our case, we don’t ask anyone any question about the jewelry industry. People share their real thoughts willingly, as most of the content is created for advertising and marketing purposes.
However, Twitter data has its own disadvantages:
1) Most of the tweets are not meaningful sentences. They are fuzzy and incomplete in meaning.
2) A considerable number of tweets are spam, which decreases data quality.
But overcoming these two disadvantages will be our greatest driver in building our tactical solution. We haven’t built our analytic strategy yet. We will build it bottom-up, not top-down. A strategy built without data and inferences from data would be a volatile one.
And for the last part of our data selection process, I choose tweets with at least one hashtag which are neither retweets nor tweets that mention someone. As most of the tweet texts are incomplete and meaningless sentences, I focus on the most meaningful part: the #hashtag.
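A minimal sketch of this selection step, assuming the raw data is available as (user, text) pairs. The regular expressions and the function names `extract_hashtags` and `select_hashtag_tweets` are illustrative assumptions; real tweet parsing has more edge cases than these patterns cover.

```python
import re

HASHTAG_RE = re.compile(r"#(\w+)")
RETWEET_RE = re.compile(r"^\s*RT\s*@\w+", re.IGNORECASE)  # assumed retweet marker
MENTION_RE = re.compile(r"(?<!\w)@\w+")                   # assumed mention marker


def extract_hashtags(text):
    """Lowercased hashtags in order of first appearance, without duplicates."""
    seen = []
    for tag in HASHTAG_RE.findall(text):
        tag = tag.lower()
        if tag not in seen:
            seen.append(tag)
    return seen


def select_hashtag_tweets(tweets):
    """Keep (user, text, hashtags) for tweets that are not retweets,
    mention nobody, and carry at least one hashtag."""
    selected = []
    for user, text in tweets:
        if RETWEET_RE.search(text) or MENTION_RE.search(text):
            continue
        tags = extract_hashtags(text)
        if tags:
            selected.append((user, text, tags))
    return selected
```

Lowercasing the hashtags treats #Handmade and #handmade as one concept, which is a design choice of this sketch rather than something the data collection requires.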
Our dataset is an 11.8% portion of the main volume. It does not contain any retweets or mentions, and every tweet has at least one hashtag. Its 38,474 tweets were created by 8,480 users and contain 16,898 unique hashtags.
Before starting the analysis, we’ll consider what industry authorities and experienced consultants think about the jewelry industry and what their predictions are. Will there be a match or a mismatch between the findings of their research reports and our data analytics?
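The summary figures of such a dataset are simple counts. The sketch below assumes the selected tweets are available as (user, hashtags) pairs; `summarize` is a hypothetical helper, and the last lines only re-check the reported share from the two totals given in this post.

```python
def summarize(selected, total_tweets):
    """selected: iterable of (user, hashtags) pairs.
    total_tweets: size of the full collection before filtering."""
    selected = list(selected)
    users = {user for user, _ in selected}
    hashtags = {tag for _, tags in selected for tag in tags}
    return {
        "tweets": len(selected),
        "users": len(users),
        "unique_hashtags": len(hashtags),
        "share_pct": round(100 * len(selected) / total_tweets, 1),
    }


# Re-checking the share from the totals reported in the post.
share = round(100 * 38_474 / 326_923, 1)
print(share)  # 11.8
```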
One of the most cited reports is McKinsey & Company’s research generated in 2013 with predictions towards 2020: “A multifaceted future: The Jewelry Industry in 2020”. Here are some passages from this report:
1. “Jewelry players can’t simply do business as usual and expect to thrive; they must be alert and responsive to important trends and developments or else risk being left behind by more agile competitors.” – We will explore macro, micro, even nano trends (personalized insights) through our analytics flow and share some of them as case studies with jewelry designers & makers.
2. “While branded jewelry accounts for only 20 percent of the overall jewelry market today, its share has doubled since 2003. All executives we interviewed believe branded jewelry will claim a higher share of the market by 2020, but their views differ on how quickly this shift will occur. Most expect that the branded segment will account for 30 to 40 percent of the market in 2020.” – We’ll build macro and micro “concept categories” for suitable brand positioning by evolutionary hashtag segmentation.
3. “In the past, most of the growth in branded jewelry came from the expansion of established jewelry brands, such as Cartier and Tiffany & Co., and new entrants such as Pandora and David Yurman. By contrast, future growth in branded jewelry is likely to come from nonjewelry players in adjacent categories such as high-end apparel or leather goods—companies like Dior, Hermès, and Louis Vuitton—introducing jewelry collections or expanding their assortment.” – We’ll investigate the relationships between our concept categories and build inferences about most probable “concept partnership” opportunities.
4. “According to a recent McKinsey survey, two-thirds of luxury shoppers say they engage in online research prior to an in-store purchase; one- to two-thirds say they frequently turn to social media for information and advice.” – Here, we can deduce that inferences generated from our Twitter data also affect the minds of offline luxury shoppers.
5. “…Furthermore, the previously clear-cut boundaries between fine jewelry (characterized by the use of precious metals and stones) and fashion jewelry (typically made of plated alloys and crystal stones) are starting to blur…Industry insiders expect that segments will increasingly be defined by price points and brand positions rather than purchase and wearing occasions…In light of this trend, fine jewelers might consider introducing new product lines at affordable prices to entice younger or less affluent consumers, giving them an entry point into the brand.” – We’ll explore and give some examples of potential new product combination opportunities between fine jewelry and fashion jewelry by examining and benchmarking our related hashtag segments.
Now we can start analyzing our data. We said that our dataset holds 38,474 tweets created by 8,480 users. Imagine we build a virtual personality out of these 8,480 users and call her Sophia, since most of the Twitter users here are female. We want to explore which concepts are the main drivers of content creation on the jewelry topic. What’s going on in Sophia’s ‘collective mind’ if we ask her: Sophia, what do you think about the ‘jewelry’ concept? The answer would be:
Would it be safe to use these results in your advertising and marketing campaigns? I don’t think so…
Results of word counts or word clouds are only statistical summaries. Although they give some useful general information, they do not give insights for targeted action… and worse, if you act on these static results, you will most probably find yourself trapped with negative outcomes…
Just as a state bases its decision to act on all of its dynamic international relations, not only on static statistical summaries, you should take into account dynamic inter-concept relationships, which have two core qualities:
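For reference, the kind of static summary criticised here takes only a few lines to produce. This sketch assumes each tweet has already been reduced to its list of hashtags; `hashtag_counts` is a hypothetical name.

```python
from collections import Counter


def hashtag_counts(tweet_hashtags):
    """Count in how many tweets each hashtag appears (case-insensitive)."""
    counts = Counter()
    for tags in tweet_hashtags:
        # A set ensures a hashtag repeated inside one tweet is counted once.
        counts.update({tag.lower() for tag in tags})
    return counts


counts = hashtag_counts([
    ["jewelry", "fashion"],
    ["jewelry", "Fashion", "fashion"],
    ["vintage"],
])
```

Calling `counts.most_common(20)` on the real data would give the input for a bar chart or word cloud, and nothing more: it says how often a hashtag appears, not how it relates to the others.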
1) Direction of relationship
2) Intensity of relationship
Let me explain the two ‘statistically attractive’ poisonous apples hidden in the bar chart:
1. If you try to attract users to the jewelry main concept with the ‘etsy’ keyword (which means you recommend that users use both keywords together in their messages, i.e. tweets), then you will end up with the negative effect of causing users to create 40.0% less content about jewelry!
What do these results mean? Is Etsy as a brand unsuccessful at brand positioning? No! Just the opposite…
Generally, if a concept is negatively correlated with some concept(s), then it is positively correlated with some other concept(s). Can you guess which one Etsy is positively correlated with? It is ‘handmade’.
And here is the mathematical proof of concept:
If you try to attract users to the ‘handmade’ main concept with the ‘etsy’ keyword under the jewelry domain, then you will end up with the positive effect of causing users to create 21.7% more content about handmade! This relationship ranks first among the positively correlated ones for ‘handmade’.
As you can see, the brand is quite successful in its positioning, since it defines itself mostly with the “handmade” concept.
At this point, one might ask: “What do you mean by these results? Is ‘handmade’ not positively related to jewelry products?!”
Yes, it is; we’ll see such examples in later stages of our analytic workflow. ‘Handmade’ is negatively correlated with the main jewelry concept, but it is positively correlated with jewelry sub-concepts, for instance with ‘bracelet’. The handmade main region does not contain only jewelry products; there are bags, purses, leather products, etc. And the jewelry main region is not attached only to handmade jewelry; there is branded jewelry as well.
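The post does not state how its percentages are computed, so the following is not the author's formula. One common way to express the direction of a relationship between two hashtags is lift from association analysis: the observed co-occurrence rate divided by the rate expected if the two hashtags were independent. Lift minus one, as a percentage, reads as "x% more" or "x% less" than expected. The function name `lift_pct` and the toy data are assumptions for illustration only.

```python
def lift_pct(tweet_hashtags, a, b):
    """Percent by which hashtags a and b co-occur more (+) or less (-)
    often than they would if they were independent (lift - 1, in %)."""
    n = n_a = n_b = n_ab = 0
    for tags in tweet_hashtags:
        tags = set(tags)
        n += 1
        has_a, has_b = a in tags, b in tags
        n_a += has_a
        n_b += has_b
        n_ab += has_a and has_b
    if n_a == 0 or n_b == 0:
        raise ValueError("both hashtags must occur at least once")
    lift = (n_ab * n) / (n_a * n_b)
    return 100 * (lift - 1)


# Toy data, not taken from the real dataset.
toy = [
    ["etsy", "handmade"],
    ["etsy", "handmade", "jewelry"],
    ["jewelry", "fashion"],
    ["jewelry"],
    ["handmade"],
    ["jewelry", "fashion"],
]
etsy_handmade = lift_pct(toy, "etsy", "handmade")
etsy_jewelry = lift_pct(toy, "etsy", "jewelry")
```

In this toy data the etsy/handmade value is positive and the etsy/jewelry value is negative, the same pattern of direction described above, even though etsy and jewelry do appear together in one tweet.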
So, after eliminating negative correlations with jewelry main concept from our bar chart, we have the remaining words:
Do you think you can start to use these results directly? I haven’t finished yet. We investigated the first part of our equation: Direction of relationships.
The second part of our equation is: Intensity of relationships. The order of importance of these concepts is not what it looks like… At this point, we will pay attention to another dimension of the data: the effect of users. Normally, the intensity of a relationship in this kind of data analysis project is measured with tweet counts. Let’s move on with an example and compare the data mining results of the top two hashtags, vintage and fashion:
Case 1) vintage and jewelry hashtags co-occurred in 3,233 tweets in our dataset.
Case 2) fashion and jewelry hashtags co-occurred in 2,786 tweets in our dataset.
Both co-occurrences are positively correlated with the jewelry concept, at rates close to each other.
Up to this point, you can repeat the analyses I’ve made and get similar results with some kind of commercial data mining software (or by mixing and manipulating some open source packages, if you know how to code and understand the logic of the underlying algorithms) and/or with consultancy. But from this point on, we slowly put our own fingerprints on subsequent locations in the data analytics workflow.
For every concept relationship, we consider the related “user intensity” in addition to tweet density. Why? If our dataset were a supermarket transaction database, we wouldn’t care so much about how many unique users co-purchased the related products, as long as the baskets were full and visitor frequencies, i.e. basket counts, were high. But this is not the case in text mining. If we do not consider the effect of unique users on the generated text body, then the probability of being exposed to spammer effects will increase. Now, back to our example cases with user intensities added.
Case 1) vintage and jewelry hashtags co-occurred in 3,233 tweets created by 139 users in our dataset.
Case 2) fashion and jewelry hashtags co-occurred in 2,786 tweets created by 431 users in our dataset.
Both cases are valuable, but which one would you choose if you had only one choice? I would choose the second one: the fashion and jewelry co-occurrence. The first one is like running a survey in a county; the second one is like running a survey in a province. The latter would give a more solid result about general public opinion… Here, the latter gives a more important insight about the jewelry industry in general.
When we re-order our top hashtags with this combined logic, the order of importance of the top 6 hashtags is as follows:
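Counting unique users next to tweets is straightforward once each tweet is kept as a (user, hashtags) pair. The sketch below is an assumption-laden illustration: `cooccurrence_stats` and `rank_by_users` are hypothetical names, and ranking by unique users first with tweet count as a tie-breaker is one simple rule, not necessarily the combined logic used in this analysis.

```python
from collections import defaultdict


def cooccurrence_stats(records, target="jewelry"):
    """records: iterable of (user, hashtags) pairs.
    Returns {hashtag: (tweet_count, unique_user_count)} for hashtags
    that appear in the same tweet as `target`."""
    tweet_counts = defaultdict(int)
    users = defaultdict(set)
    for user, tags in records:
        tags = {tag.lower() for tag in tags}
        if target not in tags:
            continue
        for tag in tags - {target}:
            tweet_counts[tag] += 1
            users[tag].add(user)
    return {tag: (tweet_counts[tag], len(users[tag])) for tag in tweet_counts}


def rank_by_users(stats):
    """Order hashtags by unique users, then by tweet count (assumed rule)."""
    return sorted(stats, key=lambda tag: (stats[tag][1], stats[tag][0]), reverse=True)


# Toy data: one account repeating the same tweet versus three distinct users.
records = [("spammer", ["jewelry", "vintage"])] * 5 + [
    ("u1", ["jewelry", "fashion"]),
    ("u2", ["jewelry", "fashion"]),
    ("u3", ["jewelry", "Fashion"]),
]
stats = cooccurrence_stats(records)
ranking = rank_by_users(stats)
```

In the toy data, vintage leads on tweet count but fashion leads on unique users, so fashion ranks first. By the post's own figures the same contrast appears at scale: roughly 23 tweets per user for vintage (3,233 / 139) against roughly 6.5 for fashion (2,786 / 431).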
Result of Part I:
In order to maximize your chances of communicating with users about the ‘jewelry’ main concept, you can attract them first with the ‘fashion’ main concept, and then mention necklace, earring, ring and bracelet as jewelry types, in that order of importance.
I’ll continue in my second post:
“Data Science on Jewelry Industry – Part II: Neuroscience on Building Strategy”