The Data:

The Movie data can be found in this Kaggle Dataset and came with the following features:

  1. name; object
  2. rating; object
  3. genre; object
  4. year; int64
  5. released; object
  6. score; float64
  7. votes; float64
  8. director; object
  9. writer; object
  10. star; object
  11. country; object
  12. budget; float64
  13. gross; float64
  14. company; object
  15. runtime; float64

Given this, I suspect that the features most correlated with gross (the revenue made by the film) will be budget, score, and votes. I suspect budget because a large budget usually means more money allocated to advertising. A high score implies the film was well received, which should entice more people to see it. Lastly, the votes feature is the number of people who left a score, so more votes implies that more people have seen the film.
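
A quick way to check a guess like this is to rank the numeric features by their Pearson correlation with gross. The sketch below is only an illustration: the DataFrame holds made-up numbers standing in for the Kaggle data, and the helper name correlation_with_gross is my own.

```python
import pandas as pd

# Toy stand-in for the Kaggle data; every number here is made up for illustration.
movies = pd.DataFrame({
    "budget": [10e6, 50e6, 100e6, 150e6, 200e6, 30e6],
    "score": [6.1, 7.4, 5.2, 8.0, 6.8, 7.1],
    "votes": [20e3, 150e3, 300e3, 900e3, 700e3, 60e3],
    "runtime": [95.0, 110.0, 102.0, 140.0, 128.0, 99.0],
    "gross": [15e6, 120e6, 210e6, 800e6, 650e6, 40e6],
})


def correlation_with_gross(df: pd.DataFrame) -> pd.Series:
    """Pearson correlation of each numeric column with gross, highest first."""
    numeric = df.select_dtypes(include="number")
    corr = numeric.corr()["gross"].drop("gross")
    return corr.sort_values(ascending=False)


ranking = correlation_with_gross(movies)
print(ranking)
```

On the real dataset, the same call would be made on the loaded DataFrame instead of the toy one.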

Preprocessing/Cleaning

Most features had hardly any missing values. In fact, I still had 96% of the original dataset even after dropping rows with missing values.

Unfortunately, the one feature that I was most interested in, the budget of the film, had about 27% missing values. I decided to approach the problem in three different ways and compare the results:

  1. Leave missing values as Null
  2. Replace missing values with 0
  3. Impute missing values with the average
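
The three treatments above can be sketched in pandas roughly as follows. This is a minimal example on a toy DataFrame with made-up numbers; the helper name budget_variants is hypothetical and not taken from the original notebook.

```python
import numpy as np
import pandas as pd

# Toy rows with made-up numbers; NaN marks a missing budget.
movies = pd.DataFrame({
    "budget": [10e6, np.nan, 100e6, np.nan, 200e6, 30e6],
    "gross": [15e6, 120e6, 210e6, 800e6, 650e6, 40e6],
})


def budget_variants(df: pd.DataFrame) -> dict:
    """Return one copy of df per missing-value treatment of the budget column."""
    observed_mean = df["budget"].mean()  # mean() skips NaN by default
    return {
        "null": df.copy(),                                     # 1. leave as Null
        "zero": df.assign(budget=df["budget"].fillna(0)),      # 2. replace with 0
        "mean": df.assign(budget=df["budget"].fillna(observed_mean)),  # 3. impute the average
    }


variants = budget_variants(movies)
for name, frame in variants.items():
    print(name, int(frame["budget"].isna().sum()), frame["budget"].mean())
```

Each variant is a separate copy, so the original DataFrame keeps its missing values and the three results can be compared side by side.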

Heatmaps

If we treat leaving the Null values as our control, replacing them with zero had an interesting result: budget's correlations with the more highly correlated features were strengthened, while its correlations with the less correlated features were reduced. The rest of the heatmap is identical, which is expected because only the budget column changed. Imputing the mean only served to reduce budget's correlations with the highly correlated features.
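
The comparison can be reproduced numerically without drawing a heatmap. The sketch below uses synthetic data (a made-up linear relationship between budget and gross, with roughly 27% of budgets removed at random), so its numbers say nothing about the real dataset. It does show one general property: with Pearson correlation, mean imputation cannot raise the absolute correlation compared with ignoring the missing rows, because an imputed row adds nothing to the covariance with budget but still adds to the spread of the other feature. The direction of the zero-fill change depends on the data.

```python
import numpy as np
import pandas as pd

# Synthetic data: the relationship and noise level are assumptions for illustration.
rng = np.random.default_rng(0)
n = 500
budget = rng.uniform(1e6, 2e8, size=n)
gross = 2.5 * budget + rng.normal(0, 5e7, size=n)
movies = pd.DataFrame({"budget": budget, "gross": gross})

# Remove roughly 27% of the budgets at random to mimic the missing share.
movies.loc[rng.random(n) < 0.27, "budget"] = np.nan


def budget_gross_corr(df: pd.DataFrame) -> pd.Series:
    """Pearson correlation of budget with gross under each treatment."""
    b = df["budget"]
    return pd.Series({
        "null": b.corr(df["gross"]),                   # NaN pairs are skipped
        "zero": b.fillna(0).corr(df["gross"]),
        "mean": b.fillna(b.mean()).corr(df["gross"]),
    })


corrs = budget_gross_corr(movies)
print(corrs)
```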

All Features

Out of curiosity, I converted all object features to numeric values and placed them all into a heatmap. However, none of the categorical features is highly correlated with budget.
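
One common way to do such a conversion is to replace each category with an integer code. The post does not say which method was used, so the sketch below is an assumption, shown on a toy DataFrame with made-up rows; encode_objects is a hypothetical helper name.

```python
import pandas as pd

# Toy rows with made-up values, standing in for the real dataset.
movies = pd.DataFrame({
    "genre": ["Comedy", "Action", "Drama", "Action"],
    "rating": ["PG", "PG-13", "R", "PG-13"],
    "budget": [10e6, 150e6, 30e6, 200e6],
})


def encode_objects(df: pd.DataFrame) -> pd.DataFrame:
    """Replace every non-numeric column with integer category codes."""
    out = df.copy()
    for col in out.select_dtypes(exclude="number").columns:
        # Categories are sorted alphabetically; missing values would become -1.
        out[col] = out[col].astype("category").cat.codes
    return out


encoded = encode_objects(movies)
print(encoded.corr()["budget"])
```

The codes are arbitrary labels (here, alphabetical order), so a low correlation with them says little about whether the category itself matters.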

Conclusion

I was correct that votes and budget are highly correlated with the gross of a film. However, score turned out to have a very low correlation with gross. This suggests a film doesn’t have to be well received to gross a large amount, or vice versa. One film that comes to mind as an example of the former case is The Emoji Movie:

Score: 3.3; Gross: $217 million