Recently, mixed-integer programming (MIP) techniques have been applied to learn optimal decision trees. Empirical research has shown that optimal trees typically have better out-of-sample performance than trees produced by heuristic approaches such as CART. However, the underlying MIP formulations often suffer from weak linear programming (LP) relaxations. Many existing MIP approaches employ big-M constraints to ensure that observations are routed through the tree in a feasible manner. This paper introduces new MIP formulations for learning optimal decision trees with multivariate branching rules and no assumptions on the feature types. We first propose a strong baseline MIP formulation that, while still using big-M constraints, yields a stronger LP relaxation than its counterparts in the literature. We then introduce a problem-specific class of valid inequalities called shattering inequalities. Each inequality encodes an inclusion-minimal set of points that cannot be shattered by a multivariate split, and in the context of a MIP formulation the inequalities are sparse: each involves at most the number of features plus two variables. We propose a separation procedure that attempts to find a violated inequality given a (possibly fractional) solution to the LP relaxation; when the solution is integer, the separation is exact. Numerical experiments show that our MIP approach outperforms two other MIP formulations in terms of solution time and relative gap, and improves solution time while remaining competitive in out-of-sample accuracy against a wider range of approaches from the literature.
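To make the notion of shattering concrete, the following is a minimal sketch (not the paper's implementation) of the exact separation idea for integer solutions: an LP feasibility check decides whether a dichotomy of points can be realized by a multivariate split of the form a·x ≥ b, and a greedy deletion filter shrinks a non-realizable dichotomy to an inclusion-minimal non-shatterable set. The function names and the margin-1 reformulation of strict separation are illustrative assumptions.

```python
import numpy as np
from scipy.optimize import linprog

def is_separable(left, right):
    """Return True if some hyperplane a.x >= b routes every point in
    `right` to the right branch and every point in `left` to the left.
    Strict separation is equivalent, after rescaling (a, b), to
    separation with margin 1, so we test LP feasibility with unit margin."""
    if len(left) == 0 or len(right) == 0:
        return True  # an empty side is trivially realizable
    left, right = np.atleast_2d(left), np.atleast_2d(right)
    p = left.shape[1]
    # Constraints: a.x - b <= -1 for left points, -a.x + b <= -1 for right.
    A_ub = np.vstack([
        np.hstack([left, -np.ones((len(left), 1))]),
        np.hstack([-right, np.ones((len(right), 1))]),
    ])
    b_ub = -np.ones(len(left) + len(right))
    res = linprog(c=np.zeros(p + 1), A_ub=A_ub, b_ub=b_ub,
                  bounds=[(None, None)] * (p + 1), method="highs")
    return res.status == 0  # 0 = feasible/optimal, 2 = infeasible

def minimal_unshatterable_subset(left, right):
    """Greedy deletion filter: drop points one at a time as long as the
    remaining dichotomy stays non-separable, yielding an inclusion-minimal
    set of points that cannot be shattered by a multivariate split."""
    assert not is_separable(left, right)
    left, right = list(left), list(right)
    for side in (left, right):
        i = 0
        while i < len(side):
            candidate = side[:i] + side[i + 1:]
            L = candidate if side is left else left
            R = candidate if side is right else right
            if not is_separable(np.array(L), np.array(R)):
                side[:] = candidate  # point is redundant for infeasibility
            else:
                i += 1  # point is needed; keep it
    return np.array(left), np.array(right)
```

In a branch-and-cut scheme, each minimal set returned this way would translate into one sparse valid inequality cutting off the current integer solution; the details of that translation, and the heuristic handling of fractional solutions, are specific to the paper's formulations.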