The objective of extreme multi-label classification is to annotate an instance with the most relevant subset of labels from an extremely large label set. It has found applications in domains such as tagging, annotation, and recommendation. However, most successful algorithms in this space use bag-of-words features, which lack both semantic and syntactic information. In addition, bag-of-words features do not work well with short text such as search queries. Deep learning has shown promising results in text classification, language modeling, computer vision, and speech. Its success lies in learning rich representations from raw data such as text and images. However, current deep learning based approaches perform significantly worse than the state of the art in the extreme classification setting. These techniques also suffer from long training times and scalability issues. Our proposed method exploits word semantics and the distinct behavior of head and tail labels. It performs better than current state-of-the-art techniques and can make predictions in milliseconds.