Package hashformers


Hashtag segmentation is the task of automatically adding spaces between the words in a hashtag.
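To make the task concrete, here is a toy dictionary-based baseline. It is not how hashformers works (the library relies on a GPT-2 segmenter and an optional BERT reranker); it is only a minimal sketch of what "adding spaces between the words" means. The function name `segment_with_vocabulary` and the tiny vocabulary are hypothetical and exist only for this illustration.

```python
from typing import List, Optional, Set


def segment_with_vocabulary(hashtag: str, vocabulary: Set[str]) -> Optional[List[str]]:
    """Split a hashtag into known words, preferring the split with the fewest words.

    Toy baseline for illustration only; hashformers itself uses language models.
    Returns None when the hashtag cannot be covered by vocabulary words.
    """
    text = hashtag.lstrip("#").lower()
    n = len(text)
    # best[i] holds the fewest-words split of text[:i], or None if no split exists.
    best: List[Optional[List[str]]] = [None] * (n + 1)
    best[0] = []
    for end in range(1, n + 1):
        for start in range(end):
            prefix = best[start]
            word = text[start:end]
            if prefix is None or word not in vocabulary:
                continue
            candidate = prefix + [word]
            current = best[end]
            if current is None or len(candidate) < len(current):
                best[end] = candidate
    return best[n]


# Hypothetical vocabulary, just large enough for the two hashtags below.
TOY_VOCABULARY = {"we", "need", "a", "national", "park", "ice", "cold"}

for tag in ["#weneedanationalpark", "#icecold"]:
    words = segment_with_vocabulary(tag, TOY_VOCABULARY)
    print(tag, "->", " ".join(words) if words else None)
```

A fixed vocabulary like this breaks down on names, slang and other languages, which is the gap that language-model-based segmentation is meant to close.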

Hashformers is the current state of the art for hashtag segmentation. On average, hashformers is about 10 percentage points more accurate than the second-best hashtag segmentation library (more details in the docs).

Hashformers is also language-agnostic: you can use it to segment hashtags not just in English, but also in any language with a GPT-2 model on the Hugging Face Model Hub.

✂️ Read the documentation

✂️ Segment hashtags on Google Colab

✂️ Follow the step-by-step tutorial

Basic usage

from hashformers import WordSegmenter

ws = WordSegmenter(
    segmenter_model_name_or_path="gpt2",
    reranker_model_name_or_path="bert-base-uncased"
)

segmentations = ws.segment([
    "#weneedanationalpark",
    "#icecold"
])

print(segmentations)

# [ 'we need a national park',
# 'ice cold' ]

For more information, read the documentation for the WordSegmenter object.
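A common follow-up is to put the segmented words back into the text the hashtags came from. The sketch below assumes only what the example above shows: that `ws.segment` takes a list of hashtags and returns a list of strings in the same order. The helper `expand_hashtags` is hypothetical and not part of hashformers; it accepts any callable with that shape, so it can be tried with a stand-in before a model is loaded.

```python
import re
from typing import Callable, List

# Matches a "#" followed by letters, digits or underscores.
HASHTAG_PATTERN = re.compile(r"#\w+")


def expand_hashtags(text: str, segment: Callable[[List[str]], List[str]]) -> str:
    """Replace every hashtag in `text` with its segmented form.

    `segment` is assumed to behave like `ws.segment` in the example above:
    list of hashtags in, list of segmentations out, same order.
    """
    hashtags = HASHTAG_PATTERN.findall(text)
    if not hashtags:
        return text
    # Segment each distinct hashtag once, preserving first-seen order.
    unique = list(dict.fromkeys(hashtags))
    mapping = dict(zip(unique, segment(unique)))
    return HASHTAG_PATTERN.sub(lambda match: mapping[match.group(0)], text)


# Stand-in segmenter so the sketch runs without downloading a model.
_KNOWN = {
    "#weneedanationalpark": "we need a national park",
    "#icecold": "ice cold",
}


def fake_segment(tags: List[str]) -> List[str]:
    return [_KNOWN[tag] for tag in tags]


print(expand_hashtags("Stay #icecold because #weneedanationalpark", fake_segment))
```

With a real segmenter, the call would be `expand_hashtags(tweet, ws.segment)`.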

Installation

pip install hashformers

It is possible to use hashformers without a reranker:

ws = WordSegmenter(
    segmenter_model_name_or_path="gpt2",
    reranker_model_name_or_path=None
)

If you want to use a reranker model, you must install mxnet. Here we install hashformers with mxnet-cu110, which is compatible with Google Colab. If you are installing in another environment, replace it with the mxnet package that matches your CUDA version.

pip install mxnet-cu110 
pip install hashformers
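If the same code has to run both with and without mxnet installed, one option is to decide on the reranker at startup. This is a minimal sketch, not part of hashformers: `pick_reranker` is a hypothetical helper that returns a model name when the `mxnet` package can be found and `None` otherwise, and the result is meant to be passed as `reranker_model_name_or_path`.

```python
import importlib.util
from typing import Optional


def pick_reranker(
    preferred: str = "bert-base-uncased",
    mxnet_available: Optional[bool] = None,
) -> Optional[str]:
    """Return `preferred` if mxnet is importable, otherwise None (no reranker).

    `mxnet_available` can be set explicitly, mainly for testing; by default the
    function checks whether the `mxnet` package can be located.
    """
    if mxnet_available is None:
        mxnet_available = importlib.util.find_spec("mxnet") is not None
    return preferred if mxnet_available else None


reranker = pick_reranker()
print("reranker:", reranker)

# Intended use (requires hashformers to be installed):
# ws = WordSegmenter(
#     segmenter_model_name_or_path="gpt2",
#     reranker_model_name_or_path=reranker,
# )
```

Finding the package does not guarantee that it matches the local CUDA version, so a failed import at model load time is still possible.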

Contributing

Pull requests are welcome! Read our paper for more details on the inner workings of our framework.

If you want to develop the library, you can install hashformers directly from this repository (or your fork):

git clone https://github.com/ruanchaves/hashformers.git
cd hashformers
pip install -e .

Relevant Papers

Citation

@misc{rodrigues2021zeroshot,
      title={Zero-shot hashtag segmentation for multilingual sentiment analysis}, 
      author={Ruan Chaves Rodrigues and Marcelo Akira Inuzuka and Juliana Resplande Sant'Anna Gomes and Acquila Santos Rocha and Iacer Calixto and Hugo Alexandre Dantas do Nascimento},
      year={2021},
      eprint={2112.03213},
      archivePrefix={arXiv},
      primaryClass={cs.CL}
}

Evaluation

The table below compares hashformers with HashtagMaster (also known as "MPNR") and ekphrasis on five hashtag segmentation datasets.

HashSet-1 is a sample from the distant HashSet dataset. HashSet-2 is the lowercase version of HashSet-1, and HashSet-3 is the manually annotated portion of HashSet. More information on the datasets and their evaluation is available in the HashSet paper.

A script to reproduce the evaluation of ekphrasis is available at scripts/evaluate_ekphrasis.py.

dataset        library        accuracy
BOUN           HashtagMaster  81.60
BOUN           ekphrasis      44.74
BOUN           hashformers    83.68
HashSet-1      HashtagMaster  50.06
HashSet-1      ekphrasis       0.00
HashSet-1      hashformers    72.47
HashSet-2      HashtagMaster  45.04
HashSet-2      ekphrasis      55.73
HashSet-2      hashformers    47.43
HashSet-3      HashtagMaster  41.93
HashSet-3      ekphrasis      56.44
HashSet-3      hashformers    56.71
Stanford-Dev   HashtagMaster  73.12
Stanford-Dev   ekphrasis      51.38
Stanford-Dev   hashformers    80.04
average (all)  HashtagMaster  58.35
average (all)  ekphrasis      41.65
average (all)  hashformers    68.06

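The "average (all)" rows can be recomputed from the per-dataset accuracies. The sketch below uses only the numbers reported above; the variable and function names are illustrative. The recomputed averages agree with the reported ones to within 0.01, and the gap between hashformers and HashtagMaster comes out at roughly 9.7 percentage points.

```python
# Per-dataset accuracies copied from the evaluation results above.
ACCURACY = {
    "HashtagMaster": {
        "BOUN": 81.60, "HashSet-1": 50.06, "HashSet-2": 45.04,
        "HashSet-3": 41.93, "Stanford-Dev": 73.12,
    },
    "ekphrasis": {
        "BOUN": 44.74, "HashSet-1": 0.00, "HashSet-2": 55.73,
        "HashSet-3": 56.44, "Stanford-Dev": 51.38,
    },
    "hashformers": {
        "BOUN": 83.68, "HashSet-1": 72.47, "HashSet-2": 47.43,
        "HashSet-3": 56.71, "Stanford-Dev": 80.04,
    },
}


def average_accuracy(scores: dict) -> float:
    """Unweighted mean over datasets (each dataset counts equally)."""
    return sum(scores.values()) / len(scores)


averages = {library: average_accuracy(scores) for library, scores in ACCURACY.items()}
gap = averages["hashformers"] - averages["HashtagMaster"]

for library, value in averages.items():
    print(f"{library}: {value:.3f}")
print(f"hashformers minus HashtagMaster: {gap:.3f} points")
```

The mean is unweighted, so a single dataset such as HashSet-1, where ekphrasis scores 0.00, has a large effect on the overall ranking.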
"""
.. include:: ../../README.md
.. include:: ../../docs/EVALUATION.md
"""
from .segmenter import *

Sub-modules

hashformers.beamsearch
hashformers.ensemble
hashformers.evaluation
hashformers.experiments
hashformers.segmenter