Add missing tokenizer test files [:building_construction: in progress] #16627

SaulLu · 2022-04-06T10:07:24Z

🚀 Add missing tokenizer test files

Several tokenizers currently have no associated tests. I think that adding the test file for one of these tokenizers could be a very good way to make a first contribution to transformers.

Tokenizers concerned

not yet claimed

none

claimed

with an ongoing PR

none

with an accepted PR

Longformer @tgadeliya Add missing tokenizer tests - Longformer #17677
MobileBert @leondz MobileBERT tokenizer tests #16896
RetriBert @mpoemsl Add missing RetriBERT tokenizer tests #17017

How to contribute?

1. Claim a tokenizer

a. Choose a tokenizer from the list of "not yet claimed" tokenizers

b. Check that no one in the messages for this issue has indicated that they are already taking care of this tokenizer

c. Post a message in the issue saying that you are handling this tokenizer
2. Create a local development setup (if you have not already done so)

See the section "start-contributing-pull-requests" of the Contributing guidelines, where everything is explained. Don't be put off by step 5: for this contribution you only need to run locally the tests you add.
3. Follow the instructions in the README inside the templates/adding_a_missing_tokenization_test folder to generate the template with cookiecutter for the new test file you will be adding. Once the template is generated, don't forget to move the new test file to the sub-folder of the tests folder named after the model for which you are adding the test file. Some details about the questionnaire, assuming that the lowercase name of the model is brand_new_bert:
- "has_slow_class": Set to true if there is a tokenization_brand_new_bert.py file in the folder src/transformers/models/brand_new_bert.
- "has_fast_class": Set to true if there is a tokenization_brand_new_bert_fast.py file in the folder src/transformers/models/brand_new_bert.
- "slow_tokenizer_use_sentencepiece": Set to true if the tokenizer defined in the tokenization_brand_new_bert.py file uses sentencepiece. If this tokenizer does not have a tokenization_brand_new_bert.py file, set it to false.
4. Complete the setUp method in the generated test file. You can take inspiration from how it is done for the other tokenizers.
5. Try to run all the added tests. Some tests may not pass, so it is important to understand why: sometimes the common test is not suited to a tokenizer, and sometimes the tokenizer has a bug. You can also look at what is done in similar tokenizer tests. If there are big problems or you don't know what to do, we can discuss it in the PR (step 7).
6. (Bonus) Try to get a good understanding of the tokenizer so that you can add custom tests for it
7. Open a PR with the new test file added. Remember to fill in the PR title and message body (referencing this issue) and request a review from @LysandreJik and @SaulLu.
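
The two file-existence answers of the questionnaire can also be checked programmatically. The sketch below is only an illustration under the folder layout described above; the helper name questionnaire_answers and its arguments are hypothetical and are not part of transformers or of the cookiecutter template.

```python
from pathlib import Path


def questionnaire_answers(repo_root: Path, lowercase_model_name: str) -> dict:
    """Derive "has_slow_class" and "has_fast_class" from the files present.

    Hypothetical helper: it only checks the file layout described in the
    issue (src/transformers/models/<name>/tokenization_<name>[_fast].py).
    """
    model_dir = repo_root / "src" / "transformers" / "models" / lowercase_model_name
    slow_file = model_dir / f"tokenization_{lowercase_model_name}.py"
    fast_file = model_dir / f"tokenization_{lowercase_model_name}_fast.py"
    return {
        "has_slow_class": slow_file.is_file(),
        "has_fast_class": fast_file.is_file(),
    }
```

For example, questionnaire_answers(Path("~/repos/transformers").expanduser(), "electra") would inspect a local clone located in ~/repos, assuming that is where the repository was cloned.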

Tips

Do not hesitate to read the questions / answers in this issue 📰

tgadeliya · 2022-04-06T13:28:25Z

Hi, I would like to add tests for Longformer tokenizer

anmolsjoshi · 2022-04-06T15:15:16Z

@SaulLu I would like to add tests for Flaubert

Rajathbharadwaj · 2022-04-06T20:40:35Z

Hey, I would like to contribute for Electra. Pointers please!

SaulLu · 2022-04-07T17:34:40Z

Thank you all for offering your help!

@Rajathbharadwaj, sure! What do you need help with? Do you need more details on any of the steps listed in the main post?

farahdian · 2022-04-10T14:32:58Z

Hi, first-time contributor here. Could I add tests for Splinter?

farahdian · 2022-04-11T16:29:28Z

Is anyone else encountering this error with the cookiecutter command? My dev environment setup seemed to have gone fine...
Also, I had run the command inside the tests/splinter directory

SaulLu · 2022-04-11T16:56:31Z

@faiazrahman , thank you so much for working on this! Regarding your issue, if you're in the tests/splinter folder, can you try to run cookiecutter ../../templates/adding_a_missing_tokenization_test/ ?

You should have a newly created folder cookiecutter-template-BrandNewBERT inside tests/splinter. 🙂

If that's the case, you'll then need to do something like:

mv cookiecutter-template-BrandNewBERT/test_tokenization_brand_new_bert.py .
rm -r cookiecutter-template-BrandNewBERT/

Keep me posted 😄

farahdian · 2022-04-12T07:50:47Z

Thanks so much @SaulLu turns out it was due to not recognizing my installed cookiecutter so i sorted it out there. 👍

SaulLu · 2022-04-19T08:26:05Z

Hi @anmolsjoshi, @tgadeliya, @Rajathbharadwaj and @farahdian,

Just a quick message to see how things are going for you and if you have any problems. If you do, please share them! 🤗

farahdian · 2022-04-19T08:30:57Z

Thanks @SaulLu ! I've been exploring the tokenization test files in the repo just trying to figure out which ones would be a good basis for writing a tokenization test for splinter... if you have any guidance on this it would be super helpful!

Rajathbharadwaj · 2022-04-19T09:02:48Z

Hey @SaulLu my apologies, been a bit busy. I'll get started ASAP, however, I still didn't understand where exactly I should run the cookie cutter

Help on this would be helpful 😄

SaulLu · 2022-04-19T09:09:29Z

Hi @farahdian ,

Thank you very much for the update! To know where you stand, have you done step 3)? Is it for step 4) that you are looking for a similar tokenizer? 🙂

SaulLu · 2022-04-19T09:32:25Z

Hi @Rajathbharadwaj ,

Thank you for the update too!

I still didn't understand where exactly I should run the cookie cutter

You can run the cookiecutter command anywhere, as long as the command is followed by the path to the folder adding_a_missing_tokenization_test in the transformers repo that you have cloned locally.

When you run the command, it will create a new folder in your current directory. In this folder you will find a base for the Python test file, which you need to move inside the tests/electra folder of the local transformers clone. Once this file is moved, you can delete the folder that was created by the cookiecutter command.

Below is an example of the sequence of bash commands I would personally use:

(base) username@hostname:~$ cd ~/repos
(base) username@hostname:~/repos$ git clone git@github.com:huggingface/transformers.git
[Install my development setup]
(transformers-dev) username@hostname:~/repos$ cookiecutter transformers/templates/adding_a_missing_tokenization_test/
[Answer the questionnaire]
(transformers-dev) username@hostname:~/repos$ mv cookiecutter-template-Electra/test_tokenization_electra.py transformers/tests/electra/
(transformers-dev) username@hostname:~/repos$ rm -r cookiecutter-template-Electra/
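
For anyone who prefers to script the last two commands, here is a minimal Python sketch of the same "move the generated file, then delete the template folder" step. The function name install_generated_test and its parameters are hypothetical; only the standard library is used, and nothing here is part of the cookiecutter template itself.

```python
import shutil
from pathlib import Path


def install_generated_test(template_dir: Path, test_filename: str, tests_dir: Path) -> Path:
    """Move the generated test file into tests_dir and delete template_dir.

    Hypothetical helper mirroring the `mv` and `rm -r` commands above.
    """
    source = template_dir / test_filename
    if not source.is_file():
        raise FileNotFoundError(f"Generated test file not found: {source}")
    # Create the model's test folder if it does not exist yet.
    tests_dir.mkdir(parents=True, exist_ok=True)
    destination = tests_dir / test_filename
    shutil.move(str(source), str(destination))
    # The template folder is no longer needed once the file has been moved.
    shutil.rmtree(template_dir)
    return destination
```

With the paths from the commands above, the call would look like install_generated_test(Path("cookiecutter-template-Electra"), "test_tokenization_electra.py", Path("transformers/tests/electra")), run from ~/repos.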

Hope that'll help you 😄

farahdian · 2022-04-19T09:40:56Z

Appreciate your patience @SaulLu ! Yup I've done step 3 and generated a test tokenization file with cookiecutter. Now onto working on the setUp method 😄

SaulLu · 2022-04-19T10:15:50Z

@farahdian , this is indeed a very good question: finding the closest tokenizer to draw inspiration from and identifying the important difference with that tokenizer is the most interesting part.

For that there are several ways to start:

Identify the high level features of the tokenizer by looking at the contents of the model's "reference" checkpoint files (listed inside the PRETRAINED_VOCAB_FILES_MAP global variables in the tokenizer's files) on the hub. A similar model would most likely store the tokenizer vocabulary in the same way (with only a vocab file, with both a vocab and a merges files, with a sentencepiece binary file or with only a tokenizer.json file).
Read the high level explanation of the model in the transformers documentation (e.g. for Splinter)
Read the paper corresponding to the model
Look at the implementation in transformers lib
Look at the original implementation of the model (often mentioned in the paper)
Look at the discussions on the PR in which the model was added

For the model you're in charge @farahdian:

Transformers' documentation mentions that:

Use SplinterTokenizer (rather than BertTokenizer), as it already contains this special token. Also, its default behavior is to use this token when two sequences are given (for example, in the run_qa.py script).
Splinter's paper mentions that:

Splinter-base shares the same architecture (transformer encoder (Vaswani et al., 2017)), vocabulary (cased wordpieces), and number of parameters (110M) with SpanBERT-base (Joshi et al., 2020).
And SpanBERT's paper mentions that:

We reimplemented BERT’s model and pre-training method in fairseq (Ott et al., 2019). We used the model configuration of BERT large as in Devlin et al. (2019) and also pre-trained all our models on the same corpus: BooksCorpus and English Wikipedia using cased Wordpiece tokens.
and the vocabulary files of bert-base-cased (vocab file) and of splinter-base (vocab file) look very similar

Given these mentions, it seems that Splinter's tokenizer is very similar to Bert's one. It would be interesting to confirm this impression and to understand all the differences between SplinterTokenizer and BertTokenizer so that it is well reflected in the test 🙂
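
One way to check the "look very similar" impression is to compare the two vocab files token by token. The sketch below assumes both files use the BERT-style layout of one token per line, with the line index as the token id; load_vocab and compare_vocabs are hypothetical helper names, and the files are assumed to have been downloaded locally beforehand.

```python
from pathlib import Path


def load_vocab(path: Path) -> dict:
    """Map each token to its line index (assumed one-token-per-line layout)."""
    with open(path, encoding="utf-8") as handle:
        return {line.rstrip("\n"): index for index, line in enumerate(handle)}


def compare_vocabs(vocab_a: dict, vocab_b: dict) -> dict:
    """Summarize how two token-to-id mappings differ."""
    tokens_a, tokens_b = set(vocab_a), set(vocab_b)
    shared = tokens_a & tokens_b
    return {
        "only_in_a": sorted(tokens_a - tokens_b),
        "only_in_b": sorted(tokens_b - tokens_a),
        # Tokens present in both files but stored at different ids.
        "different_id": sorted(t for t in shared if vocab_a[t] != vocab_b[t]),
    }
```

An empty "only_in_a" and "only_in_b" with a short "different_id" list would support reusing the toy vocabulary of the closest tokenizer's tests, while any extra special tokens would point to behavior worth a dedicated test.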

Rajathbharadwaj · 2022-04-19T14:25:44Z

Thank you so much @SaulLu
I understood now, however, I am skeptical about slow_tokenizer_use_sentencepiece question, but I set it to True as it had the tokenization_electra.py file but I didn't understand

"Set true if the tokenizer defined in the tokenization_brand_new_bert.py file uses sentencepiece"

So did I select correctly? Or should I set it to False? Apologies for asking so many questions 😄

However now I've started adding tests for Electra will keep you posted if I run into something I don't understand.

Thanks for helping once again!

tgadeliya · 2022-04-19T15:01:04Z

Hi @SaulLu,
I think my case is the easiest one, because the Longformer model actually uses the same tokenizer as RoBERTa, with no differences. So I adapted the tests from the RoBERTa tokenizer (small refactor and changes) and prepared a branch with them. Nevertheless, I really want to dive deeper and study the code of TokenizerTesterMixin, and if I find some untested behaviour after that, I will add new tests.
But I have one doubt that you can resolve. Do you expect the Longformer tests to use a different toy tokenizer example than the RoBERTa tests? Or should I write my own tests from scratch?

SaulLu · 2022-04-19T17:05:41Z

@Rajathbharadwaj , I'm happy to help! Especially as your questions will surely be useful for other people

however, I am skeptical about slow_tokenizer_use_sentencepiece question, but I set it to True as it had the tokenization_electra.py file but I didn't understand
"Set true if the tokenizer defined in the tokenization_brand_new_bert.py file uses sentencepiece"
So did I select correctly? Or should I set it to False? Apologies for asking so many questions 😄

Some XxxTokenizer classes (without the Fast at the end, implemented in the tokenization_xxx.py file) use a backend based on the sentencepiece library. For example, T5Tokenizer uses a backend based on sentencepiece: you can see this import at the beginning of the tokenization_t5.py file:

transformers/src/transformers/models/t5/tokenization_t5.py

Line 24 in 3dd57b1

import sentencepiece as spm

and you can see that the backend is instantiated here:

transformers/src/transformers/models/t5/tokenization_t5.py

Lines 151 to 152 in 3dd57b1

    
    self.sp_model = spm.SentencePieceProcessor(**self.sp_model_kwargs)
    self.sp_model.Load(vocab_file)

On the contrary, BertTokenizer for example does not use a sentencepiece backend.
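
The import check described above can be automated as a rough heuristic. The following sketch parses a tokenizer source file with Python's ast module and reports whether it imports sentencepiece at all; imports_sentencepiece is a hypothetical helper, and it will miss a tokenizer that only inherits a sentencepiece backend from a class defined in another file.

```python
import ast


def imports_sentencepiece(source_code: str) -> bool:
    """Return True if the given Python source imports the sentencepiece package."""
    tree = ast.parse(source_code)
    for node in ast.walk(tree):
        if isinstance(node, ast.Import):
            # Covers `import sentencepiece` and `import sentencepiece as spm`.
            if any(alias.name.split(".")[0] == "sentencepiece" for alias in node.names):
                return True
        elif isinstance(node, ast.ImportFrom):
            # Covers `from sentencepiece import ...`; relative imports are ignored.
            if node.level == 0 and node.module and node.module.split(".")[0] == "sentencepiece":
                return True
    return False
```

To apply it to a file, read the file's text first, for instance imports_sentencepiece(Path("src/transformers/models/t5/tokenization_t5.py").read_text()) from the root of a local clone, after importing Path from pathlib.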

I hope this helped you!

SaulLu · 2022-04-19T17:30:13Z

Hi @tgadeliya ,

Thanks for the update!

But I think I have one doubt, that you can resolve. Are you anticipating from Longformer tests to have different toy tokenizer example than in RoBERTa tests? Or maybe I should write my own tests from scratch?

In your case, I wouldn't be surprised if Longformer uses the same tokenizer as RoBERTa. If so, it seems legitimate to use the same toy tokenizer. Maybe the only check you can do to confirm this hypothesis is to compare the vocabularies of the "main" checkpoints of both models:

!wget https://huggingface.co/allenai/longformer-base-4096/raw/main/merges.txt
!wget https://huggingface.co/allenai/longformer-base-4096/raw/main/vocab.json
!wget https://huggingface.co/roberta-base/raw/main/merges.txt
!wget https://huggingface.co/roberta-base/raw/main/vocab.json

!diff merges.txt merges.txt.1
!diff vocab.json vocab.json.1

Turns out the result confirms it!
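
The same equality check can be done from Python once the files are downloaded. This is a minimal sketch that assumes the Longformer and RoBERTa files have been saved into two separate local folders (rather than side by side as merges.txt and merges.txt.1); same_tokenizer_files is a hypothetical helper built on the standard filecmp module.

```python
import filecmp
from pathlib import Path


def same_tokenizer_files(dir_a: Path, dir_b: Path, filenames=("vocab.json", "merges.txt")) -> dict:
    """Report, for each file name, whether the two copies have identical contents."""
    return {
        # shallow=False forces a byte-by-byte comparison of the contents.
        name: filecmp.cmp(dir_a / name, dir_b / name, shallow=False)
        for name in filenames
    }
```

A result of {"vocab.json": True, "merges.txt": True} corresponds to both diff commands above printing nothing.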

leondz · 2022-04-20T12:54:28Z

Hi, I'm happy to take MobileBert

elusenji · 2022-04-20T21:38:44Z

I'd like to work on ConvBert.

elusenji · 2022-04-22T12:22:28Z

Identify the high level features of the tokenizer by looking at the contents of the model's "reference" checkpoint files (listed inside the PRETRAINED_VOCAB_FILES_MAP global variables in the tokenizer's files) on the hub. A similar model would most likely store the tokenizer vocabulary in the same way (with only a vocab file, with both a vocab and a merges files, with a sentencepiece binary file or with only a tokenizer.json file).

@SaulLu I'm having trouble identifying ConvBert's 'reference' checkpoint files on the hub. Would you kindly provide more guidance on this?

SaulLu · 2022-04-22T12:34:21Z

Hi @elusenji ,

In the src/transformers/models/convbert/tokenization_convbert.py file you can find the global variable PRETRAINED_VOCAB_FILES_MAP:

transformers/src/transformers/models/convbert/tokenization_convbert.py

Lines 24 to 30 in 6d90d76

    
    PRETRAINED_VOCAB_FILES_MAP = {
        "vocab_file": {
            "YituTech/conv-bert-base": "https://huggingface.co/YituTech/conv-bert-base/resolve/main/vocab.txt",
            "YituTech/conv-bert-medium-small": "https://huggingface.co/YituTech/conv-bert-medium-small/resolve/main/vocab.txt",
            "YituTech/conv-bert-small": "https://huggingface.co/YituTech/conv-bert-small/resolve/main/vocab.txt",
        }
    }

In particular YituTech/conv-bert-base is a reference checkpoint for ConvBert.
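
To read such a map at a glance, one can list which kinds of files it declares and which checkpoints it names. The sketch below copies the ConvBert map shown above into a standalone script; describe_vocab_storage is a hypothetical helper, not a transformers function.

```python
# Copied from the snippet above so that the example is self-contained.
PRETRAINED_VOCAB_FILES_MAP = {
    "vocab_file": {
        "YituTech/conv-bert-base": "https://huggingface.co/YituTech/conv-bert-base/resolve/main/vocab.txt",
        "YituTech/conv-bert-medium-small": "https://huggingface.co/YituTech/conv-bert-medium-small/resolve/main/vocab.txt",
        "YituTech/conv-bert-small": "https://huggingface.co/YituTech/conv-bert-small/resolve/main/vocab.txt",
    }
}


def describe_vocab_storage(files_map: dict) -> dict:
    """List the declared file kinds and the checkpoint names found in the map."""
    checkpoints = sorted({name for per_kind in files_map.values() for name in per_kind})
    return {"file_kinds": sorted(files_map), "checkpoints": checkpoints}


summary = describe_vocab_storage(PRETRAINED_VOCAB_FILES_MAP)
print(summary["file_kinds"])   # ['vocab_file']
print(summary["checkpoints"])
```

Here the only file kind is "vocab_file", which matches the "with only a vocab file" case mentioned earlier in the thread and suggests looking at other tokenizers that store their vocabulary the same way.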

Is this what you were having trouble with? ☺️

elusenji · 2022-04-22T13:43:18Z

Yes, this helps!

danhphan · 2022-10-05T06:04:01Z

Hi @SaulLu , sorry for late response and being quite slow. I am still working on RemBert and will try to finish it soon in the coming weeks. Thank you.

IMvision12 · 2022-10-28T14:27:20Z

@SaulLu are there any tokenizers left???

danhphan · 2022-10-29T23:15:47Z

Hi @IMvision12, I am busy on the deadline of a couple of other projects, so can you work on RemBert? Thanks!

IMvision12 · 2022-10-30T12:25:01Z

Yeah sure @danhphan Thanks.

danhphan · 2022-11-02T02:57:26Z

Thank you @IMvision12 !

y3sar · 2023-05-04T09:48:02Z

Seems like a bit late to the party 😅. Is there any tokenizer not listed here that I can write tests for? Or maybe if some tokenizer becomes available here. Please let me know @SaulLu I would love to contribute 😀

SaulLu · 2023-05-04T14:28:40Z

Unfortunately, I don't have much time left to help with transformers now. But let me ping @ArthurZucker for visibility

ArthurZucker · 2023-05-25T13:16:51Z

Hey @y3sar thanks for wanting to contribute. I think that the RemBert tests PR was close, you can probably take that over if you want!
Other tests that might be missing:

y3sar · 2023-05-25T15:17:49Z

@ArthurZucker thanks for your reply. I will start working on RemBert tests.

rchan26 · 2023-09-28T17:34:35Z

hey @ArthurZucker, I'm happy to have a look at contributing to a few of these. I'll start off with gpt_neox 🙂

ENate · 2023-11-18T14:25:28Z

Hi. Are the tests still open for contribution? Thanks

nileshkokane01 · 2023-11-19T14:53:03Z

@ArthurZucker some of the claimed tokenizers are dormant. Can I take in one of them? If so, can you let me know which one.

cc: @SaulLu

ArthurZucker · 2023-11-20T06:58:39Z

Hey all! 🤗
If you don't find a PR open for any model feel free to do so.
If a PR has been inactive for quite some time, just ping the author to make sure they are alright with you taking over, or to check whether they still want to contribute! (unless it has been inactive for more than 2 months, in which case I think it's alright to work on it) 👍🏻

ashwinjohn3 · 2023-11-22T01:19:09Z

Feel free to take the Splinter model! Unfortunately, I am not working on it anymore!

nileshkokane01 · 2023-11-22T01:37:01Z

@ashwinjohn3 thanks!

I'll work on splinter as well.

ENate · 2023-11-22T11:00:41Z

In that case, I will check rembert, convbert and choose one to work on. I also assume that you guys are working on gpx_ and splinter.

… splinter

nileshkokane01 · 2023-11-23T13:14:24Z

In that case, I will check rembert, convbert and choose one to work on. I also assume that you guys are working on gpx_ and splinter.

@ENate Hi, there is already a PR for rembert. Hence, please refrain from duplication. :)

ENate · 2023-11-23T18:21:25Z

Hi @nileshkokane01 Thanks :) Will surely do.

logvinata · 2024-02-12T12:07:18Z

Hi, guys! Looks like only longformer, mobilebert and rembert have been merged. Can I try the other ones?
@nileshkokane01 hi! are you still working on splinter?
@ENate hi! are you working on convbert?

nileshkokane01 · 2024-02-12T12:08:56Z

@logvinata you can take splinter. I'm not working on it anymore.

ENate · 2024-02-12T14:52:25Z

Hi @logvinata yes, I am still going to work on it. I was off for a while but will soon open a PR on it.

bastrob · 2024-04-23T21:16:12Z

Hello, I decided to start my first contribution on the flaubert part. I encountered an assertion error when running the following class:

Assertion error:

I tried many things, but the "do_lowercase" parameter was never found in the signature of the tokenizer, until I tried to propagate it in the FlaubertTokenizer class, and the test passed:

Am I missing something? Is there a workaround?

Thanks in advance,
Bastien

amyeroberts · 2024-04-24T09:34:55Z

cc @ArthurZucker

ArthurZucker · 2024-04-25T14:28:40Z

Could you open a PR? It will be easier for me

SaulLu added the Good First Issue label Apr 6, 2022

SaulLu mentioned this issue Apr 13, 2022

Add test suite for flaubert tokenizer #15137

Closed

SaulLu changed the title ~~Add missing tokenizer test files~~ Add missing tokenizer test files [still need 5 first time contributors] Apr 19, 2022

SaulLu changed the title ~~Add missing tokenizer test files [still need 5 first time contributors]~~ Add missing tokenizer test files [still need 4 first time contributors] Apr 20, 2022

SaulLu changed the title ~~Add missing tokenizer test files [still need 4 first time contributors]~~ Add missing tokenizer test files [still need 3 first time contributors] Apr 21, 2022

leondz mentioned this issue Apr 22, 2022

MobileBERT tokenizer tests #16896

Merged

5 tasks

IMvision12 mentioned this issue Nov 21, 2022

Add missing tokenizer tests - RemBert #20355

Closed

nileshkokane01 mentioned this issue Nov 21, 2023

Added test cases for rembert refering to albert and reformer test_tok… #27637

Merged

5 tasks

nileshkokane01 pushed a commit to nileshkokane01/transformers that referenced this issue Nov 23, 2023

[Splinter] Fixes huggingface#16627 by implementing the test cases for…

97b1739

… splinter

nileshkokane01 linked a pull request Nov 23, 2023 that will close this issue

[WIP][Splinter] Fixes #16627 by implementing the test cases for splinter #27671

Open

5 tasks

amyeroberts added the Core: Tokenization Internals of the library; Tokenization. label Apr 24, 2024

bastrob mentioned this issue Apr 25, 2024

Add missing Flaubert tokenizer tests #30492

Open

5 tasks
