Turkish Named Entity Recognition (NER) Model Fine-Tuning with LLM

Uğur Özker
6 min read · Nov 28, 2023


Everyone working with artificial intelligence and big data is interested in Generative AI and LLM models, which have become very popular lately. In this article, I will discuss Named Entity Recognition (NER) with LLM models that support the Turkish language. Since Turkish is an agglutinative language, there are unfortunately few resources on processing it, and the efficiency of the resources that do exist is below expectations. Here I give examples of how you can build a NER model by applying fine-tuning to any suitable LLM model you can get from HuggingFace. I hope it will be helpful to anyone interested in the subject :)

LLM Model Source: https://huggingface.co/dbmdz/bert-base-turkish-cased

Introduction

The model's current version was trained on a filtered version of the Turkish OSCAR corpus (https://oscar-project.org), Turkish Wikipedia dumps, a number of OPUS collections (https://opus.nlpl.eu), and a special corpus provided by Kemal Oflazer of the Carnegie Mellon University Computer Science Department (https://www.andrew.cmu.edu/user/ko/).

The size of the final training data is 35 GB, consisting of 4,404,976,662 tokens.

Named Entity Recognition Dataset

Turkish NER Data Source: https://huggingface.co/datasets/wikiann

Data Source Review:

WikiANN is a multilingual named entity recognition dataset of Wikipedia sentences annotated with LOC (location), PER (person) and ORG (organization) tags in IOB2 format.

Supported Tasks and Leaderboards:

Named Entity Recognition (NER): The dataset can be used to train a model for named entity recognition in multiple languages or to evaluate the zero-shot cross-linguistic capabilities of multilingual models.

Alongside many other languages, the corpus includes a Turkish subset of 40,000 sentences (20,000 train + 10,000 test + 10,000 validation) annotated with the three label types mentioned above.

Sample Data Set and Format:
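The sample screenshot did not survive in this export; a WikiANN (tr) record looks roughly like this (the field names and tag ids follow the dataset card, the sentence itself is made up):

```python
# One WikiANN example: parallel lists of tokens and IOB2 tag ids.
# Tag ids: 0=O, 1=B-PER, 2=I-PER, 3=B-ORG, 4=I-ORG, 5=B-LOC, 6=I-LOC
sample = {
    "tokens": ["Mustafa", "Kemal", "Atatürk", "Selanik", "'te", "doğdu", "."],
    "ner_tags": [1, 2, 2, 5, 0, 0, 0],
    "langs": ["tr"] * 7,
}

id2label = {0: "O", 1: "B-PER", 2: "I-PER", 3: "B-ORG", 4: "I-ORG", 5: "B-LOC", 6: "I-LOC"}
print([id2label[t] for t in sample["ner_tags"]])
# ['B-PER', 'I-PER', 'I-PER', 'B-LOC', 'O', 'O', 'O']
```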

Hardware Specs

Model Review

As a starting point, we need to create our own entity types beyond the three predefined ones (PER, LOC, ORG). To do this, prepare an Excel file with the tokens in the first column, the NER tags in the second, the language in the third, and the label of the entity you want to define in the last column. The longer you make this list and the more varied the examples, the higher the probability and success rate of recognizing a similar entity within a sentence.
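As a sketch (the article only describes the column order, so the column and file names here are my own, and the extra sentence-id column is an assumption needed to regroup token rows into sentences):

```python
import pandas as pd

# One token per row, as described above; "sent_id" is an assumed helper column
# so the rows can be regrouped into sentence-level records.
rows = [
    (0, "Uğur", "B-PER", "tr"),
    (0, "IBM", "B-ORG", "tr"),
    (0, "'de", "O", "tr"),
    (0, "çalışıyor", "O", "tr"),
]
df = pd.DataFrame(rows, columns=["sent_id", "tokens", "ner_tags", "langs"])
# In practice: df = pd.read_excel("custom_entities.xlsx")  # hypothetical file name

label2id = {"O": 0, "B-PER": 1, "I-PER": 2, "B-ORG": 3, "I-ORG": 4, "B-LOC": 5, "I-LOC": 6}
records = []
for _, g in df.groupby("sent_id"):
    records.append({
        "tokens": g["tokens"].tolist(),
        "ner_tags": [label2id[t] for t in g["ner_tags"]],
        "langs": g["langs"].tolist(),
    })
print(records[0]["ner_tags"])  # [1, 3, 0, 0]
```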

Fine Tuning

First, we need to download the wikiann dataset from HuggingFace and add the rows from the NER Excel file we created, combining all the labeled data into a single integrated dataset as follows.

In the next step, we need to map every token to its label value. Because the tokenizer splits words into sub-word pieces, the word-level tags must be realigned to the tokenized output.
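The alignment itself can be isolated as a small helper; the commented lines sketch how it would plug into `datasets.map` with the article's tokenizer (kept as comments so the snippet runs without downloading anything):

```python
def align_labels(word_ids, word_labels):
    """Spread word-level NER tags over sub-word tokens; special tokens and
    sub-word continuation pieces get -100 so the loss function ignores them."""
    aligned, previous = [], None
    for wid in word_ids:
        if wid is None or wid == previous:
            aligned.append(-100)
        else:
            aligned.append(word_labels[wid])
        previous = wid
    return aligned

# Plugged into the tokenizer:
# from transformers import AutoTokenizer
# tokenizer = AutoTokenizer.from_pretrained("dbmdz/bert-base-turkish-cased")
# def tokenize_and_align(examples):
#     enc = tokenizer(examples["tokens"], truncation=True, is_split_into_words=True)
#     enc["labels"] = [align_labels(enc.word_ids(i), tags)
#                      for i, tags in enumerate(examples["ner_tags"])]
#     return enc
# tokenized = train_ds.map(tokenize_and_align, batched=True)

# e.g. [CLS] Ata ##türk geldi [SEP] with word-level tags [1, 0]:
print(align_labels([None, 0, 0, 1, None], [1, 0]))  # [-100, 1, -100, 0, -100]
```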

To ensure correct label matching in the config file that will be created, we must also specify the label values when loading the pre-trained model for token classification. Here, we pass the dbmdz/bert-base-turkish-cased model as the model_checkpoint parameter. You can use different model names or different classes depending on the scenario and suitability.
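Sketched out, the label configuration looks like this; the `from_pretrained` call is commented out because it downloads the full checkpoint:

```python
label_list = ["O", "B-PER", "I-PER", "B-ORG", "I-ORG", "B-LOC", "I-LOC"]
id2label = dict(enumerate(label_list))
label2id = {label: i for i, label in id2label.items()}

model_checkpoint = "dbmdz/bert-base-turkish-cased"
# from transformers import AutoModelForTokenClassification
# model = AutoModelForTokenClassification.from_pretrained(
#     model_checkpoint,
#     num_labels=len(label_list),
#     id2label=id2label,   # written into the saved config.json
#     label2id=label2id,
# )
print(label2id["B-ORG"])  # 3
```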

Now that we have the model and the data to train on, we can set up our training arguments. Here we determine values such as the learning rate, the number of epochs, whether the model should be automatically pushed to the HuggingFace Hub, whether metrics should be sent to the cloud for reporting, and the batch sizes for training and evaluation. These choices are critical to the model's final success rate.
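The values below are illustrative, not the ones used in the study (the article does not list them); the `TrainingArguments` constructor itself is shown in comments:

```python
# Illustrative hyperparameters; tune them for your own data and hardware.
config = {
    "output_dir": "bert-base-turkish-cased-finetuned-ner",
    "learning_rate": 2e-5,
    "num_train_epochs": 3,
    "per_device_train_batch_size": 16,  # train batch size
    "per_device_eval_batch_size": 16,   # evaluation batch size
    "weight_decay": 0.01,
    "push_to_hub": False,  # True auto-deploys checkpoints to the HuggingFace Hub
    "report_to": "none",   # e.g. "wandb" to ship metrics to the cloud for reporting
}
# from transformers import TrainingArguments
# args = TrainingArguments(**config)
print(config["num_train_epochs"])  # 3
```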

We define our metric computation method to calculate values such as precision, recall, F1 score and accuracy over the predicted labels.
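A sketch of such a method, following the common `seqeval` pattern from the transformers token-classification examples; the `evaluate`/`seqeval` call is commented out, with a plain token-accuracy computation standing in so the snippet runs on its own:

```python
import numpy as np

label_list = ["O", "B-PER", "I-PER", "B-ORG", "I-ORG", "B-LOC", "I-LOC"]
# import evaluate
# metric = evaluate.load("seqeval")  # entity-level scorer; pip install evaluate seqeval

def compute_metrics(eval_preds):
    logits, labels = eval_preds
    preds = np.argmax(logits, axis=-1)
    # Drop positions labeled -100 (special tokens / sub-word continuations).
    true_labels = [[label_list[l] for l in l_row if l != -100] for l_row in labels]
    true_preds = [[label_list[p] for p, l in zip(p_row, l_row) if l != -100]
                  for p_row, l_row in zip(preds, labels)]
    # Entity-level precision/recall/F1/accuracy would come from seqeval:
    # scores = metric.compute(predictions=true_preds, references=true_labels)
    # return {"precision": scores["overall_precision"], "recall": scores["overall_recall"],
    #         "f1": scores["overall_f1"], "accuracy": scores["overall_accuracy"]}
    flat = [(p, l) for p_row, l_row in zip(true_preds, true_labels)
            for p, l in zip(p_row, l_row)]
    return {"accuracy": sum(p == l for p, l in flat) / len(flat)}
```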

The Trainer class is the main component that brings together all the definitions, parameters and data we have prepared so far. With it, we create the fine-tuned model by calling the train method, which runs training on the resources we reserved with the arguments we provided. Afterwards, we can compute the trained model's success metrics on the validation data with the evaluate method.
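The wiring looks roughly like this (left as comments because it assumes the model, arguments and tokenized datasets built in the previous steps, plus GPU time):

```python
# from transformers import Trainer, DataCollatorForTokenClassification
#
# trainer = Trainer(
#     model=model,                          # the token-classification model loaded above
#     args=args,                            # the TrainingArguments
#     train_dataset=tokenized["train"],
#     eval_dataset=tokenized["validation"],
#     data_collator=DataCollatorForTokenClassification(tokenizer),
#     tokenizer=tokenizer,
#     compute_metrics=compute_metrics,
# )
# trainer.train()               # fine-tune on the reserved resources
# metrics = trainer.evaluate()  # success metrics on the validation data
```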

If you want to save the model in your local environment, you can save all the outputs under a folder by giving a name as follows. Afterwards, you can create new model objects by targeting the file path you saved and perform your inference functions by specifying which use-case you will use it for, thanks to the pipeline class.
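For example (also left as comments, since it needs the trained model from the previous step; the folder name matches the published model's name, but any name works):

```python
# trainer.save_model("bert-base-turkish-cased-finetuned-ner")  # config + weights + tokenizer
#
# from transformers import pipeline
# ner = pipeline(
#     "token-classification",                         # the use-case for inference
#     model="bert-base-turkish-cased-finetuned-ner",  # the saved folder path
#     aggregation_strategy="simple",                  # merge B-/I- pieces into entity spans
# )
# print(ner("Mustafa Kemal Atatürk Selanik'te doğdu."))
```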

Evaluation Results

Train Metrics

GPU Usage Metrics

Inference API & Containerization

After creating and saving the model, you can call it from your local environment or run inference through the version deployed to the HuggingFace repo. For this, in addition to the token you obtain from HuggingFace and the definitions below, you need to write a Flask application as follows so the model can be served over HTTP. The sample service I wrote is an HTTP POST method that takes a single text parameter in the request body. To avoid access problems, it is better to serve on port 8000.

Now that your model and Python Flask web API application are ready, you need to package them as a container image and distribute it to environments such as Kubernetes or OpenShift. After building the image with the simple Dockerfile below, you can deploy it to any environment you want.
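A minimal sketch of such a Dockerfile, assuming the Flask service lives in `app.py` next to the saved model folder, with a `requirements.txt` listing flask, transformers and torch (the file names are my own):

```dockerfile
FROM python:3.10-slim
WORKDIR /app
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt
COPY . .
# The Flask service listens on 8000 (see the API section above).
EXPOSE 8000
CMD ["python", "app.py"]
```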

Conclusion

The final, fine-tuned version of the model, with its performance measured, is available at the link below on the HuggingFace website. We carried out this study together with my teammates Bengü Sanem Pazvant, Ceyda Hamurcu and Seray Boynuyoğun from the IBM Client Engineering team. I would like to thank them all for their contributions and efforts.

Model Finalized Version: https://huggingface.co/ugrozkr/bert-base-turkish-cased-finetuned-ner


Uğur Özker

Computer Engineer, MSc, MBA, PMP®, Senior Solution Architect IBM