---
license: apache-2.0
datasets:
- papluca/language-identification
language:
- en
- de
- fr
- es
metrics:
- precision
- recall
- f1
- accuracy
pipeline_tag: text-classification
---
					
					
						
# German, English, French and Spanish Language Detector

The GEFS-language-detector model identifies German, English, French and Spanish text. In testing it reached an F1 score close to 100% for every supported language (0.9995 or higher; see Testing Results).

This model was fine-tuned from the base model [xlm-roberta-base](https://huggingface.co/xlm-roberta-base) on the papluca [Language Identification](https://huggingface.co/datasets/papluca/language-identification#additional-information) dataset.
					
					
						
## Predicted output

The model returns the detected language as one of the following language codes:
```
  - de as German
  - en as English
  - fr as French
  - es as Spanish
```
					
					
						
## Supported languages
The model currently supports four languages; more will be added in the future.

The following languages are supported:
- German (de)
- English (en)
- French (fr)
- Spanish (es)
					
					
						
## Use a pipeline as a high-level helper

```python
from transformers import pipeline

# One sample sentence per supported language
text = ["Mir gefällt die Art und Weise, Sprachen zu erkennen",
        "I like the way to detect languages",
        "Me gusta la forma de detectar idiomas",
        "J'aime la façon de détecter les langues"]

# Load the fine-tuned classifier from the Hugging Face Hub
pipe = pipeline("text-classification", model="ImranzamanML/GEFS-language-detector")

# top_k=1 keeps only the highest-scoring label for each input
lang_detect = pipe(text, top_k=1)
print("The detected languages are", lang_detect)
```
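
The pipeline result can be unpacked into readable language names. The sketch below assumes each input yields a list holding one dictionary with `label` and `score` keys, which is how recent `transformers` releases appear to behave when `top_k=1` is passed; the helper also accepts a bare dictionary in case the nesting differs. `LANGUAGE_NAMES`, `top_language` and `sample_output` are illustrative names that are not part of the model or the library, and the sample scores are invented.

```python
# Display names for the four codes listed in this model card
LANGUAGE_NAMES = {"de": "German", "en": "English", "fr": "French", "es": "Spanish"}


def top_language(prediction):
    """Return (language name, score) for one input's prediction.

    Accepts either a single {"label", "score"} dict or a list of such
    dicts, because the exact nesting is an assumption here.
    """
    if isinstance(prediction, dict):
        prediction = [prediction]
    best = max(prediction, key=lambda item: item["score"])
    # Fall back to the raw label if the code is not in the table
    return LANGUAGE_NAMES.get(best["label"], best["label"]), best["score"]


# Hand-written stand-in for pipeline output; the scores are invented
sample_output = [
    [{"label": "de", "score": 0.99}],
    [{"label": "en", "score": 0.98}],
]
for prediction in sample_output:
    name, score = top_language(prediction)
    print(f"{name}: {score:.2f}")
```

In practice `sample_output` would be replaced by the value returned from `pipe(text, top_k=1)`.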
					
					
						
## Load model directly

```python
from transformers import AutoTokenizer, AutoModelForSequenceClassification

# Download (or load from the local cache) the tokenizer and the classifier
tokenizer = AutoTokenizer.from_pretrained("ImranzamanML/GEFS-language-detector")
model = AutoModelForSequenceClassification.from_pretrained("ImranzamanML/GEFS-language-detector")
```
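
A model loaded this way returns raw logits rather than a language code, so one more step is needed to get a prediction. The following is a minimal sketch of that step in plain Python: a softmax over one row of logits followed by an argmax lookup. The `ID2LABEL` order and the example logits are made up for illustration; the real mapping should be read from `model.config.id2label`.

```python
import math

# Hypothetical label order, for illustration only; read the real mapping
# from model.config.id2label after loading the model.
ID2LABEL = {0: "de", 1: "en", 2: "es", 3: "fr"}


def softmax(logits):
    """Convert raw scores into probabilities that sum to 1."""
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def logits_to_prediction(logits, id2label=ID2LABEL):
    """Return (language code, probability) for one row of logits."""
    probs = softmax(logits)
    best = max(range(len(probs)), key=probs.__getitem__)
    return id2label[best], probs[best]


# With the real model, one row of logits would come from something like:
#   inputs = tokenizer("Me gusta detectar idiomas", return_tensors="pt")
#   logits = model(**inputs).logits[0].tolist()
example_logits = [-1.2, 0.3, 6.5, -0.8]  # made-up values, not model output
label, prob = logits_to_prediction(example_logits)
print(label, round(prob, 3))
```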
					
					
						
## Model Training

    Epoch    Training Loss    Validation Loss
    1        0.002600         0.000148
    2        0.001000         0.000015
    3        0.000000         0.000011
    4        0.001800         0.000009
    5        0.002700         0.000016
    6        0.001600         0.000012
    7        0.001300         0.000009
    8        0.001200         0.000008
    9        0.000900         0.000007
    10       0.000900         0.000007
					
					
						
## Testing Results
```
Language   Precision   Recall   F1       Accuracy
de         0.9997      0.9998   0.9998   0.9999
en         1.0000      1.0000   1.0000   1.0000
fr         0.9995      0.9996   0.9996   0.9996
es         0.9994      0.9996   0.9995   0.9996
```
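
The F1 column can be sanity-checked against the precision and recall columns. The sketch below assumes the standard definition of F1 as the harmonic mean of precision and recall; `f1_score` and `reported` are illustrative names. Because the inputs are the rounded values from the table, the recomputed figures may differ from the reported ones in the last digit.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall; 0.0 when both are zero."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Rounded precision and recall copied from the table above
reported = {
    "de": (0.9997, 0.9998),
    "en": (1.0000, 1.0000),
    "fr": (0.9995, 0.9996),
    "es": (0.9994, 0.9996),
}
for code, (precision, recall) in reported.items():
    print(code, f"{f1_score(precision, recall):.5f}")
```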
					
					
						
## About Author

- **Name**: Muhammad Imran Zaman
- **Company**: [Theum AG](https://theum.com/en/index.htm?t=)
- **Role**: Machine Learning Engineer

**Professional Links**:
  - Kaggle: [Profile](https://www.kaggle.com/muhammadimran112233)
  - LinkedIn: [Profile](https://linkedin.com/in/muhammad-imran-zaman)
  - Google Scholar: [Profile](https://scholar.google.com/citations?user=ulVFpy8AAAAJ&hl=en)
  - YouTube: [Channel](https://www.youtube.com/@consolioo)
  - GitHub: [Profile](https://github.com/Imran-ml)