ncls-p and lbourdois committed
Commit 71a2bff · verified · 1 Parent(s): f7d8250

Improve language tag (#1)


- Improve language tag (ca5bf71c61efc4e1b2ec2a21e5e98b6184322c66)


Co-authored-by: Loïck BOURDOIS <[email protected]>

Files changed (1)
  1. README.md +133 -121
README.md CHANGED
---
language:
- zho
- eng
- fra
- spa
- por
- deu
- ita
- rus
- jpn
- kor
- vie
- tha
- ara
tags:
- qwen2
- text-generation
- summarization
- key-points
- blog-summarization
- unsloth
datasets:
- ncls-p/blog-key-points
license: cc-by-4.0
base_model: Qwen/Qwen2.5-7B-Instruct
---

# Qwen2.5-7B-blog-key-points

This model is fine-tuned from [Qwen/Qwen2.5-7B-Instruct](https://huggingface.co/Qwen/Qwen2.5-7B-Instruct) on the [blog-key-points dataset](https://huggingface.co/datasets/ncls-p/blog-key-points). It specializes in extracting key points from blog articles and web content, producing concise bullet-point summaries that capture the essential information.

## Model Description

**Qwen2.5-7B-blog-key-points** is a 7B-parameter model fine-tuned specifically to extract key points from articles. It processes a full article and generates a concise, bullet-point summary highlighting the most important information. Compared to the 3B version, this model offers better understanding of complex articles and more nuanced summaries.

### Model Details

- **Model Type:** Qwen2.5 (7B parameters)
- **Base Model:** [Qwen/Qwen2.5-7B-Instruct](https://huggingface.co/Qwen/Qwen2.5-7B-Instruct)
- **Training Dataset:** [ncls-p/blog-key-points](https://huggingface.co/datasets/ncls-p/blog-key-points)
- **Language:** English (fine-tuning data)
- **License:** [CC-BY-4.0](https://creativecommons.org/licenses/by/4.0/)
- **Fine-tuning Approach:** Instruction fine-tuning on article-summary pairs

## Uses

### Direct Use

This model is designed for extracting key points from articles. You can use it directly for:

- Summarizing blog posts
- Extracting important information from news articles
- Creating bullet-point summaries of long-form content
- Generating concise overviews of research papers
- Distilling complex information into digestible points

### Example Usage

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "ncls-p/Qwen2.5-7B-blog-key-points"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)

article = """
[Your article text here]
"""

# Qwen2.5-Instruct models expect chat-formatted input
messages = [
    {"role": "user", "content": f"Extract the key points from the following article:\n\n{article}"}
]
inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
)

# Cap the number of generated tokens rather than the total length,
# so long articles do not exhaust the budget before generation starts
outputs = model.generate(inputs, max_new_tokens=512)
response = tokenizer.decode(outputs[0][inputs.shape[-1]:], skip_special_tokens=True)

print(response)
```
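
Running a 7B model in full precision is memory-hungry. As a minimal sketch (assuming a CUDA-capable GPU and the `accelerate` package, neither of which the model card prescribes), loading in half precision with automatic device placement cuts the footprint roughly in half:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "ncls-p/Qwen2.5-7B-blog-key-points"
tokenizer = AutoTokenizer.from_pretrained(model_id)

# bfloat16 roughly halves memory use versus float32; device_map="auto"
# places the weights on the available GPU(s) (requires `accelerate`)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    device_map="auto",
)
```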

## Training

The model was fine-tuned on the [blog-key-points dataset](https://huggingface.co/datasets/ncls-p/blog-key-points), which contains 200 article-summary pairs. Each pair consists of a full article and a bullet-point summary of key points extracted using AI.

### Training Procedure

- **Fine-tuning Framework:** [Unsloth](https://github.com/unslothai/unsloth)
- **Training Data Format:**
  ```json
  {
    "instruction": "",
    "input": "Full article content",
    "output": "Here are the key points of the article:\n* Key point 1\n* Key point 2\n* Key point 3\n..."
  }
  ```
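
For illustration, one plausible way to turn a record in this format into a chat-style training example (the field names come from the dataset; the mapping itself is a sketch, not the repository's actual training code):

```python
def to_messages(record: dict) -> list[dict]:
    # The instruction field is empty in this dataset, so the user turn is
    # effectively the article itself; the summary becomes the assistant turn.
    user_content = f"{record['instruction']}\n\n{record['input']}".strip()
    return [
        {"role": "user", "content": user_content},
        {"role": "assistant", "content": record["output"]},
    ]
```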

## Evaluation

The model was evaluated on its ability to extract relevant key points from articles not seen during training. Evaluation focused on:

1. **Relevance:** How well the extracted points capture the main ideas of the article
2. **Conciseness:** The ability to summarize information in a clear, bullet-point format
3. **Completeness:** Whether all important information is captured in the summary
4. **Coherence:** The logical flow and organization of the extracted points
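
The card does not report numeric scores. As a rough illustration only (these heuristics are assumptions, not the evaluation actually performed), the conciseness criterion lends itself to simple automated checks:

```python
def summary_stats(article: str, summary: str) -> dict:
    # Crude proxies for conciseness: how many bullet points the summary
    # contains, and how much shorter it is than the source article.
    bullets = [line for line in summary.splitlines() if line.strip().startswith("*")]
    return {
        "num_bullets": len(bullets),
        "compression_ratio": len(summary.split()) / max(len(article.split()), 1),
    }
```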

## Limitations and Biases

- The model may inherit biases present in the training data, including potential biases in the source articles or in the key point extraction process.
- Performance may vary depending on the length, complexity, and domain of the input article.
- The model is primarily trained on English-language content and may not perform well on content in other languages.
- As with any summarization model, there is a risk of omitting important information or misrepresenting the original content.
- While the 7B parameter size offers improved capabilities over the 3B version, it also requires more computational resources to run.

## How to Cite

If you use this model in your research, please cite:

```bibtex
@misc{qwen25-7b-blog-key-points,
  author = {ncls-p},
  title = {Qwen2.5-7B-blog-key-points},
  year = {2024},
  publisher = {Hugging Face},
  journal = {Hugging Face model repository},
  howpublished = {\url{https://huggingface.co/ncls-p/Qwen2.5-7B-blog-key-points}},
}
```

## Dataset Creation

The dataset used to train this model was created with the [llm-to-blog-key-points-dataset](https://github.com/ncls-p/llm-to-blog-key-points-dataset), a CLI tool that uses AI to extract key points from web articles and adds them to a dataset in a structured format.
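
As a minimal sketch of how such structured records might be appended (illustrative only; see the linked repository for the tool's actual implementation):

```python
import json

def append_record(article_text: str, key_points: list[str], path: str = "dataset.jsonl") -> None:
    # Persist one article/summary pair in the instruction format shown
    # under Training Data Format above.
    record = {
        "instruction": "",
        "input": article_text,
        "output": "Here are the key points of the article:\n"
        + "\n".join(f"* {point}" for point in key_points),
    }
    with open(path, "a", encoding="utf-8") as f:
        f.write(json.dumps(record, ensure_ascii=False) + "\n")
```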