TimKoornstra committed on
Commit 4728211 · 1 Parent(s): 63e6b7f

Update README.md

Files changed (1)
  1. README.md +29 -6
README.md CHANGED
@@ -12,11 +12,34 @@ language:
  - en
  ---
 
- # SAURON: a Stylistic AUthorship RepresentatiON model
 
- This is a [sentence-transformers](https://www.SBERT.net) model: It maps sentences & paragraphs to a 768 dimensional dense vector space and can be used for tasks like clustering or semantic search.
 
- For more information, the corresponding thesis, and the training setup, see the [GitHub repository](https://github.com/TimKoornstra/SAURON).
 
  ## Usage (Sentence-Transformers)
 
@@ -32,7 +55,7 @@ Then you can use the model like this:
  from sentence_transformers import SentenceTransformer
  sentences = ["This is an example sentence", "Each sentence is converted"]
 
- model = SentenceTransformer('{MODEL_NAME}')
  embeddings = model.encode(sentences)
  print(embeddings)
  ```
@@ -58,8 +81,8 @@ def mean_pooling(model_output, attention_mask):
  sentences = ['This is an example sentence', 'Each sentence is converted']
 
  # Load model from HuggingFace Hub
- tokenizer = AutoTokenizer.from_pretrained('{MODEL_NAME}')
- model = AutoModel.from_pretrained('{MODEL_NAME}')
 
  # Tokenize sentences
  encoded_input = tokenizer(sentences, padding=True, truncation=True, return_tensors='pt')
 
@@ -12,11 +12,34 @@ language:
  - en
  ---
 
+ # SAURON: Stylistic AUthorship RepresentatiON Model
 
+ ## Overview
 
+ SAURON is a sentence-transformers model designed to represent the unique stylistic nuances of authorship. By mapping sentences and paragraphs into a 768-dimensional dense vector space, SAURON can be employed for tasks such as clustering or stylistic search. This model was developed as part of a master's thesis in Artificial Intelligence, and it leverages semantically similar utterances to enhance writing style embedding models.
+
+ ## Key Features
+
+ - **Semantically Similar Utterances**: SAURON uses pairs of utterances that convey the same meaning but are expressed differently in style. This approach helps the model focus more on the stylistic aspects rather than the content.
+ - **Diverse Training Data**: The model was trained on a vast range of conversations from Reddit, ensuring a broad representation of both authorship and topics.
+ - **Performance Evaluation**: The STyle EvaLuation (STEL) framework was employed to gauge the model's efficacy in capturing writing styles.
+ - **Content Control**: The introduction of semantically similar utterances greatly enhanced performance, offering better control over content.
+
+ ## Applications
+
+ - **Stylistic Search**: Search for content based on its writing style rather than its subject matter.
+ - **Clustering**: Group text based on the stylistic similarities of the authors.
+ - **Style-Content Disentanglement**: Enhance models and applications that require distinguishing between style and content.
+
+ ## Research Insights
+
+ 1. While semantically similar utterances significantly improved performance, the most efficient approach combines this technique with conversation-based sampling.
+ 2. Strategies such as maintaining diversity in authorship and topics proved effective for data preparation.
+ 3. The SAURON model considerably outperformed its predecessors, marking a significant step forward in style-content disentanglement tasks.
+
+ ## More Information
+
+ For a comprehensive overview, including the complete thesis and training setup details, visit the [SAURON GitHub repository](https://github.com/TimKoornstra/SAURON).
 
  ## Usage (Sentence-Transformers)
 
 
@@ -32,7 +55,7 @@ Then you can use the model like this:
  from sentence_transformers import SentenceTransformer
  sentences = ["This is an example sentence", "Each sentence is converted"]
 
+ model = SentenceTransformer('TimKoornstra/SAURON')
  embeddings = model.encode(sentences)
  print(embeddings)
  ```
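
For the stylistic search application named in the updated README, embeddings such as those returned by `model.encode(sentences)` could be compared with cosine similarity. The following is a minimal sketch under stated assumptions, not the repository's own method: the helper names `cosine_similarity` and `rank_by_style` are hypothetical, and the 3-dimensional toy vectors are made-up stand-ins for real 768-dimensional SAURON embeddings.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def rank_by_style(query_vec, candidate_vecs):
    """Return candidate indices ordered from most to least similar to the query."""
    scores = [cosine_similarity(query_vec, c) for c in candidate_vecs]
    return sorted(range(len(candidate_vecs)), key=lambda i: scores[i], reverse=True)


# Toy vectors for illustration only (assumption); in practice each vector
# would be one row of the array returned by model.encode(...).
query = [1.0, 0.0, 0.0]
candidates = [
    [0.0, 1.0, 0.0],  # orthogonal to the query
    [0.9, 0.1, 0.0],  # nearly parallel to the query
    [0.5, 0.5, 0.0],  # in between
]

print(rank_by_style(query, candidates))  # prints [1, 2, 0]
```

Cosine similarity ignores vector length and compares direction only, which is a common choice for sentence embeddings; whether it is the best metric for SAURON embeddings is not stated in this commit.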
 
@@ -58,8 +81,8 @@ def mean_pooling(model_output, attention_mask):
  sentences = ['This is an example sentence', 'Each sentence is converted']
 
  # Load model from HuggingFace Hub
+ tokenizer = AutoTokenizer.from_pretrained('TimKoornstra/SAURON')
+ model = AutoModel.from_pretrained('TimKoornstra/SAURON')
 
  # Tokenize sentences
  encoded_input = tokenizer(sentences, padding=True, truncation=True, return_tensors='pt')