Forum Replies Created
August 26, 2025 at 11:26 am in reply to: Create Your Own Personal Local AI Expert Using Your X Data #29669
A good tutorial to get started and to learn the lingo. My overall result was a failure, but that’s okay. Learned some things along the way.
I think this tutorial should switch gears from using X likes and conversations to something more personal. Brian, your posts on X always come across as very human. I think this tutorial would go a long way if it switched to Gmail and text-message correspondence from iCloud.
I don’t reply to or post anything on X because of the nature of social media and companies. Companies now have contracts that treat social media as a liability and something to avoid. Unless you work for yourself, or you’re a major figure in the industry, it makes no sense to voice an opinion on anything publicly. There’s no reason to risk losing your job or standing over a post.
After 15 years in the web dev industry, much of this stuff is foreign to me, and it’s been quite a humbling experience with this new tech. Not to mention the number of issues I ran into. I’ve been a web developer for 15+ years, but here I was drawing outside the lines with a crayon, trying something different.
I attempted to use Grok 4 to guide me through the process of reconfiguring the tutorial for GMail. Disclaimer: I am no Python expert. I provided Grok the existing code examples (X data) from the tutorial, and told it to make it work for GMail conversations.
DISCLAIMER BEFORE LOOKING: THESE GISTS DO NOT WORK, BUT THEY MIGHT GIVE INSIGHT INTO WHY MY MODEL FAILED.
gmail_prepare_data.py
```python
# Disclaimer: Attempted reworking the tutorial for GMail conversation over X.com.
# MAYBE A STARTING POINT FOR SOMEONE, AFTER SPENDING AN ENTIRE WEEKEND PUTTING IT ALL TOGETHER, THIS DID NOT WORK
# https://readmultiplex.com/2025/08/17/create-your-own-personal-local-ai-expert-using-your-x-data-a-complete-beginners-guide/
# after 4 hours of training model and GMail. The model was all over the place, and did not work.
# Most of what is here was assisted by Grok AI 4

import json
import re
import html
import mailbox
import email.utils
from email.header import decode_header
from datasets import Dataset
from collections import defaultdict
from datetime import datetime
from email.policy import default  # For modern email parsing
from tqdm import tqdm  # For progress bar; install with pip if needed
import mmap  # For memory-mapped file reading

# User configuration – replace with your details
USER_EMAILS = ["myemail@gmail.com", "my.email@gmail.com"]  # Your Gmail addresses to identify sent emails
USER_NAME = "My Name"  # Your name for filtering or display
MBOX_FILE = r"G:\My Path to Mbox\All mail Including Spam and Trash.mbox"  # Path to your Gmail MBOX file from Google Takeout

# Custom MboxReader using mmap for efficient sequential parsing of large files
class MboxReader:
    def __init__(self, filename):
        self.file = open(filename, 'rb')
        self.mm = mmap.mmap(self.file.fileno(), 0, access=mmap.ACCESS_READ)
        self.pos = 0
        line = self._readline()
        if not line.startswith(b'From '):
            raise ValueError("File does not start with 'From ' – not a valid mbox")

    def __enter__(self):
        return self

    def __exit__(self, exc_type, exc_value, exc_traceback):
        self.mm.close()
        self.file.close()

    def __iter__(self):
        return self

    def __next__(self):
        lines = []
        while True:
            line = self._readline()
            if not line or line.startswith(b'From '):
                if lines:
                    return email.message_from_bytes(b''.join(lines), policy=default)
                if not line:
                    raise StopIteration
                continue
            # Handle escaped 'From ' lines
            if line.startswith(b'>From '):
                line = line[1:]
            lines.append(line)

    def _readline(self):
        start = self.pos
        end = self.mm.find(b'\n', start) + 1
        if end == 0:  # No more lines
            line = self.mm[start:]
            self.pos = len(self.mm)
        else:
            line = self.mm[start:end]
            self.pos = end
        return line

# Helper function to safely decode text with fallback
def safe_decode(text, charset):
    if not isinstance(text, bytes):
        return text
    try:
        return text.decode(charset or 'utf-8', errors='ignore')
    except (LookupError, UnicodeDecodeError):
        # Fallback to utf-8 with replace if charset invalid or decode fails
        return text.decode('utf-8', errors='replace')

# Function to decode headers (e.g., subject)
def decode_header_text(header):
    decoded = decode_header(header or "")
    return ''.join([safe_decode(text, charset) for text, charset in decoded])

# Function to strip HTML tags
def strip_html(text):
    text = re.sub(r'<[^>]+>', '', text)
    return html.unescape(text)

# Function to extract plain text body (prefers text/plain, falls back to text/html)
def get_text_body(msg):
    if msg.is_multipart():
        for part in msg.get_payload():
            if part.get_content_type() == 'text/plain':
                return part.get_payload(decode=True).decode(errors='ignore').strip()
            elif part.get_content_type() == 'text/html':
                html_body = part.get_payload(decode=True).decode(errors='ignore')
                return strip_html(html_body).strip()
    else:
        if msg.get_content_type() == 'text/plain':
            return msg.get_payload(decode=True).decode(errors='ignore').strip()
        elif msg.get_content_type() == 'text/html':
            html_body = msg.get_payload(decode=True).decode(errors='ignore')
            return strip_html(html_body).strip()
    return ""

# Function to parse date, making it timezone-naive
def parse_date(date_str):
    try:
        dt = email.utils.parsedate_to_datetime(date_str)
        return dt.replace(tzinfo=None)  # Strip timezone to make naive
    except:
        return datetime.min

# Process messages with progress
messages = {}
id_to_children = defaultdict(list)
all_msg_ids = set()

with MboxReader(MBOX_FILE) as mbox:
    for msg in tqdm(mbox, desc="Parsing emails"):  # Progress bar here
        msgid = msg['Message-ID']
        if not msgid:
            continue
        all_msg_ids.add(msgid)
        from_header = msg['From']
        if not from_header:
            continue
        name, addr = email.utils.parseaddr(from_header)
        name = decode_header_text(name)
        addr = addr.lower()
        to_header = msg['To']
        to_name, to_addr = email.utils.parseaddr(to_header) if to_header else ("", "")
        to_name = decode_header_text(to_name)
        subject = decode_header_text(msg['Subject'])
        body = get_text_body(msg)
        # Skip empty or very short emails early
        if not body or len(body) < 50:
            continue
        # Basic cleaning: remove quoted text (lines starting with >)
        body = re.sub(r'(?m)^>.*\n?', '', body).strip()
        # Skip if still too short
        if len(body) < 50:
            continue
        in_reply_to = msg['In-Reply-To']
        messages[msgid] = {
            "msgid": msgid,
            "from_name": name,
            "from_addr": addr,
            "to_name": to_name,
            "to_addr": to_addr,
            "subject": subject,
            "body": body,
            "date": msg['Date'],
            "parsed_date": parse_date(msg['Date']),
            "in_reply_to": in_reply_to,
            "is_sent": addr in [e.lower() for e in USER_EMAILS] and name == USER_NAME
        }
        if in_reply_to:
            id_to_children[in_reply_to].append(msgid)

# Find root messages (no in_reply_to or in_reply_to not in messages)
roots = [mid for mid in messages if not messages[mid]["in_reply_to"] or messages[mid]["in_reply_to"] not in all_msg_ids]

# Function to build thread recursively
def build_thread(mid, visited=set()):
    if mid in visited:
        return []
    visited.add(mid)
    thread = [messages[mid]]
    for child in sorted(id_to_children[mid], key=lambda c: messages.get(c, {}).get("parsed_date", datetime.min)):
        thread.extend(build_thread(child, visited))
    return thread

# Build all threads with progress
threads = []
for root in tqdm(roots, desc="Building threads"):
    thread_msgs = build_thread(root)
    # Only include threads with at least one sent message by you
    if any(m["is_sent"] for m in thread_msgs):
        threads.append(thread_msgs)

# Format into conversational dataset and save as JSONL
with open('gmail_conversations.jsonl', 'w', encoding='utf-8') as f:
    for thread in tqdm(threads, desc="Formatting and saving data"):
        if len(thread) <= 1:  # Skip single-message threads
            continue
        conversation = []
        thread_subject = thread[0]["subject"]  # Use the root subject
        for msg in thread:
            role = "assistant" if msg["is_sent"] else "user"
            from_name = msg["from_name"] or msg["from_addr"]
            content = f"From: {from_name}\nSubject: {msg['subject']}\n\n{msg['body']}"
            conversation.append({"role": role, "content": content})
        # Write each conversation as a JSON line
        json.dump({
            "conversation": conversation,
            "subject": thread_subject
        }, f)
        f.write('\n')

print("Conversational dataset prepared! Saved to 'gmail_conversations.jsonl'")
```
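If anyone reuses these gists, it’s probably worth eyeballing the JSONL before committing hours to training. A rough sanity-check sketch, not from the tutorial, assuming the gmail_conversations.jsonl produced above:

```python
# Hypothetical sanity check for gmail_conversations.jsonl (my addition, not part of the tutorial).
import json

with open('gmail_conversations.jsonl', encoding='utf-8') as f:
    conversations = [json.loads(line) for line in f]

print(f"{len(conversations)} conversations")

# Peek at the first thread: the 'assistant' turns should read like your own writing.
for turn in conversations[0]["conversation"][:4]:
    print(turn["role"], "->", turn["content"][:120].replace("\n", " "))
```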
gmail_finetune.py
```python
# Disclaimer: Attempted reworking the tutorial for GMail conversation over X.com.
# MAYBE A STARTING POINT FOR SOMEONE, AFTER SPENDING AN ENTIRE WEEKEND PUTTING IT ALL TOGETHER, THIS DID NOT WORK
# https://readmultiplex.com/2025/08/17/create-your-own-personal-local-ai-expert-using-your-x-data-a-complete-beginners-guide/
# after 4 hours of training model and GMail. The model was all over the place, and did not work.
# Most of what is here was assisted by Grok AI 4

from peft import get_peft_model, LoraConfig
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
from trl import SFTTrainer, SFTConfig
from datasets import load_dataset
from huggingface_hub import login
import torch
import multiprocessing
import os

# Suppress Hugging Face symlink warning (optional)
os.environ["HF_HUB_DISABLE_SYMLINKS_WARNING"] = "true"

model_name = "google/gemma-3-270m"

# Load tokenizer (standard for early mapping)
tokenizer = AutoTokenizer.from_pretrained(
    model_name,
    token="TOKEN GOES HERE"  # Your token for gated model
)

# Set custom chat template for Gemma
tokenizer.chat_template = "{{ bos_token }}{% for message in messages %}{% if message['role'] == 'user' %}<start_of_turn>user\n{{ message['content'] | trim }}<end_of_turn>\n{% elif message['role'] == 'model' %}<start_of_turn>model\n{{ message['content'] | trim }}<end_of_turn>\n{% elif message['role'] == 'system' %}<start_of_turn>system\n{{ message['content'] | trim }}<end_of_turn>\n{% endif %}{% endfor %}{% if add_generation_prompt %}<start_of_turn>model\n{% endif %}"

# Formatting function to apply chat template
def formatting_prompts_func(examples):
    convos = examples["conversation"]
    texts = []
    for convo in convos:
        convo_mapped = [{"role": "model" if msg["role"] == "assistant" else msg["role"], "content": msg["content"]} for msg in convo]
        convo_mapped.insert(0, {"role": "system", "content": "You are [[ Name Goes Here ]], responding in your personal style."})
        text = tokenizer.apply_chat_template(convo_mapped, tokenize=False, add_generation_prompt=False)
        texts.append(text)
    return {"text": texts}

if __name__ == '__main__':
    multiprocessing.freeze_support()
    login(token="TOKEN GOES HERE")  # Your token

    dataset = load_dataset('json', data_files='gmail_conversations.jsonl', split='train')
    dataset = dataset.map(formatting_prompts_func, batched=True, num_proc=12)  # Use your Ryzen's cores for mapping

    # Define quantization config to replace deprecated load_in_4bit
    quantization_config = BitsAndBytesConfig(
        load_in_4bit=True,
        bnb_4bit_compute_dtype=torch.bfloat16,
        bnb_4bit_use_double_quant=True,
        bnb_4bit_quant_type="nf4"
    )

    # Load the base model on GPU with 4-bit quantization
    model = AutoModelForCausalLM.from_pretrained(
        model_name,
        quantization_config=quantization_config,
        torch_dtype=torch.bfloat16,  # bfloat16 for GPU
        device_map="auto",  # Auto-map to GPU
        attn_implementation="eager",  # For Gemma-3 stability
        token="TOKEN GOES HERE"  # Your token
    )

    # Add LoRA for efficient fine-tuning
    peft_config = LoraConfig(
        r=32,  # Original rank
        lora_alpha=32,
        lora_dropout=0,
        bias="none",
        target_modules=["q_proj", "k_proj", "v_proj", "o_proj", "gate_proj", "up_proj", "down_proj"],
        use_rslora=True  # For stability
    )
    model = get_peft_model(model, peft_config)
    model.enable_input_require_grads()  # For gradient checkpointing

    # Enable gradient checkpointing
    model.gradient_checkpointing_enable()

    # Calculate max_steps (aim for ~3 epochs)
    effective_batch_size = 2 * 2  # Reduced to avoid OOM
    max_steps = max(60, (len(dataset) // effective_batch_size) * 3)

    # Train on GPU
    trainer = SFTTrainer(
        model=model,
        processing_class=tokenizer,
        train_dataset=dataset,
        args=SFTConfig(
            per_device_train_batch_size=2,  # Reduced to avoid OOM
            gradient_accumulation_steps=2,
            warmup_steps=20,
            max_steps=max_steps,
            learning_rate=2e-4,
            fp16=False,  # Disable to bypass scaler error
            bf16=True,  # Enable bf16 for GPU
            logging_steps=1,
            output_dir="my_personal_ai",
            optim="adamw_torch",  # Standard optimizer
            gradient_checkpointing=True,  # Save memory
            save_steps=100,
            logging_dir="my_personal_ai/logs",
            dataloader_num_workers=4,  # Moderate for GPU
            dataset_text_field="text",  # Moved to SFTConfig
            max_length=2048,  # Fixed name
            packing=False  # Moved to SFTConfig
        )
    )
    trainer.train()

    # Save the fine-tuned LoRA adapter
    model.save_pretrained("my_personal_ai_model")

    # Merge LoRA with base model for easier use
    from peft import PeftModel
    merged_model = PeftModel.from_pretrained(
        AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.bfloat16, device_map="auto", token="TOKEN GOES HERE"),
        "my_personal_ai_model"
    ).merge_and_unload()
    merged_model.save_pretrained("my_personal_ai_merged")

    print("Fine-tuning complete! Model saved to 'my_personal_ai_merged'.")

    # For GGUF export (optional, requires llama.cpp):
    # git clone https://github.com/ggerganov/llama.cpp && cd llama.cpp && make
    # python convert_hf_to_gguf.py --outfile my_personal_ai_gguf.gguf --outtype q8_0 my_personal_ai_merged
```

After setting up CUDA and a number of back-and-forths with gmail_finetune.py to get the settings just right, it took 4 hours of training time, plus the painful process of converting the result to a GGUF with llama.cpp.
I loaded the model up in LM Studio, and it acted bonkers in its replies. Whenever I questioned its name, it took on a different name. It said it was from Toronto, Canada, studied art, etc. Just the far opposite, nothing related to me. The responses would sometimes loop.
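For anyone repeating this, a quick smoke test of the merged model with plain transformers, before the GGUF conversion, might catch this kind of thing earlier. A rough sketch, assuming the folder names and token placeholder from the gist above (the finetune script only saves the merged weights, so the tokenizer still comes from the base model):

```python
# Hypothetical smoke test of the merged model before GGUF conversion (my addition, not from the tutorial).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# The finetune script only saves model weights, so load the tokenizer from the gated base model.
tokenizer = AutoTokenizer.from_pretrained("google/gemma-3-270m", token="TOKEN GOES HERE")
model = AutoModelForCausalLM.from_pretrained("my_personal_ai_merged", torch_dtype=torch.bfloat16, device_map="auto")

# Gemma-style turns, matching the chat template used during training.
prompt = "<start_of_turn>user\nWhat city are you from?<end_of_turn>\n<start_of_turn>model\n"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=100, do_sample=True, temperature=0.7)
print(tokenizer.decode(outputs[0][inputs["input_ids"].shape[-1]:], skip_special_tokens=True))
```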
Recommendations for the tutorial:
Python: if installing on Windows 10, I recommend installing it on the same disk as the OS. It was a nightmare to uninstall, and pip install never worked when Python was not on the OS disk. Lots of PATH environment-variable rework to uninstall and start fresh (a quick interpreter/pip check is sketched below).
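Not from the tutorial, but if it helps anyone, a small check like this confirms which interpreter and pip Windows is actually picking up from PATH before you install anything else:

```python
# Hypothetical PATH sanity check (my addition, not from the tutorial).
import subprocess
import sys

print("Python interpreter:", sys.executable)   # should sit on the disk you intended
print("Version:", sys.version.split()[0])
# 'python -m pip -V' prints pip's version plus the site-packages folder it installs into
subprocess.run([sys.executable, "-m", "pip", "-V"], check=False)
```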
Overall, great work on this guide – it’s inspiring more of us to experiment. Looking forward to any updates or thoughts from the community!
August 25, 2025 at 11:34 pm in reply to: Create Your Own Personal Local AI Expert Using Your X Data #29670
After going through all the steps and attempting to tailor it for Gmail conversations, there is no doubt in my mind that this stuff will eventually be baked into local client apps within the next year or two.
A very easy UI for pointing at the data to build a model from:
GMail.mbox
iCloud messages
Etc.
Without the pain of navigating Python exports and the numerous libraries, packages, and dependencies to make it happen.
I’m thinking maybe it’s just best to wait.
This reply was modified 1 month, 1 week ago by Drumsin.