Drumsin

Forum Replies Created

  • in reply to: Create Your Own Personal Local AI Expert Using Your X Data #29669
    Drumsin
    Participant
      • @drumsin

      A good tutorial to get started and to learn the lingo. My overall result was a failure, but that’s okay. Learned some things along the way.

      I think this tutorial should switch gears from X likes and conversations to something more personal. Brian, your posts on X always come across as very human. I think this tutorial would go a long way if it could be adapted to Gmail and to text message correspondence from iCloud.

      I don’t reply or post anything on X because of the nature of social media and employers. Many companies now have contracts that describe social media as a risk to be avoided. Unless you work for yourself or you’re a major figure in the industry, it makes little sense to voice an opinion on anything publicly. There’s no reason to risk your standing or your job over a post.

      After 15+ years in the web dev industry, much of this stuff is foreign to me, and it’s been quite a humbling experience with this new tech. Not to mention the number of issues I ran into. I’ve been a web developer a long time, but here I was drawing outside the lines with a crayon, trying something different.

      I attempted to use Grok 4 to guide me through reconfiguring the tutorial for Gmail. Disclaimer: I am no Python expert. I provided Grok with the existing code examples (X data) from the tutorial and told it to make them work for Gmail conversations.

      DISCLAIMER BEFORE LOOKING: THESE GISTS DO NOT WORK, BUT THEY MAY GIVE SOME INSIGHT INTO WHY MY MODEL FAILED.

      gmail_prepare_data.py


# Disclaimer: Attempted reworking the tutorial for GMail conversation over X.com.
# MAYBE A STARTING POINT FOR SOMEONE, AFTER SPENDING AN ENTIRE WEEKEND PUTTING IT ALL TOGETHER, THIS DID NOT WORK
# https://readmultiplex.com/2025/08/17/create-your-own-personal-local-ai-expert-using-your-x-data-a-complete-beginners-guide/
# after 4 hours of training model and GMail. The model was all over the place, and did not work.
# Most of what is here was assisted by Grok AI 4
import json
import re
import html
import mailbox
import email.utils
from email.header import decode_header
from datasets import Dataset
from collections import defaultdict
from datetime import datetime
from email.policy import default  # For modern email parsing
from tqdm import tqdm  # For progress bar; install with pip if needed
import mmap  # For memory-mapped file reading

# User configuration – replace with your details
USER_EMAILS = ["myemail@gmail.com", "my.email@gmail.com"]  # Your Gmail addresses to identify sent emails
USER_NAME = "My Name"  # Your name for filtering or display
MBOX_FILE = r"G:\My Path to Mbox\All mail Including Spam and Trash.mbox"  # Path to your Gmail MBOX file from Google Takeout

# Custom MboxReader using mmap for efficient sequential parsing of large files
class MboxReader:
    def __init__(self, filename):
        self.file = open(filename, 'rb')
        self.mm = mmap.mmap(self.file.fileno(), 0, access=mmap.ACCESS_READ)
        self.pos = 0
        line = self._readline()
        if not line.startswith(b'From '):
            raise ValueError("File does not start with 'From ' – not a valid mbox")

    def __enter__(self):
        return self

    def __exit__(self, exc_type, exc_value, exc_traceback):
        self.mm.close()
        self.file.close()

    def __iter__(self):
        return self

    def __next__(self):
        lines = []
        while True:
            line = self._readline()
            if not line or line.startswith(b'From '):
                if lines:
                    return email.message_from_bytes(b''.join(lines), policy=default)
                if not line:
                    raise StopIteration
                continue
            # Handle escaped 'From ' lines
            if line.startswith(b'>From '):
                line = line[1:]
            lines.append(line)

    def _readline(self):
        start = self.pos
        end = self.mm.find(b'\n', start) + 1
        if end == 0:  # No more lines
            line = self.mm[start:]
            self.pos = len(self.mm)
        else:
            line = self.mm[start:end]
            self.pos = end
        return line

# Helper function to safely decode text with fallback
def safe_decode(text, charset):
    if not isinstance(text, bytes):
        return text
    try:
        return text.decode(charset or 'utf-8', errors='ignore')
    except (LookupError, UnicodeDecodeError):
        # Fallback to utf-8 with replace if charset invalid or decode fails
        return text.decode('utf-8', errors='replace')

# Function to decode headers (e.g., subject)
def decode_header_text(header):
    decoded = decode_header(header or "")
    return ''.join([safe_decode(text, charset) for text, charset in decoded])

# Function to strip HTML tags
def strip_html(text):
    text = re.sub(r'<[^>]+>', '', text)
    return html.unescape(text)

# Function to extract plain text body (prefers text/plain, falls back to text/html)
def get_text_body(msg):
    if msg.is_multipart():
        for part in msg.get_payload():
            if part.get_content_type() == 'text/plain':
                return part.get_payload(decode=True).decode(errors='ignore').strip()
            elif part.get_content_type() == 'text/html':
                html_body = part.get_payload(decode=True).decode(errors='ignore')
                return strip_html(html_body).strip()
    else:
        if msg.get_content_type() == 'text/plain':
            return msg.get_payload(decode=True).decode(errors='ignore').strip()
        elif msg.get_content_type() == 'text/html':
            html_body = msg.get_payload(decode=True).decode(errors='ignore')
            return strip_html(html_body).strip()
    return ""

# Function to parse date, making it timezone-naive
def parse_date(date_str):
    try:
        dt = email.utils.parsedate_to_datetime(date_str)
        return dt.replace(tzinfo=None)  # Strip timezone to make naive
    except Exception:
        return datetime.min

# Process messages with progress
messages = {}
id_to_children = defaultdict(list)
all_msg_ids = set()
with MboxReader(MBOX_FILE) as mbox:
    for msg in tqdm(mbox, desc="Parsing emails"):  # Progress bar here
        msgid = msg['Message-ID']
        if not msgid:
            continue
        all_msg_ids.add(msgid)
        from_header = msg['From']
        if not from_header:
            continue
        name, addr = email.utils.parseaddr(from_header)
        name = decode_header_text(name)
        addr = addr.lower()
        to_header = msg['To']
        to_name, to_addr = email.utils.parseaddr(to_header) if to_header else ("", "")
        to_name = decode_header_text(to_name)
        subject = decode_header_text(msg['Subject'])
        body = get_text_body(msg)
        # Skip empty or very short emails early
        if not body or len(body) < 50:
            continue
        # Basic cleaning: remove quoted text (lines starting with >)
        body = re.sub(r'(?m)^>.*\n?', '', body).strip()
        # Skip if still too short
        if len(body) < 50:
            continue
        in_reply_to = msg['In-Reply-To']
        messages[msgid] = {
            "msgid": msgid,
            "from_name": name,
            "from_addr": addr,
            "to_name": to_name,
            "to_addr": to_addr,
            "subject": subject,
            "body": body,
            "date": msg['Date'],
            "parsed_date": parse_date(msg['Date']),
            "in_reply_to": in_reply_to,
            "is_sent": addr in [e.lower() for e in USER_EMAILS] and name == USER_NAME
        }
        if in_reply_to:
            id_to_children[in_reply_to].append(msgid)

# Find root messages (no in_reply_to or in_reply_to not in messages)
roots = [mid for mid in messages if not messages[mid]["in_reply_to"] or messages[mid]["in_reply_to"] not in all_msg_ids]

# Function to build thread recursively
def build_thread(mid, visited=None):
    if visited is None:  # Avoid a shared mutable default across calls
        visited = set()
    if mid in visited:
        return []
    visited.add(mid)
    thread = [messages[mid]]
    for child in sorted(id_to_children[mid], key=lambda c: messages.get(c, {}).get("parsed_date", datetime.min)):
        thread.extend(build_thread(child, visited))
    return thread

# Build all threads with progress
threads = []
for root in tqdm(roots, desc="Building threads"):
    thread_msgs = build_thread(root)
    # Only include threads with at least one sent message by you
    if any(m["is_sent"] for m in thread_msgs):
        threads.append(thread_msgs)

# Format into conversational dataset and save as JSONL
with open('gmail_conversations.jsonl', 'w', encoding='utf-8') as f:
    for thread in tqdm(threads, desc="Formatting and saving data"):
        if len(thread) <= 1:  # Skip single-message threads
            continue
        conversation = []
        thread_subject = thread[0]["subject"]  # Use the root subject
        for msg in thread:
            role = "assistant" if msg["is_sent"] else "user"
            from_name = msg["from_name"] or msg["from_addr"]
            content = f"From: {from_name}\nSubject: {msg['subject']}\n\n{msg['body']}"
            conversation.append({"role": role, "content": content})
        # Write each conversation as a JSON line
        json.dump({
            "conversation": conversation,
            "subject": thread_subject
        }, f)
        f.write('\n')

print("Conversational dataset prepared! Saved to 'gmail_conversations.jsonl'")

      gmail_finetune.py


# Disclaimer: Attempted reworking the tutorial for GMail conversation over X.com.
# MAYBE A STARTING POINT FOR SOMEONE, AFTER SPENDING AN ENTIRE WEEKEND PUTTING IT ALL TOGETHER, THIS DID NOT WORK
# https://readmultiplex.com/2025/08/17/create-your-own-personal-local-ai-expert-using-your-x-data-a-complete-beginners-guide/
# after 4 hours of training model and GMail. The model was all over the place, and did not work.
# Most of what is here was assisted by Grok AI 4
from peft import get_peft_model, LoraConfig
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
from trl import SFTTrainer, SFTConfig
from datasets import load_dataset
from huggingface_hub import login
import torch
import multiprocessing
import os

# Suppress Hugging Face symlink warning (optional)
os.environ["HF_HUB_DISABLE_SYMLINKS_WARNING"] = "true"

model_name = "google/gemma-3-270m"

# Load tokenizer (standard for early mapping)
tokenizer = AutoTokenizer.from_pretrained(
    model_name,
    token="TOKEN GOES HERE"  # Your token for gated model
)

# Set custom chat template for Gemma
tokenizer.chat_template = "{{ bos_token }}{% for message in messages %}{% if message['role'] == 'user' %}<start_of_turn>user\n{{ message['content'] | trim }}<end_of_turn>\n{% elif message['role'] == 'model' %}<start_of_turn>model\n{{ message['content'] | trim }}<end_of_turn>\n{% elif message['role'] == 'system' %}<start_of_turn>system\n{{ message['content'] | trim }}<end_of_turn>\n{% endif %}{% endfor %}{% if add_generation_prompt %}<start_of_turn>model\n{% endif %}"

# Formatting function to apply chat template
def formatting_prompts_func(examples):
    convos = examples["conversation"]
    texts = []
    for convo in convos:
        convo_mapped = [{"role": "model" if msg["role"] == "assistant" else msg["role"], "content": msg["content"]} for msg in convo]
        convo_mapped.insert(0, {"role": "system", "content": "You are [[ Name Goes Here ]], responding in your personal style."})
        text = tokenizer.apply_chat_template(convo_mapped, tokenize=False, add_generation_prompt=False)
        texts.append(text)
    return {"text": texts}

if __name__ == '__main__':
    multiprocessing.freeze_support()
    login(token="TOKEN GOES HERE")  # Your token
    dataset = load_dataset('json', data_files='gmail_conversations.jsonl', split='train')
    dataset = dataset.map(formatting_prompts_func, batched=True, num_proc=12)  # Use your Ryzen's cores for mapping

    # Define quantization config to replace deprecated load_in_4bit
    quantization_config = BitsAndBytesConfig(
        load_in_4bit=True,
        bnb_4bit_compute_dtype=torch.bfloat16,
        bnb_4bit_use_double_quant=True,
        bnb_4bit_quant_type="nf4"
    )

    # Load the base model on GPU with 4-bit quantization
    model = AutoModelForCausalLM.from_pretrained(
        model_name,
        quantization_config=quantization_config,
        torch_dtype=torch.bfloat16,  # bfloat16 for GPU
        device_map="auto",  # Auto-map to GPU
        attn_implementation="eager",  # For Gemma-3 stability
        token="TOKEN GOES HERE"  # Your token
    )

    # Add LoRA for efficient fine-tuning
    peft_config = LoraConfig(
        r=32,  # Original rank
        lora_alpha=32,
        lora_dropout=0,
        bias="none",
        target_modules=["q_proj", "k_proj", "v_proj", "o_proj", "gate_proj", "up_proj", "down_proj"],
        use_rslora=True  # For stability
    )
    model = get_peft_model(model, peft_config)
    model.enable_input_require_grads()  # For gradient checkpointing
    # Enable gradient checkpointing
    model.gradient_checkpointing_enable()

    # Calculate max_steps (aim for ~3 epochs)
    effective_batch_size = 2 * 2  # Reduced to avoid OOM
    max_steps = max(60, (len(dataset) // effective_batch_size) * 3)

    # Train on GPU
    trainer = SFTTrainer(
        model=model,
        processing_class=tokenizer,
        train_dataset=dataset,
        args=SFTConfig(
            per_device_train_batch_size=2,  # Reduced to avoid OOM
            gradient_accumulation_steps=2,
            warmup_steps=20,
            max_steps=max_steps,
            learning_rate=2e-4,
            fp16=False,  # Disable to bypass scaler error
            bf16=True,  # Enable bf16 for GPU
            logging_steps=1,
            output_dir="my_personal_ai",
            optim="adamw_torch",  # Standard optimizer
            gradient_checkpointing=True,  # Save memory
            save_steps=100,
            logging_dir="my_personal_ai/logs",
            dataloader_num_workers=4,  # Moderate for GPU
            dataset_text_field="text",  # Moved to SFTConfig
            max_length=2048,  # Fixed name
            packing=False  # Moved to SFTConfig
        )
    )
    trainer.train()

    # Save the fine-tuned LoRA adapter
    model.save_pretrained("my_personal_ai_model")

    # Merge LoRA with base model for easier use
    from peft import PeftModel
    merged_model = PeftModel.from_pretrained(
        AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.bfloat16, device_map="auto", token="TOKEN GOES HERE"),
        "my_personal_ai_model"
    ).merge_and_unload()
    merged_model.save_pretrained("my_personal_ai_merged")
    print("Fine-tuning complete! Model saved to 'my_personal_ai_merged'.")

# For GGUF export (optional, requires llama.cpp):
# git clone https://github.com/ggerganov/llama.cpp && cd llama.cpp && make
# python convert_hf_to_gguf.py --outfile my_personal_ai_gguf.gguf --outtype q8_0 my_personal_ai_merged

      After setting up CUDA and a number of back-and-forths with gmail_finetune.py to get the settings just right, it took 4 hours of training time plus the painful process of converting the result to a GGUF with llama.cpp.

      I loaded the model up in LM Studio, and it acted bonkers in its replies. When I asked its name, it answered with a different name. It said it was from Toronto, Canada, studied art, etc. The exact opposite of me, nothing related to me at all. Its replies would also sometimes loop.
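
      In hindsight, I should have sanity-checked the merged model in transformers before going through the GGUF conversion. Something like this rough sketch (my own addition, assuming the paths from gmail_finetune.py above) would have surfaced the problem much earlier:

# Rough sketch: quick inference test of the merged model before GGUF conversion.
# Assumes 'my_personal_ai_merged' from the script above; not part of the original tutorial.
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

tokenizer = AutoTokenizer.from_pretrained("google/gemma-3-270m", token="TOKEN GOES HERE")
model = AutoModelForCausalLM.from_pretrained(
    "my_personal_ai_merged", torch_dtype=torch.bfloat16, device_map="auto"
)

prompt = "From: A Friend\nSubject: Catching up\n\nHey, what have you been working on lately?"
inputs = tokenizer.apply_chat_template(
    [{"role": "user", "content": prompt}],
    add_generation_prompt=True,
    return_tensors="pt",
).to(model.device)
outputs = model.generate(inputs, max_new_tokens=150, do_sample=True, temperature=0.7)
print(tokenizer.decode(outputs[0][inputs.shape[-1]:], skip_special_tokens=True))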

      Recommendations for the tutorial:

      Python: on a Windows 10 install, I recommend putting it on the same disk as the OS. It was a nightmare to uninstall, and pip install never worked when Python wasn’t on the OS disk. It took a lot of environment PATH rework to uninstall and start fresh.
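
      Related to that, a small check of my own (not from the tutorial) that helped me confirm python and pip were actually pointing at the same install:

# Quick check that the 'python' and 'pip' you are calling belong to the same install (my own sketch).
import sys
import sysconfig

print("Python executable:", sys.executable)
print("Site-packages dir:", sysconfig.get_paths()["purelib"])
# Compare with the output of `pip --version` in the same terminal.
# If the paths point at different disks or installs, packages will "install" but never import.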

      Overall, great work on this guide – it’s inspiring more of us to experiment. Looking forward to any updates or thoughts from the community!

      in reply to: Create Your Own Personal Local AI Expert Using Your X Data #29670
      Drumsin
      Participant
        • @drumsin

        After going through all the steps and attempting to tailor it for Gmail conversations, there is no doubt in my mind that this stuff will eventually be baked into local client apps within the next year or two.

        A very easy UI to point at the data you want to build a model from:
        Gmail .mbox export
        iCloud messages
        Etc.

        All without the pain of navigating Python, the export files, and the numerous libraries, packages, and dependencies needed to make it happen.

        I’m thinking maybe it’s just best to wait.
