Drumsin

Forum Replies Created

  • in reply to: Create Your Own Personal Local AI Expert Using Your X Data #29669
    Drumsin
    Participant
      • @drumsin

      A good tutorial to get started and to learn the lingo. My overall result was a failure, but that’s okay. Learned some things along the way.

      I think this tutorial should switch gears from X likes and conversations to something more personal. Brian, your posts on X always come across as very human. I think this tutorial would go a long way if it could be adapted to Gmail and to text message correspondence from iCloud.

      I don’t reply or post anything on X because of the nature of social media and employers. Many companies now have contracts that describe social media as a risk to be avoided. Unless you work for yourself or you’re a major figure in the industry, it makes little sense to voice an opinion on anything publicly. There’s no reason to risk your standing or your job over a post.

      After 15+ years in the web dev industry, much of this stuff is foreign to me, and it’s been quite a humbling experience with this new tech. Not to mention the number of issues I ran into. I’ve been a web developer a long time, but here I was drawing outside the lines with a crayon, trying something different.

      I attempted to use Grok 4 to guide me through reconfiguring the tutorial for Gmail. Disclaimer: I am no Python expert. I provided Grok with the existing code examples (X data) from the tutorial and told it to make them work for Gmail conversations.

      DISCLAIMER BEFORE LOOKING: THESE GISTS DO NOT WORK, BUT THEY MAY GIVE SOME INSIGHT INTO WHY MY MODEL FAILED.

      gmail_prepare_data.py


# Disclaimer: Attempted reworking the tutorial for GMail conversation over X.com.
# MAYBE A STARTING POINT FOR SOMEONE, AFTER SPENDING AN ENTIRE WEEKEND PUTTING IT ALL TOGETHER, THIS DID NOT WORK
# https://readmultiplex.com/2025/08/17/create-your-own-personal-local-ai-expert-using-your-x-data-a-complete-beginners-guide/
# after 4 hours of training model and GMail. The model was all over the place, and did not work.
# Most of what is here was assisted by Grok AI 4
import json
import re
import html
import mailbox
import email.utils
from email.header import decode_header
from datasets import Dataset
from collections import defaultdict
from datetime import datetime
from email.policy import default  # For modern email parsing
from tqdm import tqdm  # For progress bar; install with pip if needed
import mmap  # For memory-mapped file reading

# User configuration – replace with your details
USER_EMAILS = ["myemail@gmail.com", "my.email@gmail.com"]  # Your Gmail addresses to identify sent emails
USER_NAME = "My Name"  # Your name for filtering or display
MBOX_FILE = r"G:\My Path to Mbox\All mail Including Spam and Trash.mbox"  # Path to your Gmail MBOX file from Google Takeout

# Custom MboxReader using mmap for efficient sequential parsing of large files
class MboxReader:
    def __init__(self, filename):
        self.file = open(filename, 'rb')
        self.mm = mmap.mmap(self.file.fileno(), 0, access=mmap.ACCESS_READ)
        self.pos = 0
        line = self._readline()
        if not line.startswith(b'From '):
            raise ValueError("File does not start with 'From ' – not a valid mbox")

    def __enter__(self):
        return self

    def __exit__(self, exc_type, exc_value, exc_traceback):
        self.mm.close()
        self.file.close()

    def __iter__(self):
        return self

    def __next__(self):
        lines = []
        while True:
            line = self._readline()
            if not line or line.startswith(b'From '):
                if lines:
                    return email.message_from_bytes(b''.join(lines), policy=default)
                if not line:
                    raise StopIteration
                continue
            # Handle escaped 'From ' lines
            if line.startswith(b'>From '):
                line = line[1:]
            lines.append(line)

    def _readline(self):
        start = self.pos
        end = self.mm.find(b'\n', start) + 1
        if end == 0:  # No more lines
            line = self.mm[start:]
            self.pos = len(self.mm)
        else:
            line = self.mm[start:end]
            self.pos = end
        return line

# Helper function to safely decode text with fallback
def safe_decode(text, charset):
    if not isinstance(text, bytes):
        return text
    try:
        return text.decode(charset or 'utf-8', errors='ignore')
    except (LookupError, UnicodeDecodeError):
        # Fallback to utf-8 with replace if charset invalid or decode fails
        return text.decode('utf-8', errors='replace')

# Function to decode headers (e.g., subject)
def decode_header_text(header):
    decoded = decode_header(header or "")
    return ''.join([safe_decode(text, charset) for text, charset in decoded])

# Function to strip HTML tags
def strip_html(text):
    text = re.sub(r'<[^>]+>', '', text)
    return html.unescape(text)

# Function to extract plain text body (prefers text/plain, falls back to text/html)
def get_text_body(msg):
    if msg.is_multipart():
        for part in msg.get_payload():
            if part.get_content_type() == 'text/plain':
                return part.get_payload(decode=True).decode(errors='ignore').strip()
            elif part.get_content_type() == 'text/html':
                html_body = part.get_payload(decode=True).decode(errors='ignore')
                return strip_html(html_body).strip()
    else:
        if msg.get_content_type() == 'text/plain':
            return msg.get_payload(decode=True).decode(errors='ignore').strip()
        elif msg.get_content_type() == 'text/html':
            html_body = msg.get_payload(decode=True).decode(errors='ignore')
            return strip_html(html_body).strip()
    return ""

# Function to parse date, making it timezone-naive
def parse_date(date_str):
    try:
        dt = email.utils.parsedate_to_datetime(date_str)
        return dt.replace(tzinfo=None)  # Strip timezone to make naive
    except Exception:
        return datetime.min

# Process messages with progress
messages = {}
id_to_children = defaultdict(list)
all_msg_ids = set()
with MboxReader(MBOX_FILE) as mbox:
    for msg in tqdm(mbox, desc="Parsing emails"):  # Progress bar here
        msgid = msg['Message-ID']
        if not msgid:
            continue
        all_msg_ids.add(msgid)
        from_header = msg['From']
        if not from_header:
            continue
        name, addr = email.utils.parseaddr(from_header)
        name = decode_header_text(name)
        addr = addr.lower()
        to_header = msg['To']
        to_name, to_addr = email.utils.parseaddr(to_header) if to_header else ("", "")
        to_name = decode_header_text(to_name)
        subject = decode_header_text(msg['Subject'])
        body = get_text_body(msg)
        # Skip empty or very short emails early
        if not body or len(body) < 50:
            continue
        # Basic cleaning: remove quoted text (lines starting with >)
        body = re.sub(r'(?m)^>.*\n?', '', body).strip()
        # Skip if still too short
        if len(body) < 50:
            continue
        in_reply_to = msg['In-Reply-To']
        messages[msgid] = {
            "msgid": msgid,
            "from_name": name,
            "from_addr": addr,
            "to_name": to_name,
            "to_addr": to_addr,
            "subject": subject,
            "body": body,
            "date": msg['Date'],
            "parsed_date": parse_date(msg['Date']),
            "in_reply_to": in_reply_to,
            "is_sent": addr in [e.lower() for e in USER_EMAILS] and name == USER_NAME
        }
        if in_reply_to:
            id_to_children[in_reply_to].append(msgid)

# Find root messages (no in_reply_to or in_reply_to not in messages)
roots = [mid for mid in messages if not messages[mid]["in_reply_to"] or messages[mid]["in_reply_to"] not in all_msg_ids]

# Function to build thread recursively
def build_thread(mid, visited=None):
    if visited is None:  # Avoid a shared mutable default across calls
        visited = set()
    if mid in visited:
        return []
    visited.add(mid)
    thread = [messages[mid]]
    for child in sorted(id_to_children[mid], key=lambda c: messages.get(c, {}).get("parsed_date", datetime.min)):
        thread.extend(build_thread(child, visited))
    return thread

# Build all threads with progress
threads = []
for root in tqdm(roots, desc="Building threads"):
    thread_msgs = build_thread(root)
    # Only include threads with at least one sent message by you
    if any(m["is_sent"] for m in thread_msgs):
        threads.append(thread_msgs)

# Format into conversational dataset and save as JSONL
with open('gmail_conversations.jsonl', 'w', encoding='utf-8') as f:
    for thread in tqdm(threads, desc="Formatting and saving data"):
        if len(thread) <= 1:  # Skip single-message threads
            continue
        conversation = []
        thread_subject = thread[0]["subject"]  # Use the root subject
        for msg in thread:
            role = "assistant" if msg["is_sent"] else "user"
            from_name = msg["from_name"] or msg["from_addr"]
            content = f"From: {from_name}\nSubject: {msg['subject']}\n\n{msg['body']}"
            conversation.append({"role": role, "content": content})
        # Write each conversation as a JSON line
        json.dump({
            "conversation": conversation,
            "subject": thread_subject
        }, f)
        f.write('\n')

print("Conversational dataset prepared! Saved to 'gmail_conversations.jsonl'")

      gmail_finetune.py


# Disclaimer: Attempted reworking the tutorial for GMail conversation over X.com.
# MAYBE A STARTING POINT FOR SOMEONE, AFTER SPENDING AN ENTIRE WEEKEND PUTTING IT ALL TOGETHER, THIS DID NOT WORK
# https://readmultiplex.com/2025/08/17/create-your-own-personal-local-ai-expert-using-your-x-data-a-complete-beginners-guide/
# after 4 hours of training model and GMail. The model was all over the place, and did not work.
# Most of what is here was assisted by Grok AI 4
from peft import get_peft_model, LoraConfig
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
from trl import SFTTrainer, SFTConfig
from datasets import load_dataset
from huggingface_hub import login
import torch
import multiprocessing
import os

# Suppress Hugging Face symlink warning (optional)
os.environ["HF_HUB_DISABLE_SYMLINKS_WARNING"] = "true"

model_name = "google/gemma-3-270m"

# Load tokenizer (standard for early mapping)
tokenizer = AutoTokenizer.from_pretrained(
    model_name,
    token="TOKEN GOES HERE"  # Your token for gated model
)

# Set custom chat template for Gemma
tokenizer.chat_template = "{{ bos_token }}{% for message in messages %}{% if message['role'] == 'user' %}<start_of_turn>user\n{{ message['content'] | trim }}<end_of_turn>\n{% elif message['role'] == 'model' %}<start_of_turn>model\n{{ message['content'] | trim }}<end_of_turn>\n{% elif message['role'] == 'system' %}<start_of_turn>system\n{{ message['content'] | trim }}<end_of_turn>\n{% endif %}{% endfor %}{% if add_generation_prompt %}<start_of_turn>model\n{% endif %}"

# Formatting function to apply chat template
def formatting_prompts_func(examples):
    convos = examples["conversation"]
    texts = []
    for convo in convos:
        convo_mapped = [{"role": "model" if msg["role"] == "assistant" else msg["role"], "content": msg["content"]} for msg in convo]
        convo_mapped.insert(0, {"role": "system", "content": "You are [[ Name Goes Here ]], responding in your personal style."})
        text = tokenizer.apply_chat_template(convo_mapped, tokenize=False, add_generation_prompt=False)
        texts.append(text)
    return {"text": texts}

if __name__ == '__main__':
    multiprocessing.freeze_support()
    login(token="TOKEN GOES HERE")  # Your token
    dataset = load_dataset('json', data_files='gmail_conversations.jsonl', split='train')
    dataset = dataset.map(formatting_prompts_func, batched=True, num_proc=12)  # Use your Ryzen's cores for mapping

    # Define quantization config to replace deprecated load_in_4bit
    quantization_config = BitsAndBytesConfig(
        load_in_4bit=True,
        bnb_4bit_compute_dtype=torch.bfloat16,
        bnb_4bit_use_double_quant=True,
        bnb_4bit_quant_type="nf4"
    )

    # Load the base model on GPU with 4-bit quantization
    model = AutoModelForCausalLM.from_pretrained(
        model_name,
        quantization_config=quantization_config,
        torch_dtype=torch.bfloat16,  # bfloat16 for GPU
        device_map="auto",  # Auto-map to GPU
        attn_implementation="eager",  # For Gemma-3 stability
        token="TOKEN GOES HERE"  # Your token
    )

    # Add LoRA for efficient fine-tuning
    peft_config = LoraConfig(
        r=32,  # Original rank
        lora_alpha=32,
        lora_dropout=0,
        bias="none",
        target_modules=["q_proj", "k_proj", "v_proj", "o_proj", "gate_proj", "up_proj", "down_proj"],
        use_rslora=True  # For stability
    )
    model = get_peft_model(model, peft_config)
    model.enable_input_require_grads()  # For gradient checkpointing
    # Enable gradient checkpointing
    model.gradient_checkpointing_enable()

    # Calculate max_steps (aim for ~3 epochs)
    effective_batch_size = 2 * 2  # Reduced to avoid OOM
    max_steps = max(60, (len(dataset) // effective_batch_size) * 3)

    # Train on GPU
    trainer = SFTTrainer(
        model=model,
        processing_class=tokenizer,
        train_dataset=dataset,
        args=SFTConfig(
            per_device_train_batch_size=2,  # Reduced to avoid OOM
            gradient_accumulation_steps=2,
            warmup_steps=20,
            max_steps=max_steps,
            learning_rate=2e-4,
            fp16=False,  # Disable to bypass scaler error
            bf16=True,  # Enable bf16 for GPU
            logging_steps=1,
            output_dir="my_personal_ai",
            optim="adamw_torch",  # Standard optimizer
            gradient_checkpointing=True,  # Save memory
            save_steps=100,
            logging_dir="my_personal_ai/logs",
            dataloader_num_workers=4,  # Moderate for GPU
            dataset_text_field="text",  # Moved to SFTConfig
            max_length=2048,  # Fixed name
            packing=False  # Moved to SFTConfig
        )
    )
    trainer.train()

    # Save the fine-tuned LoRA adapter
    model.save_pretrained("my_personal_ai_model")

    # Merge LoRA with base model for easier use
    from peft import PeftModel
    merged_model = PeftModel.from_pretrained(
        AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.bfloat16, device_map="auto", token="TOKEN GOES HERE"),
        "my_personal_ai_model"
    ).merge_and_unload()
    merged_model.save_pretrained("my_personal_ai_merged")
    print("Fine-tuning complete! Model saved to 'my_personal_ai_merged'.")

# For GGUF export (optional, requires llama.cpp):
# git clone https://github.com/ggerganov/llama.cpp && cd llama.cpp && make
# python convert_hf_to_gguf.py --outfile my_personal_ai_gguf.gguf --outtype q8_0 my_personal_ai_merged

      After setting up CUDA and a number of back-and-forths with gmail_finetune.py to get the settings just right, it took 4 hours of training time plus the painful process of converting the result to a GGUF with llama.cpp.

      I loaded the model up in LM Studio, and it acted bonkers in its replies. When I asked its name, it answered with a different name. It said it was from Toronto, Canada, studied art, etc. The exact opposite of me, nothing related to me at all. Its replies would also sometimes loop.
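
      In hindsight, I should have sanity-checked the merged model in transformers before going through the GGUF conversion. Something like this rough sketch (my own addition, assuming the paths from gmail_finetune.py above) would have surfaced the problem much earlier:

# Rough sketch: quick inference test of the merged model before GGUF conversion.
# Assumes 'my_personal_ai_merged' from the script above; not part of the original tutorial.
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

tokenizer = AutoTokenizer.from_pretrained("google/gemma-3-270m", token="TOKEN GOES HERE")
model = AutoModelForCausalLM.from_pretrained(
    "my_personal_ai_merged", torch_dtype=torch.bfloat16, device_map="auto"
)

prompt = "From: A Friend\nSubject: Catching up\n\nHey, what have you been working on lately?"
inputs = tokenizer.apply_chat_template(
    [{"role": "user", "content": prompt}],
    add_generation_prompt=True,
    return_tensors="pt",
).to(model.device)
outputs = model.generate(inputs, max_new_tokens=150, do_sample=True, temperature=0.7)
print(tokenizer.decode(outputs[0][inputs.shape[-1]:], skip_special_tokens=True))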

      Recommendations for the tutorial:

      Python: on a Windows 10 install, I recommend putting it on the same disk as the OS. It was a nightmare to uninstall, and pip install never worked when Python wasn’t on the OS disk. It took a lot of environment PATH rework to uninstall and start fresh.
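
      Related to that, a small check of my own (not from the tutorial) that helped me confirm python and pip were actually pointing at the same install:

# Quick check that the 'python' and 'pip' you are calling belong to the same install (my own sketch).
import sys
import sysconfig

print("Python executable:", sys.executable)
print("Site-packages dir:", sysconfig.get_paths()["purelib"])
# Compare with the output of `pip --version` in the same terminal.
# If the paths point at different disks or installs, packages will "install" but never import.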

      Overall, great work on this guide – it’s inspiring more of us to experiment. Looking forward to any updates or thoughts from the community!

      in reply to: Create Your Own Personal Local AI Expert Using Your X Data #29670
      Drumsin
      Participant
        • @drumsin

        After going through all the steps and attempting to tailor it for Gmail conversations, there is no doubt in my mind that this stuff will eventually be baked into local client apps within the next year or two.

        A very easy UI to point at the data you want to build a model from:
        Gmail .mbox export
        iCloud messages
        Etc.

        All without the pain of navigating Python, the export files, and the numerous libraries, packages, and dependencies needed to make it happen.

        I’m thinking maybe it’s just best to wait.
