Simone Filosofi — Data Scientist & ML Engineer

Get to know me

About Me

Who I Am

I’m a Data Science MSc student at LUISS Rome by day and a debugger of mysterious model behavior by night. I call it research; my laptop calls it a cry for help.

I hold a BSc in Computer Science & Management. My thesis, where I applied NLP to the GenAI job market, was a fascinating dive into the future of work and mostly confirmed my suspicion: the machines are just as confused about the '5 years of experience in a 2-year-old technology' requirement as we are.

I also survived a Maymester at USC studying Probability Theory, which mostly taught me that unlikely things happen all the time.

Let's Connect

GitHub LinkedIn Email

Technical Skills

Languages

Python SQL R

ML & AI

Machine Learning Scikit-learn XGBoost NLP HuggingFace Transformers Algorithms

LLMs & RAG

LangChain LlamaIndex pgvector

Data

Pandas NumPy SMOTE Jupyter Web Scraping

Databases

MySQL MongoDB Supabase

Dev & Infra

FastAPI React Git Docker CI/CD

Analytics

Power BI NetworkX Data Visualization

My background

Resume

Education

Sep 2025 — Ongoing

MSc Data Science

LUISS University Rome

Advanced Statistics, Machine Learning, Data Science, Data Privacy & Security: pursuing the mathematical foundations of AI at LUISS while concurrently driving technical initiatives for the Google Developers Club.

Sep 2022 — Jun 2025

BSc Computer Science & Management

LUISS University Rome

Thesis: "Decoding the GenAI Workforce: an NLP and ML analysis of evolving U.S. labor market demands, featuring a Healthcare deep dive."

110 cum laude

Read my thesis

Jun 2024

Math 407 — Probability Theory

University of Southern California

Exchange Maymester coursework in advanced probability theory and statistical inference.

Research & Experience

Feb 2026 — Ongoing

Researcher and IT member

Google Developers Club

Developing practical campus solutions, including a browser-based Apple Wallet integration for student badges adopted by the LUISS community. I am also building internal RAG pipelines to streamline knowledge management and operations for the club.

Apr 2025 — Sep 2025

AI Tutor

Make4Work — Rome, Italy

Delivered two full editions of a specialised AI course for schoolteachers, covering foundations through to practical classroom applications. Designed hands-on activities, facilitated live discussions and helped educators move from "AI sounds scary" to "AI is a tool I can actually use" — in six months flat.

Oct 2024 — Dec 2024

Data Analytics Intern

Procter & Gamble

Conducted exploratory data analysis on sales and marketing datasets to identify trends and opportunities. Developed predictive models to support inventory management and pricing strategies. Utilized Python and SQL for data manipulation and visualization.

May 2023 — Mar 2024

Pre-seed AI Engineer

FireGen AI

Partnered with the founder to conceptualize and build the initial MVP from the ground up. I focused on high-level reasoning frameworks and retrieval logic for the core RAG system, while implementing primary API integrations to deliver functional AI solutions for early-stage enterprise testing.

What I've built

Projects

LUISS-badge wallet integration

A tool built for students that turns your LUISS badge QR code into an Apple Wallet pass, so you can tap in from your lock screen without opening an app, logging in, or resetting your password in the rain at 8am. Upload your QR code, add your name, download the pass. That's it.

JavaScript HTML CSS

RAGnarok

Free, multi-user RAG app for PDF Q&A with streamed, citation-backed responses. Features semantic search via HNSW indexing, complete user isolation through JWT + Row-Level Security, and a BYOK Groq API integration — zero-cost hosting, production-grade security.

Python React FastAPI Supabase pgvector Groq HuggingFace

Decoding the GenAI Workforce

BSc thesis. Empirical analysis of 2,726 U.S. job postings (2023–2025) tracking how generative AI reshaped labor demand. Combines TF-IDF + fuzzy matching + SVM for job title standardisation (97% accuracy), LDA topic modelling across 9 clusters, and temporal trend analysis — from "uses ChatGPT" to "builds proprietary GenAI".

Python NLP scikit-learn gensim spaCy Plotly
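
To give a feel for the fuzzy-matching step in the title standardisation, here is a minimal sketch using only the Python standard library. It is not the thesis code: the canonical titles, the 0.6 cutoff, and the function name `standardise_title` are assumptions made for illustration, and the TF-IDF and SVM stages are not shown.

```python
from difflib import get_close_matches

# Hypothetical canonical titles; the thesis's actual label set is not listed here.
CANONICAL_TITLES = ["data scientist", "machine learning engineer", "data analyst"]


def standardise_title(raw_title, cutoff=0.6):
    """Map a raw job title to the closest canonical title, or None if nothing is close."""
    cleaned = raw_title.lower().strip()
    # get_close_matches ranks candidates by difflib's similarity ratio
    # and drops anything scoring below the cutoff.
    matches = get_close_matches(cleaned, CANONICAL_TITLES, n=1, cutoff=cutoff)
    return matches[0] if matches else None
```

Titles that fall below the cutoff come back as `None`, which is roughly where a trained classifier could take over from plain string similarity.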

Brain

Minimal CLI note-taking tool backed by SQLite. Add, search, edit, and delete notes from the terminal — with Rich formatting because plain text is boring. Built because my actual brain is busy forgetting things.

Python SQLite Typer Rich
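
The storage side of a tool like this could look roughly like the sketch below, which uses the standard `sqlite3` module. The table name, column names, and helper functions are hypothetical, chosen for illustration rather than taken from Brain's source, and the Typer and Rich layers are left out.

```python
import sqlite3


def open_db(path=":memory:"):
    """Open the database and create the notes table if it does not exist."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS notes ("
        "id INTEGER PRIMARY KEY AUTOINCREMENT, body TEXT NOT NULL)"
    )
    return conn


def add_note(conn, body):
    """Insert a note and return its id."""
    cur = conn.execute("INSERT INTO notes (body) VALUES (?)", (body,))
    conn.commit()
    return cur.lastrowid


def search_notes(conn, term):
    """Return (id, body) rows whose body contains the term."""
    cur = conn.execute(
        "SELECT id, body FROM notes WHERE body LIKE ? ORDER BY id",
        (f"%{term}%",),
    )
    return cur.fetchall()


def delete_note(conn, note_id):
    """Delete a note by id; return True if a row was removed."""
    cur = conn.execute("DELETE FROM notes WHERE id = ?", (note_id,))
    conn.commit()
    return cur.rowcount == 1
```

Parameterised queries (the `?` placeholders) keep note text from being interpreted as SQL, which matters even for a single-user CLI.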

STLA & WMT Financial Analysis

Comparative financial deep-dive on Stellantis and Walmart. Covers 5-year balance sheet trends, CAPM beta estimation, stock and bond valuation (DDM + comparables), capital structure analysis, and portfolio risk-return optimisation — using historical Yahoo Finance data.

Python Jupyter Pandas CAPM
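
For readers unfamiliar with CAPM beta estimation, the core calculation is the sample covariance of asset and market returns divided by the sample variance of market returns. The sketch below is a plain-Python illustration under that textbook definition, not the project's notebook code; the function names are assumptions. With a constant risk-free rate, raw and excess returns give the same beta.

```python
def capm_beta(asset_returns, market_returns):
    """Estimate beta as Cov(asset, market) / Var(market) using sample moments."""
    if len(asset_returns) != len(market_returns) or len(asset_returns) < 2:
        raise ValueError("need two equal-length series with at least 2 observations")
    n = len(asset_returns)
    mean_a = sum(asset_returns) / n
    mean_m = sum(market_returns) / n
    cov = sum(
        (a - mean_a) * (m - mean_m) for a, m in zip(asset_returns, market_returns)
    ) / (n - 1)
    var_m = sum((m - mean_m) ** 2 for m in market_returns) / (n - 1)
    if var_m == 0:
        raise ValueError("market returns have zero variance")
    return cov / var_m


def capm_expected_return(risk_free, beta, market_return):
    """CAPM expected return: r_f + beta * (E[r_m] - r_f)."""
    return risk_free + beta * (market_return - risk_free)
```

An asset whose returns are exactly twice the market's in every period has a beta of 2 under this estimator, which makes a handy sanity check before running it on real price histories.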

Job Market Analytics Platform

Dual SQL/NoSQL analytics platform over a job listings dataset. MySQL handles relational queries — gender wage gaps, salary by degree, company presence by portal. MongoDB handles document queries — top skills by company, executive analysis by sector. Same problem, two paradigms.

Python MySQL MongoDB PyMongo

Accenture Automotive Analysis

Consulting case study on the viability of a Guaranteed Used Vehicle program for a luxury automotive brand. Analyses depreciation patterns across GT, SUV, and EV segments using ~2,400 French used car listings. Key finding: the program isn't profitable at a 5% margin without repricing.

Python Jupyter Pandas Seaborn

Social Network Analysis — Forrest Gump

Graph analysis of character interactions in Forrest Gump. Implements betweenness, closeness, and decay centrality from scratch, custom PageRank, three community detection algorithms, and link prediction (Jaccard, Adamic-Adar, Resource Allocation). Visualised in Gephi.

Python NetworkX Gephi Graph Theory
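
The three link-prediction scores named above all rank a candidate edge by the neighbours its endpoints share. Here is a minimal from-scratch sketch on a toy graph; the adjacency dict and node labels are placeholders for illustration, not data from the project.

```python
import math

# Toy undirected graph as an adjacency dict; node labels are placeholders,
# not characters or edges from the actual analysis.
GRAPH = {
    "a": {"b", "c", "d"},
    "b": {"a", "c"},
    "c": {"a", "b", "d"},
    "d": {"a", "c", "e"},
    "e": {"d"},
}


def jaccard(graph, u, v):
    """Shared neighbours divided by all neighbours of either node."""
    union = graph[u] | graph[v]
    if not union:
        return 0.0
    return len(graph[u] & graph[v]) / len(union)


def adamic_adar(graph, u, v):
    """Sum of 1 / log(degree) over shared neighbours; rare neighbours count more."""
    return sum(
        1 / math.log(len(graph[w]))
        for w in graph[u] & graph[v]
        if len(graph[w]) > 1  # guard against log(1) = 0
    )


def resource_allocation(graph, u, v):
    """Sum of 1 / degree over shared neighbours."""
    return sum(1 / len(graph[w]) for w in graph[u] & graph[v])
```

Adamic-Adar and Resource Allocation differ only in how sharply they discount high-degree shared neighbours (logarithmically versus linearly).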

Customer Satisfaction Classifier

ML pipeline predicting passenger satisfaction for ThomasTrain without direct feedback. Compares Logistic Regression, Decision Trees, and Random Forest with full hyperparameter tuning. Top finding: boarding experience and travel purpose (leisure vs. business) are the strongest predictors.

Python scikit-learn Random Forest Jupyter

Telecom Churn Analysis

Full ML pipeline to predict customer churn in the US telecom industry. Covers linear models with AIC/BIC stepwise selection, penalised regression (Ridge, Lasso, Elastic Net), non-linear methods (k-NN, GAM, tree ensembles) and clustering (K-Means + Hierarchical) for customer segmentation — all in R.

R Machine Learning Clustering R Markdown
