
Datanalysis Credit Risk
Run credit-risk variable selection, PSI stability checks, and LightGBM AUC screening on tabular loan or risk datasets before you ship a scoring model.
Overview
Datanalysis Credit Risk is an agent skill for the Validate phase that automates credit-risk variable selection, cohort month filtering, and LightGBM AUC evaluation on tabular data.
Install
npx skills add https://github.com/github/awesome-copilot --skill datanalysis-credit-risk
What is this skill?
- Filters abnormal year-month buckets by minimum bad and total sample counts
- Reuses PSI calculation from shared func.py while analysis.py handles variable selection
- Exports styled Excel workbooks via openpyxl for audit-friendly review
- Trains LightGBM with parallel joblib workers and reports ROC-AUC on holdout data
- Uses toad for credit-risk style feature screening on pandas DataFrames
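The PSI check mentioned above lives in func.py, which is not shown on this page. As a rough illustration only, the sketch below uses the standard Population Stability Index formula on pre-binned counts. The function name `psi`, the `eps` guard, and the toy counts are assumptions for illustration, not the skill's own implementation.

```python
import math


def psi(expected_counts, actual_counts, eps=1e-6):
    """Standard PSI over pre-binned counts (hypothetical helper, not from func.py)."""
    exp_total = sum(expected_counts)
    act_total = sum(actual_counts)
    total = 0.0
    for e, a in zip(expected_counts, actual_counts):
        # Convert counts to bin shares; eps guards against log(0) on empty bins.
        e_pct = max(e / exp_total, eps)
        a_pct = max(a / act_total, eps)
        total += (a_pct - e_pct) * math.log(a_pct / e_pct)
    return total


# Toy bin counts for a baseline month and a later month (made-up numbers).
baseline = [100, 200, 400, 200, 100]
recent = [90, 210, 380, 220, 100]
print(round(psi(baseline, recent), 4))
```

A value near zero means the two distributions are close. Which cut-off counts as "unstable" is a modeling choice, and this page does not state one.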
Adoption & trust: 7k installs on skills.sh; 34.6k GitHub stars; 3/3 security scanners passed (skills.sh audits).
What problem does it solve?
You have raw monthly credit data but no consistent way to drop thin cohorts, screen variables, and quantify model lift before committing to a production scorecard.
Who is it for?
Indie fintech or internal analytics builders with pandas-ready extracts, a binary target, and new_date_ym-style time keys who need a structured ML prep ritual.
Skip if: you have no labeled credit data, you only need regulated model documentation (no Python execution), or you are building a pure frontend demo with no tabular risk feed.
When should I use this skill?
You have monthly credit or lending data with a binary target and need automated month filtering, variable screening, and AUC evaluation before production modeling.
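A minimal sketch of the month filter, assuming a DataFrame with the `new_date_ym` and `new_target` columns named on this page. The helper name `filter_thin_months` and the toy data are hypothetical. It is a simplified version of the rule (drop any month with too few bad samples or too few rows), not the skill's full function.

```python
import pandas as pd


def filter_thin_months(df, min_bad=1, min_total=500):
    """Drop year-month buckets below the bad-count or total-count minimum."""
    stats = df.groupby("new_date_ym").agg(
        bad_cnt=("new_target", "sum"),
        total=("new_target", "count"),
    ).reset_index()
    # A month is flagged when either minimum is not met.
    thin = stats[(stats["bad_cnt"] < min_bad) | (stats["total"] < min_total)]
    kept = df[~df["new_date_ym"].isin(thin["new_date_ym"])]
    return kept, thin.reset_index(drop=True)


# Toy data: 202402 has too few rows, 202403 has no bad samples.
toy = pd.DataFrame({
    "new_date_ym": ["202401"] * 4 + ["202402"] * 2 + ["202403"] * 4,
    "new_target": [0, 1, 0, 0, 0, 1, 0, 0, 0, 0],
})
kept, thin = filter_thin_months(toy, min_bad=1, min_total=3)
print(sorted(kept["new_date_ym"].unique()))
print(thin)
```

Returning the flagged months alongside the filtered data keeps a record of why each month was dropped, which is what makes the step reviewable later.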
What do I get? / Deliverables
You get cleaned cohorts, abnormal-month documentation, screened feature sets, and AUC-backed evidence to decide whether to proceed to full build and deployment workflows.
- Filtered modeling DataFrame and abnormal-month report
- Variable selection output suitable for Excel review
- Holdout ROC-AUC metric from LightGBM training
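To make the holdout ROC-AUC deliverable concrete, here is a small sketch of what the metric measures: the share of (bad, good) pairs in which the bad sample gets the higher score, with ties counted as half. The skill itself imports `roc_auc_score` from scikit-learn; the pure-Python `roc_auc` helper and the toy scores below are illustrative assumptions only.

```python
def roc_auc(y_true, y_score):
    """Pairwise ROC-AUC: P(score of a bad sample > score of a good sample)."""
    pos = [s for y, s in zip(y_true, y_score) if y == 1]
    neg = [s for y, s in zip(y_true, y_score) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one bad and one good sample")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5  # ties count as half a win
    return wins / (len(pos) * len(neg))


# Toy holdout labels and model scores (made-up numbers).
holdout_labels = [0, 0, 1, 1]
holdout_scores = [0.1, 0.4, 0.35, 0.8]
print(roc_auc(holdout_labels, holdout_scores))
```

The quadratic loop is fine for illustration. On real holdout sets a library routine such as scikit-learn's `roc_auc_score` is the practical choice.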
Journey fit
Canonical shelf is Validate because the skill proves whether features and months are stable enough to trust before full product build and production scoring. Prototype fits model-variable iteration, train/test splits, and ROC-AUC evaluation on sample cohorts—not polished app UI.
How it compares
Use instead of ad-hoc Jupyter chat for PSI and IV-style screening when you want a fixed procedural skill rather than generic data-science Q&A.
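For readers unfamiliar with IV-style screening, the sketch below computes Information Value from pre-binned good and bad counts using the textbook weight-of-evidence formula. The skill delegates this to a `calculate_iv` helper that is not shown on this page, so the function name `information_value`, the `eps` guard, and the toy counts are assumptions for illustration.

```python
import math


def information_value(good_counts, bad_counts, eps=1e-6):
    """Textbook IV: sum over bins of (good% - bad%) * ln(good% / bad%)."""
    good_total = sum(good_counts)
    bad_total = sum(bad_counts)
    iv = 0.0
    for g, b in zip(good_counts, bad_counts):
        # Bin shares of goods and bads; eps avoids division by zero and log(0).
        g_pct = max(g / good_total, eps)
        b_pct = max(b / bad_total, eps)
        woe = math.log(g_pct / b_pct)  # weight of evidence for this bin
        iv += (g_pct - b_pct) * woe
    return iv


# Toy counts for one feature split into three bins (made-up numbers).
goods = [400, 300, 300]
bads = [10, 30, 60]
print(round(information_value(goods, bads), 4))
```

A feature whose bins all have the same good-to-bad mix scores zero, so a low IV suggests the feature separates goods from bads poorly.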
Common Questions / FAQ
Who is datanalysis-credit-risk for?
Solo and indie builders shipping lending, fraud, or credit-analytics prototypes who want an agent to run cohort filters, variable selection, and LightGBM AUC checks on structured monthly data.
When should I use datanalysis-credit-risk?
During Validate when you are proving a scoring approach: after you have extracts but before you harden APIs and production monitors—especially when you need PSI-oriented stability and minimum sample rules on year-month buckets.
Is datanalysis-credit-risk safe to install?
Treat it like any third-party skill that runs Python on your data: review the Security Audits panel on this Prism page, inspect the repo, and run only on anonymized or permitted datasets in isolated environments.
SKILL.md
"""Variable selection and analysis module - simplified version.

PSI calculation is reused in func.py; analysis.py only handles variable selection.
"""
import pandas as pd
import numpy as np
import toad
from typing import List, Dict, Tuple
from openpyxl import Workbook
from openpyxl.styles import Font, PatternFill, Alignment
from datetime import datetime
import lightgbm as lgb
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_auc_score
from joblib import Parallel, delayed

# Output column names are kept in Chinese as in the source:
#   '年月' = year-month, '去除条件' = removal reason, '整体缺失率' = overall missing rate,
#   '缺失率' = missing rate, '变量' = variable, 'IV值' = IV value, '机构' = organization


def drop_abnormal_ym(data: pd.DataFrame, min_ym_bad_sample: int = 1,
                     min_ym_sample: int = 500) -> tuple:
    """Filter abnormal months - overall statistics, not by organization"""
    # Count bad samples and total samples per year-month bucket
    stat = data.groupby('new_date_ym').agg(
        bad_cnt=('new_target', 'sum'),
        total=('new_target', 'count')
    ).reset_index()

    # A month is abnormal if either minimum is not met
    abnormal = stat[(stat['bad_cnt'] < min_ym_bad_sample) | (stat['total'] < min_ym_sample)]
    abnormal = abnormal.rename(columns={'new_date_ym': '年月'})
    abnormal['去除条件'] = abnormal.apply(
        lambda x: f'bad sample count {x["bad_cnt"]} less than {min_ym_bad_sample}'
        if x['bad_cnt'] < min_ym_bad_sample
        else f'total sample count {x["total"]} less than {min_ym_sample}',
        axis=1
    )

    if len(abnormal) > 0:
        data = data[~data['new_date_ym'].isin(abnormal['年月'])]

    # Remove empty rows
    abnormal = abnormal.dropna(how='all')
    abnormal = abnormal.reset_index(drop=True)

    return data, abnormal


def drop_highmiss_features(data: pd.DataFrame, miss_channel: pd.DataFrame,
                           threshold: float = 0.6) -> tuple:
    """Drop high missing rate features"""
    high_miss = miss_channel[miss_channel['整体缺失率'] > threshold].copy()
    high_miss['缺失率'] = high_miss['整体缺失率']

    # Modify removal condition to show specific missing rate value
    high_miss['去除条件'] = high_miss.apply(
        lambda x: f'overall missing rate is {x["缺失率"]:.4f}, exceeds threshold {threshold}',
        axis=1
    )

    # Remove empty rows
    high_miss = high_miss.dropna(how='all')
    high_miss = high_miss.reset_index(drop=True)

    # Drop high missing rate features
    if len(high_miss) > 0 and '变量' in high_miss.columns:
        to_drop = high_miss['变量'].tolist()
        data = data.drop(columns=[c for c in to_drop if c in data.columns])

    return data, high_miss[['变量', '缺失率', '去除条件']]


def drop_lowiv_features(data: pd.DataFrame, features: List[str],
                        overall_iv_threshold: float = 0.05,
                        org_iv_threshold: float = 0.02,
                        max_org_threshold: int = 8,
                        n_jobs: int = 4) -> tuple:
    """Drop low IV features - multi-process version, returns IV details and IV processing table

    Args:
        overall_iv_threshold: Overall IV threshold, values below this are recorded in IV processing table
        org_iv_threshold: Single organization IV threshold, values below this are considered not satisfied
        max_org_threshold: Maximum tolerated organization count, if more than this number of
            organizations have IV below threshold, record in IV processing table

    Returns:
        data: Data after dropping
        iv_detail: IV details (IV value of each feature in each organization and overall)
        iv_process: IV processing table (features that do not meet the conditions)
    """
    from references.func import calculate_iv
    from joblib import Parallel, delayed

    orgs = sorted(data['new_org'].unique())

    print(f" IV calculation: feature count={len(features)}, organization count={len(orgs)}")

    # Calculate IV values for all organizations at once
    def _calc_org_iv(org):
        org_data = data[data['new_org'] == org]
        org_iv = calculate_iv(org_data, features, n_jobs=1)
        if len(org_iv) > 0:
            org_iv = org_iv.rename(columns={'IV': 'IV值'})
            org_iv['机构'] = org
            return org_iv
        return None

    # Calculate overall IV
    print(f"