PostgreSQL教程FG182-PG自定义表采样方法:开发与使用
内容大纲
内容简介
本篇文章介绍PostgreSQL自定义表采样方法的开发与使用,包括表采样概述、API介绍、开发环境搭建、实现步骤、注册与测试、打包与部署等内容,风哥教程参考PostgreSQL官方文档Table Sampling
Methods等相关内容。通过本文的学习,读者将能够掌握自定义表采样方法的开发流程,为特定数据分布或查询需求创建高效的表采样方法。
Part01-基础概念与理论知识
1.1 表采样概述
表采样是PostgreSQL中用于从表中随机采样数据的功能,主要特点:
- 可扩展性:
- PostgreSQL允许开发者创建自定义表采样方法
- 支持为特定数据分布创建专用采样方法
- 灵活的表采样方法API
- 内置表采样方法:
- SYSTEM:基于块的采样,速度快但可能不够随机
- BERNOULLI:基于行的采样,随机性好但速度较慢
- 表采样的用途:
- 快速估算表的统计信息
- 测试查询性能
- 数据分析采样
- 生成测试数据
- 表采样的语法:
- SELECT * FROM table TABLESAMPLE method(argument);
- SELECT * FROM table TABLESAMPLE method(argument) REPEATABLE(seed);
1.2 表采样API
表采样方法的核心API:
- SampleScanBegin:开始表采样扫描
- SampleScanNext:获取下一个采样元组
- SampleScanEnd:结束表采样扫描
- SampleScanRescan:重新开始表采样扫描
- SampleScanExplain:解释表采样扫描
1.3 内置表采样方法
内置表采样方法的比较:
CREATE TABLE fgedu_fgfgfgsales (
id SERIAL PRIMARY KEY,
product VARCHAR(100),
amount NUMERIC,
sale_date DATE
);
— 插入测试数据
INSERT INTO fgedu_fgfgfgsales (product, amount, sale_date)
SELECT
‘Product’ || (random() * 100)::int,
random() * 1000,
‘2023-01-01’::date + (random() * 365)::int
FROM generate_series(1, 1000000);
— 使用SYSTEM采样(1%)
EXPLAIN ANALYZE SELECT * FROM fgedu_fgfgfgsales TABLESAMPLE SYSTEM(1);
— 使用BERNOULLI采样(1%)
EXPLAIN ANALYZE SELECT * FROM fgedu_fgfgfgsales TABLESAMPLE BERNOULLI(1);
— 使用REPEATABLE确保可重复
SELECT * FROM fgedu_fgfgfgsales TABLESAMPLE BERNOULLI(1) REPEATABLE(42);
— 采样后聚合
SELECT count(*) FROM fgedu_fgfgfgsales TABLESAMPLE SYSTEM(10);
Part02-生产环境规划与建议
2.1 自定义采样需求分析
自定义表采样方法的需求分析:
- 数据分布:
- 数据的分布特征是什么
- 是否有特定的采样需求
- 内置采样方法是否满足需求
- 采样目标:
- 需要什么样的随机性
- 采样的性能要求是什么
- 是否需要可重复性
- 使用场景:
- 主要用于什么场景
- 采样的频率如何
- 是否需要参数化
2.2 性能考虑
自定义表采样方法的性能考虑:
- 采样速度:
- 采样算法的时间复杂度
- 是否需要扫描整个表
- I/O开销
- 随机性:
- 随机性的质量
- 样本的代表性
- 偏差控制
- 内存使用:
- 采样过程中的内存开销
- 是否需要预分配内存
- 缓存利用
2.3 使用场景规划
自定义表采样方法的使用场景规划:
- 统计估算:
- 快速估算表的行数
- 估算列的分布
- 估算聚合值
- 数据分析:
- 探索性数据分析
- 数据挖掘采样
- 机器学习训练数据
- 性能测试:
- 查询性能测试
- 索引效果评估
- 配置调优
- 数据生成:
- 生成测试数据
- 数据脱敏
- 数据子集
Part03-生产环境项目实施方案
3.1 开发环境搭建
开发环境搭建的步骤:
apt update
apt install -y postgresql-server-dev-18 build-essential
— RHEL/CentOS
yum install -y postgresql18-devel gcc make
— 验证安装
pg_config –version
pg_config –includedir
pg_config –libdir
wget https://ftp.postgresql.org/pub/source/v18.0/postgresql-18.0.tar.gz
tar xvfz postgresql-18.0.tar.gz
cd postgresql-18.0
— 查看表采样方法示例
ls -la src/backend/access/tablesample/
cat src/backend/access/tablesample/system.c
cat src/backend/access/tablesample/bernoulli.c
3.2 表采样方法实现
简单表采样方法的实现:
#include “postgres.h”
#include “fmgr.h”
#include “access/tablesample.h”
#include “access/relscan.h”
#include “access/heapam.h”
#include “access/htup_details.h”
#include “miscadmin.h”
#include “storage/bufmgr.h”
#include “utils/rel.h”
#include “utils/relcache.h”
PG_MODULE_MAGIC;
— 采样状态结构
typedef struct
{
TableSampleData tsdata;
double percent;
uint32 seed;
BlockNumber nextblock;
HeapScanDesc scan;
} SimpleSampleData;
— 初始化采样方法
Datum
simplesample_init(PG_FUNCTION_ARGS)
{
TableSampleDesc tsdesc = (TableSampleDesc) PG_GETARG_POINTER(0);
List *args = (List *) PG_GETARG_POINTER(1);
uint32 seed = PG_GETARG_UINT32(2);
SimpleSampleData *samplestate;
double percent;
elog(INFO, “Initializing simple sample”);
— 解析参数
if (list_length(args) != 1)
ereport(ERROR,
(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
errmsg(“simple sample needs one argument”)));
percent = DatumGetFloat8(linitial(args));
if (percent < 0 || percent> 100)
ereport(ERROR,
(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
errmsg(“sample percentage must be between 0 and 100”)));
— 分配采样状态
samplestate = (SimpleSampleData *) palloc0(sizeof(SimpleSampleData));
samplestate->percent = percent;
samplestate->seed = seed;
samplestate->nextblock = 0;
— 初始化TableSampleData
samplestate->tsdata.type = T_SimpleSample;
tsdesc->tsdata = &samplestate->tsdata;
PG_RETURN_VOID();
}
— 下一样本
Datum
simplesample_next(PG_FUNCTION_ARGS)
{
TableSampleDesc tsdesc = (TableSampleDesc) PG_GETARG_POINTER(0);
SimpleSampleData *samplestate = (SimpleSampleData *) tsdesc->tsdata;
HeapTuple tuple;
bool found = false;
elog(INFO, “Getting next sample”);
— 如果没有扫描,创建一个
if (samplestate->scan == NULL)
{
samplestate->scan = heap_beginscan(tsdesc->heaprel,
tsdesc->rs_snapshot,
0, NULL);
}
— 循环直到找到采样的元组
while (!found)
{
tuple = heap_getnext(samplestate->scan, ForwardScanDirection);
if (tuple == NULL)
PG_RETURN_BOOL(false);
— 简单的随机采样
if (samplestate->percent > 0 &&
(double) (random() % 100) < samplestate->percent)
{
tsdesc->tuple = tuple;
found = true;
}
}
PG_RETURN_BOOL(true);
}
— 结束采样
Datum
simplesample_end(PG_FUNCTION_ARGS)
{
TableSampleDesc tsdesc = (TableSampleDesc) PG_GETARG_POINTER(0);
SimpleSampleData *samplestate = (SimpleSampleData *) tsdesc->tsdata;
elog(INFO, “Ending simple sample”);
if (samplestate->scan != NULL)
heap_endscan(samplestate->scan);
pfree(samplestate);
PG_RETURN_VOID();
}
— 重新扫描
Datum
simplesample_rescan(PG_FUNCTION_ARGS)
{
TableSampleDesc tsdesc = (TableSampleDesc) PG_GETARG_POINTER(0);
SimpleSampleData *samplestate = (SimpleSampleData *) tsdesc->tsdata;
elog(INFO, “Rescanning simple sample”);
if (samplestate->scan != NULL)
{
heap_rescan(samplestate->scan, NULL);
}
PG_RETURN_VOID();
}
— 注册函数
PG_FUNCTION_INFO_V1(simplesample_init);
PG_FUNCTION_INFO_V1(simplesample_next);
PG_FUNCTION_INFO_V1(simplesample_end);
PG_FUNCTION_INFO_V1(simplesample_rescan);
EXTENSION = simplesample
DATA = simplesample–1.0.sql
PGFILEDESC = “Simple Table Sampling Method”
PG_CONFIG = pg_config
PGXS := $(shell $(PG_CONFIG) –pgxs)
include $(PGXS)
default_version = ‘1.0’
module_pathname = ‘$libdir/simplesample’
relocatable = true
CREATE FUNCTION simplesample_init(internal, internal, int4)
RETURNS void
AS ‘MODULE_PATHNAME’, ‘simplesample_init’
LANGUAGE C STRICT;
CREATE FUNCTION simplesample_next(internal)
RETURNS boolean
AS ‘MODULE_PATHNAME’, ‘simplesample_next’
LANGUAGE C STRICT;
CREATE FUNCTION simplesample_end(internal)
RETURNS void
AS ‘MODULE_PATHNAME’, ‘simplesample_end’
LANGUAGE C STRICT;
CREATE FUNCTION simplesample_rescan(internal)
RETURNS void
AS ‘MODULE_PATHNAME’, ‘simplesample_rescan’
LANGUAGE C STRICT;
— 创建表采样方法
CREATE TABLESAMPLE METHOD simplesample (
INIT = simplesample_init,
NEXT = simplesample_next,
END = simplesample_end,
RESCAN = simplesample_rescan
);
3.3 注册与测试
表采样方法的注册与测试:
make
— 安装
sudo make install
— 验证安装
ls -la $(pg_config –pkglibdir)/simplesample.so
ls -la $(pg_config –sharedir)/extension/simplesample*
— 在PostgreSQL中创建扩展
psql -U pgsql -c “CREATE EXTENSION simplesample;”
— 验证表采样方法
psql -U pgsql -c “\dTsm”
— 测试表采样方法
psql -U pgsql -c “CREATE TABLE fgedu_test_table (id int, data text);”
psql -U pgsql -c “INSERT INTO test_table SELECT i, ‘data’ || i FROM generate_series(1, 10000) i;”
psql -U pgsql -c “SELECT count(*) FROM test_table TABLESAMPLE simplesample(10);”
3.4 打包与部署
表采样方法的打包与部署:
mkdir -p simplesample-1.0
cp simplesample.c Makefile simplesample.control simplesample–1.0.sql README.md simplesample-1.0/
— 创建压缩包
tar czvf simplesample-1.0.tar.gz simplesample-1.0/
— 部署到其他服务器
# 复制压缩包到目标服务器
scp simplesample-1.0.tar.gz fgedu@remote-server:/tmp/
# 在目标服务器上安装
ssh fgedu@remote-server
cd /tmp
tar xzvf simplesample-1.0.tar.gz
cd simplesample-1.0
make
sudo make install
psql -U pgsql -c “CREATE EXTENSION simplesample;”
Part04-生产案例与实战讲解
4.1 简单表采样方法
简单表采样方法的实战案例:
#include “postgres.h”
#include “fmgr.h”
#include “access/tablesample.h”
#include “access/relscan.h”
#include “access/heapam.h”
#include “access/htup_details.h”
#include “miscadmin.h”
#include “storage/bufmgr.h”
#include “utils/rel.h”
#include “utils/relcache.h”
#include “utils/array.h”
PG_MODULE_MAGIC;
— 分层采样状态结构
typedef struct
{
TableSampleData tsdata;
double percent;
uint32 seed;
HeapScanDesc scan;
int num_strata;
int *strata_counts;
int *strata_samples;
int current_stratum;
int current_sample;
} StratifiedSampleData;
— 初始化分层采样
Datum
stratifiedsample_init(PG_FUNCTION_ARGS)
{
TableSampleDesc tsdesc = (TableSampleDesc) PG_GETARG_POINTER(0);
List *args = (List *) PG_GETARG_POINTER(1);
uint32 seed = PG_GETARG_UINT32(2);
StratifiedSampleData *samplestate;
double percent;
int num_strata;
int i;
elog(INFO, “Initializing stratified sample”);
— 解析参数
if (list_length(args) != 2)
ereport(ERROR,
(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
errmsg(“stratified sample needs two arguments”)));
percent = DatumGetFloat8(linitial(args));
num_strata = DatumGetInt32(lsecond(args));
if (percent < 0 || percent> 100)
ereport(ERROR,
(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
errmsg(“sample percentage must be between 0 and 100”)));
if (num_strata < 1) ereport(ERROR, (errcode(ERRCODE_INVALID_PARAMETER_VALUE), errmsg("number of
strata must be at least 1"))); -- 分配采样状态 samplestate=(StratifiedSampleData *)
palloc0(sizeof(StratifiedSampleData)); samplestate->percent = percent;
samplestate->seed = seed;
samplestate->num_strata = num_strata;
samplestate->strata_counts = (int *) palloc0(num_strata * sizeof(int));
samplestate->strata_samples = (int *) palloc0(num_strata * sizeof(int));
samplestate->current_stratum = 0;
samplestate->current_sample = 0;
— 初始化TableSampleData
samplestate->tsdata.type = T_StratifiedSample;
tsdesc->tsdata = &samplestate->tsdata;
— 先扫描一遍表,统计各层的数量
samplestate->scan = heap_beginscan(tsdesc->heaprel,
tsdesc->rs_snapshot,
0, NULL);
while (heap_getnext(samplestate->scan, ForwardScanDirection) != NULL)
{
— 简化处理:使用id的哈希值分层
int id = 0;
int stratum;
— 这里应该从元组中获取实际的分层字段
// …
stratum = abs(id) % num_strata;
samplestate->strata_counts[stratum]++;
}
heap_endscan(samplestate->scan);
samplestate->scan = NULL;
— 计算各层的采样数量
for (i = 0; i < num_strata; i++) { samplestate->strata_samples[i] =
(int) (samplestate->strata_counts[i] * percent / 100.0);
if (samplestate->strata_samples[i] == 0 && samplestate->strata_counts[i] > 0)
samplestate->strata_samples[i] = 1;
}
PG_RETURN_VOID();
}
— 其他函数实现类似…
— …
PG_FUNCTION_INFO_V1(stratifiedsample_init);
— 注册其他函数…
4.2 复杂表采样方法
复杂表采样方法的实战案例:
#include “postgres.h”
#include “fmgr.h”
#include “access/tablesample.h”
#include “access/relscan.h”
#include “access/heapam.h”
#include “access/htup_details.h”
#include “miscadmin.h”
#include “storage/bufmgr.h”
#include “utils/rel.h”
#include “utils/relcache.h”
#include “utils/array.h”
#include “utils/lsyscache.h”
PG_MODULE_MAGIC;
— 权重采样状态结构
typedef struct
{
TableSampleData tsdata;
double percent;
uint32 seed;
HeapScanDesc scan;
AttrNumber weight_attno;
double total_weight;
double *weights;
ItemPointerData *tids;
int num_tuples;
int current_index;
} WeightedSampleData;
— 初始化权重采样
Datum
weightedsample_init(PG_FUNCTION_ARGS)
{
TableSampleDesc tsdesc = (TableSampleDesc) PG_GETARG_POINTER(0);
List *args = (List *) PG_GETARG_POINTER(1);
uint32 seed = PG_GETARG_UINT32(2);
WeightedSampleData *samplestate;
double percent;
char *weight_colname;
AttrNumber weight_attno;
HeapTuple tuple;
Datum weight_datum;
bool isnull;
int i;
elog(INFO, “Initializing weighted sample”);
— 解析参数
if (list_length(args) != 2)
ereport(ERROR,
(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
errmsg(“weighted sample needs two arguments”)));
percent = DatumGetFloat8(linitial(args));
weight_colname = TextDatumGetCString(lsecond(args));
if (percent < 0 || percent> 100)
ereport(ERROR,
(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
errmsg(“sample percentage must be between 0 and 100”)));
— 查找权重列
weight_attno = get_attnum(RelationGetRelid(tsdesc->heaprel), weight_colname);
if (!AttributeNumberIsValid(weight_attno))
ereport(ERROR,
(errcode(ERRCODE_UNDEFINED_COLUMN),
errmsg(“column \”%s\” does not exist”, weight_colname)));
— 分配采样状态
samplestate = (WeightedSampleData *) palloc0(sizeof(WeightedSampleData));
samplestate->percent = percent;
samplestate->seed = seed;
samplestate->weight_attno = weight_attno;
samplestate->total_weight = 0;
samplestate->num_tuples = 0;
samplestate->current_index = 0;
— 初始化TableSampleData
samplestate->tsdata.type = T_WeightedSample;
tsdesc->tsdata = &samplestate->tsdata;
— 先扫描一遍表,收集所有元组和权重
samplestate->scan = heap_beginscan(tsdesc->heaprel,
tsdesc->rs_snapshot,
0, NULL);
— 第一遍:计数
while ((tuple = heap_getnext(samplestate->scan, ForwardScanDirection)) != NULL)
samplestate->num_tuples++;
heap_endscan(samplestate->scan);
— 分配数组
samplestate->weights = (double *) palloc(samplestate->num_tuples * sizeof(double));
samplestate->tids = (ItemPointerData *) palloc(samplestate->num_tuples * sizeof(ItemPointerData));
— 第二遍:收集权重和TID
samplestate->scan = heap_beginscan(tsdesc->heaprel,
tsdesc->rs_snapshot,
0, NULL);
i = 0;
while ((tuple = heap_getnext(samplestate->scan, ForwardScanDirection)) != NULL)
{
weight_datum = heap_getattr(tuple, weight_attno,
RelationGetDescr(tsdesc->heaprel),
&isnull);
if (isnull)
samplestate->weights[i] = 0;
else
samplestate->weights[i] = DatumGetFloat8(weight_datum);
samplestate->total_weight += samplestate->weights[i];
samplestate->tids[i] = tuple->t_self;
i++;
}
heap_endscan(samplestate->scan);
samplestate->scan = NULL;
PG_RETURN_VOID();
}
— 其他函数实现类似…
— …
PG_FUNCTION_INFO_V1(weightedsample_init);
— 注册其他函数…
4.3 表采样方法优化
表采样方法的优化:
#include “common/pg_prng.h”
— 在初始化时设置随机种子
samplestate->rand_state = (pg_prng_state *) palloc(sizeof(pg_prng_state));
pg_prng_seed(samplestate->rand_state, seed);
— 使用更好的随机数
double r = pg_prng_double(samplestate->rand_state);
— 优化2:批量采样
Datum
simplesample_next(PG_FUNCTION_ARGS)
{
TableSampleDesc tsdesc = (TableSampleDesc) PG_GETARG_POINTER(0);
SimpleSampleData *samplestate = (SimpleSampleData *) tsdesc->tsdata;
HeapTuple tuple;
bool found = false;
int batch_size = 100;
int i;
elog(INFO, “Getting next sample (batch)”);
if (samplestate->scan == NULL)
{
samplestate->scan = heap_beginscan(tsdesc->heaprel,
tsdesc->rs_snapshot,
0, NULL);
}
— 批量获取元组
for (i = 0; i < batch_size && !found; i++) { tuple=heap_getnext(samplestate->scan,
ForwardScanDirection);
if (tuple == NULL)
PG_RETURN_BOOL(false);
if (samplestate->percent > 0 &&
(double) (random() % 100) < samplestate->percent)
{
tsdesc->tuple = tuple;
found = true;
}
}
PG_RETURN_BOOL(found);
}
— 优化3:使用索引加速
— 如果表有合适的索引,可以使用索引扫描代替全表扫描
— 这需要更复杂的实现,包括索引访问方法的集成
Part05-风哥经验总结与分享
5.1 开发最佳实践
自定义表采样方法的开发最佳实践:
- 代码组织:
- 遵循PostgreSQL编码规范
- 合理划分函数职责
- 添加详细的注释
- 使用有意义的变量名
- 随机性:
- 使用PostgreSQL内置的随机数生成器
- 确保种子的正确处理
- 测试随机性的质量
- 考虑可重复性
- 性能优化:
- 避免不必要的全表扫描
- 合理使用缓存
- 优化内存使用
- 考虑批量处理
- 错误处理:
- 完善的参数验证
- 合理的错误报告
- 资源清理
- 内存管理
- 测试:
- 单元测试
- 随机性测试
- 性能测试
- 回归测试
5.2 常见问题与解决方案
自定义表采样方法的常见问题与解决方案:
- 问题:编译错误
解决方案:检查PostgreSQL版本兼容性,确保头文件路径正确 - 问题:运行时崩溃
解决方案:添加错误检查,使用调试工具,检查内存管理 - 问题:随机性不好
解决方案:使用更好的随机数生成器,检查种子处理 - 问题:性能差
解决方案:优化算法,减少全表扫描,使用批量处理 - 问题:采样结果不符合预期
解决方案:检查采样逻辑,验证参数处理,测试不同场景
5.3 企业级应用建议
企业级自定义表采样方法的应用建议:
- 需求评估:
- 充分评估是否真的需要自定义采样
- 考虑内置采样方法的扩展性
- 评估开发和维护成本
- 开发流程:
- 先从简单的原型开始
- 逐步完善功能
- 充分测试后再上线
- 维护:
- 定期更新以兼容新的PostgreSQL版本
- 收集使用反馈持续优化
- 建立文档和知识库
- 风险控制:
- 在测试环境充分验证
- 准备回滚方案
- 监控使用情况和性能
更多视频教程www.fgedu.net.cn
学习交流加群风哥微信: itpux-com
学习交流加群风哥QQ113257174
风哥提示:自定义表采样方法是PostgreSQL强大扩展性的体现,适用于需要特定采样策略的场景,建议优先考虑使用或扩展内置采样方法。
更多学习教程公众号风哥教程itpux_com
from PostgreSQL视频:www.itpux.com
本文由风哥教程整理发布,仅用于学习测试使用,转载注明出处:http://www.fgedu.net.cn/10327.html
